IncrementalIndex
The IncrementalIndex class has two important members, metricDescs and dimensionDescs:
private final Map<String, MetricDesc> metricDescs;
private final Map<String, DimensionDesc> dimensionDescs;
Both maps are initialized in IncrementalIndex's constructor.
MetricDesc
Each MetricDesc has several important members:
private final int index; // ordinal of the metric
private final String name; // metric name
private final String type; // metric type
private final ColumnCapabilitiesImpl capabilities; // metric capabilities
The MetricDesc constructor:
public MetricDesc(int index, AggregatorFactory factory)
{
  this.index = index;
  this.name = factory.getName();

  String typeInfo = factory.getTypeName();
  this.capabilities = new ColumnCapabilitiesImpl();
  if ("float".equalsIgnoreCase(typeInfo)) {
    capabilities.setType(ValueType.FLOAT);
    this.type = typeInfo;
  } else if ("long".equalsIgnoreCase(typeInfo)) {
    capabilities.setType(ValueType.LONG);
    this.type = typeInfo;
  } else if ("double".equalsIgnoreCase(typeInfo)) {
    capabilities.setType(ValueType.DOUBLE);
    this.type = typeInfo;
  } else {
    capabilities.setType(ValueType.COMPLEX);
    this.type = ComplexMetrics.getSerdeForType(typeInfo).getTypeName();
  }
}
Every AggregatorFactory implementation must provide a type name through the getTypeName() method declared on the interface. For example, CountAggregatorFactory's getTypeName() returns "long", while HyperUniquesAggregatorFactory's returns "hyperUnique". When getTypeName() returns something other than the plain types "float", "long", and "double", the metric is a complex, user-defined type (for example "hyperUnique" from HyperUniquesAggregatorFactory, or a DataSketches type); in that case the canonical type name is resolved through the serde registered in ComplexMetrics.
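The type dispatch in the constructor above can be sketched as a standalone helper. Everything here is a simplified illustration: `resolveValueType` is a hypothetical method, the enum is a local stand-in for Druid's ValueType, and the complex branch returns COMPLEX directly instead of consulting ComplexMetrics.

```java
// Minimal sketch of MetricDesc's type-name dispatch; resolveValueType and the
// local ValueType enum are illustrative, not Druid's API.
public class TypeDispatchDemo {
  enum ValueType { FLOAT, LONG, DOUBLE, COMPLEX }

  static ValueType resolveValueType(String typeInfo) {
    if ("float".equalsIgnoreCase(typeInfo)) {
      return ValueType.FLOAT;
    } else if ("long".equalsIgnoreCase(typeInfo)) {
      return ValueType.LONG;
    } else if ("double".equalsIgnoreCase(typeInfo)) {
      return ValueType.DOUBLE;
    } else {
      // Anything else ("hyperUnique", sketch types, ...) is complex;
      // Druid would look up the registered serde in ComplexMetrics here.
      return ValueType.COMPLEX;
    }
  }

  public static void main(String[] args) {
    // CountAggregatorFactory.getTypeName() returns "long"
    System.out.println(resolveValueType("long"));        // LONG
    // HyperUniquesAggregatorFactory.getTypeName() returns "hyperUnique"
    System.out.println(resolveValueType("hyperUnique")); // COMPLEX
  }
}
```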
IncrementalIndex builds a MetricDesc for each metric with the following code:
for (AggregatorFactory metric : metrics) {
  MetricDesc metricDesc = new MetricDesc(metricDescs.size(), metric);
  metricDescs.put(metricDesc.getName(), metricDesc);
}
DimensionDesc
Each DimensionDesc has several important members:
private final int index; // ordinal of the dimension
private final String name; // dimension name
private final ColumnCapabilitiesImpl capabilities; // dimension capabilities
private final DimensionHandler handler;
private final DimensionIndexer indexer;
DimensionHandler
A DimensionHandler encapsulates the indexing, column merging and building, and querying operations specific to one dimension. These operations are carried out by objects the handler creates: a DimensionIndexer (via makeIndexer()), a DimensionMerger (via makeMerger()), and a DimensionColumnReader. Each handler object is bound to a single dimension.
DimensionIndexer
Each dimension has its own DimensionIndexer, which processes that dimension's value of each ingested row in memory.
ColumnCapabilitiesImpl
The IncrementalIndex constructor defines each dimension's capabilities:
private ColumnCapabilitiesImpl makeCapabilitiesFromValueType(ValueType type) {
  ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl();
  capabilities.setDictionaryEncoded(type == ValueType.STRING);
  capabilities.setHasBitmapIndexes(type == ValueType.STRING);
  capabilities.setType(type);
  return capabilities;
}
As the code shows, only string-typed dimensions support dictionary encoding (and bitmap indexes).
A DimensionHandler is then created according to the capabilities:
DimensionHandler handler = DimensionHandlerUtils.getHandlerFromCapabilities(
    dimName,
    capabilities,
    dimSchema.getMultiValueHandling()
);
addNewDimension(dimName, capabilities, handler);
In practice the handler is usually StringDimensionHandler, since string-typed data dominates real workloads.
Writing a row into IncrementalIndex
An incoming row first goes through parseBatch and related parsing steps, which produce a MapBasedInputRow; then index.add(InputRow row) is called to add the data and kick off the processing described below.
For the value of one dimension column in a row, the following is called:
Object dimsKey = indexer.processRowValsToUnsortedEncodedKeyComponent(
    row.getRaw(dimension),
    reportParseExceptions
);
Here row.getRaw(dimension) extracts the value of the dimension column from the row. The indexer is the DimensionIndexer member of that dimension's DimensionDesc; for string dimensions its concrete type is StringDimensionIndexer. StringDimensionIndexer's processRowValsToUnsortedEncodedKeyComponent essentially maintains a value-to-id dictionary:
private final Object2IntMap<String> valueToId = new Object2IntOpenHashMap<>();
private final List<String> idToValue = new ArrayList<>();
valueToId maps each string value to an int code, while idToValue maps each id back to its value.
The processRowValsToUnsortedEncodedKeyComponent method
DimensionIndexer declares several methods for encoding incoming data; this one processes the raw values. If the incoming value is null, a null entry is recorded in the dictionary and null is returned; otherwise the encoded value(s) are returned:
final int[] encodedDimensionValues;
final int oldDictSize = dimLookup.size();
if (dimValues == null) {
  dimLookup.add(null);
  encodedDimensionValues = null;
} else if (dimValues instanceof List) {
  List<Object> dimValuesList = (List) dimValues;
  if (dimValuesList.isEmpty()) {
    dimLookup.add(null);
    encodedDimensionValues = EMPTY_INT_ARRAY;
  } else if (dimValuesList.size() == 1) {
    encodedDimensionValues = new int[]{dimLookup.add(STRING_TRANSFORMER.apply(dimValuesList.get(0)))};
  } else {
    final String[] dimensionValues = new String[dimValuesList.size()];
    for (int i = 0; i < dimValuesList.size(); i++) {
      dimensionValues[i] = STRING_TRANSFORMER.apply(dimValuesList.get(i));
    }
    if (multiValueHandling.needSorting()) {
      // Sort multival row by their unencoded values first.
      Arrays.sort(dimensionValues, Comparators.naturalNullsFirst());
    }

    final int[] retVal = new int[dimensionValues.length];

    int prevId = -1;
    int pos = 0;
    for (String dimensionValue : dimensionValues) {
      if (multiValueHandling != MultiValueHandling.SORTED_SET) {
        retVal[pos++] = dimLookup.add(dimensionValue);
        continue;
      }
      int index = dimLookup.add(dimensionValue);
      if (index != prevId) {
        prevId = retVal[pos++] = index;
      }
    }

    encodedDimensionValues = pos == retVal.length ? retVal : Arrays.copyOf(retVal, pos);
  }
} else {
  encodedDimensionValues = new int[]{dimLookup.add(STRING_TRANSFORMER.apply(dimValues))};
}
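In the multi-value branch above, SORTED_SET handling sorts the unencoded values and then drops duplicates by comparing each dictionary id to the previous one: after sorting, equal values sit on adjacent positions and get equal ids. That dedup step can be sketched in isolation (the class and method names are illustrative, the dictionary is a plain HashMap standing in for dimLookup, and null handling is omitted):

```java
import java.util.Arrays;

// Sketch of the SORTED_SET dedup inside processRowValsToUnsortedEncodedKeyComponent:
// sort first, then skip any value whose id equals the previous id.
public class SortedSetDedupDemo {
  static int[] encodeSortedSet(String[] values, java.util.Map<String, Integer> dict) {
    String[] sorted = values.clone();
    Arrays.sort(sorted); // sort by unencoded values, as in the Druid code
    int[] retVal = new int[sorted.length];
    int prevId = -1;
    int pos = 0;
    for (String v : sorted) {
      // Stand-in for dimLookup.add: assign the next id on first sight.
      int id = dict.computeIfAbsent(v, k -> dict.size());
      if (id != prevId) {
        prevId = retVal[pos++] = id;
      }
    }
    // Trim the array if duplicates were dropped.
    return pos == retVal.length ? retVal : Arrays.copyOf(retVal, pos);
  }

  public static void main(String[] args) {
    java.util.Map<String, Integer> dict = new java.util.HashMap<>();
    // {"b","a","b","c"} sorts to {a,b,b,c}; the duplicate b is dropped.
    int[] encoded = encodeSortedSet(new String[]{"b", "a", "b", "c"}, dict);
    System.out.println(Arrays.toString(encoded)); // [0, 1, 2]
  }
}
```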
The method ultimately returns, for this row's value in the given column, its id in the valueToId mapping, which is also its index into idToValue.
Each dimension has a DimensionDesc; each DimensionDesc holds a DimensionIndexer; each DimensionIndexer holds a DimensionDictionary; and each DimensionDictionary holds a valueToId map and an idToValue list.
Suppose nine rows arrive whose values in dimension dim are 'a','b','c','d','e','a','a','b','f'. After processRowValsToUnsortedEncodedKeyComponent has been called for all nine rows, idToValue contains [a,b,c,d,e,f], valueToId contains [a→0, b→1, c→2, d→3, e→4, f→5], and the successive return values are 0,1,2,3,4,0,0,1,5. In other words, each column maintains its own dictionary used to encode and index its data; working with ints instead of strings is compact and efficient.
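The per-column dictionary described above can be sketched with a minimal encoder. The class and method names here are illustrative, not Druid's (Druid uses an Object2IntOpenHashMap and synchronizes access; a HashMap suffices for the sketch):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a per-column dictionary in the style of
// StringDimensionIndexer's valueToId / idToValue pair.
public class ColumnDictionary {
  private final Map<String, Integer> valueToId = new HashMap<>();
  private final List<String> idToValue = new ArrayList<>();

  // Return the existing id for a value, or assign the next free id.
  public int add(String value) {
    Integer id = valueToId.get(value);
    if (id != null) {
      return id;
    }
    int newId = idToValue.size();
    valueToId.put(value, newId);
    idToValue.add(value);
    return newId;
  }

  public String getValue(int id) {
    return idToValue.get(id);
  }

  public static void main(String[] args) {
    ColumnDictionary dict = new ColumnDictionary();
    String[] rows = {"a", "b", "c", "d", "e", "a", "a", "b", "f"};
    int[] encoded = new int[rows.length];
    for (int i = 0; i < rows.length; i++) {
      encoded[i] = dict.add(rows[i]);
    }
    // Reproduces the example from the text: [0, 1, 2, 3, 4, 0, 0, 1, 5]
    System.out.println(java.util.Arrays.toString(encoded));
  }
}
```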
In IncrementalIndex, toTimeAndDims converts a raw input row into a TimeAndDims wrapper:
@VisibleForTesting
TimeAndDims toTimeAndDims(InputRow row) {
  row = formatRow(row);
  if (row.getTimestampFromEpoch() < minTimestamp) {
    throw new IAE("Cannot add row[%s] because it is below the minTimestamp[%s]", row, DateTimes.utc(minTimestamp));
  }

  final List<String> rowDimensions = row.getDimensions();

  Object[] dims;
  List<Object> overflow = null;
  synchronized (dimensionDescs) {
    dims = new Object[dimensionDescs.size()];
    for (String dimension : rowDimensions) {
      boolean wasNewDim = false;
      ColumnCapabilitiesImpl capabilities;
      DimensionDesc desc = dimensionDescs.get(dimension);
      if (desc != null) {
        capabilities = desc.getCapabilities();
      } else {
        wasNewDim = true;
        capabilities = columnCapabilities.get(dimension);
        if (capabilities == null) {
          capabilities = new ColumnCapabilitiesImpl();
          // For schemaless type discovery, assume everything is a String for now, can change later.
          capabilities.setType(ValueType.STRING);
          capabilities.setDictionaryEncoded(true);
          capabilities.setHasBitmapIndexes(true);
          columnCapabilities.put(dimension, capabilities);
        }
        DimensionHandler handler = DimensionHandlerUtils.getHandlerFromCapabilities(dimension, capabilities, null);
        desc = addNewDimension(dimension, capabilities, handler);
      }
      DimensionHandler handler = desc.getHandler();
      DimensionIndexer indexer = desc.getIndexer();
      Object dimsKey = indexer.processRowValsToUnsortedEncodedKeyComponent(
          row.getRaw(dimension),
          reportParseExceptions
      );

      // Set column capabilities as data is coming in
      if (!capabilities.hasMultipleValues() && dimsKey != null && handler.getLengthOfEncodedKeyComponent(dimsKey) > 1) {
        capabilities.setHasMultipleValues(true);
      }

      if (wasNewDim) {
        if (overflow == null) {
          overflow = Lists.newArrayList();
        }
        overflow.add(dimsKey);
      } else if (desc.getIndex() > dims.length || dims[desc.getIndex()] != null) {
        /*
         * index > dims.length requires that we saw this dimension and added it to the dimensionOrder map,
         * otherwise index is null. Since dims is initialized based on the size of dimensionOrder on each
         * call to add, it must have been added to dimensionOrder during this InputRow.
         *
         * if we found an index for this dimension it means we've seen it already. If !(index > dims.length)
         * then we saw it on a previous input row (thus it's safe to index into dims). If we found a value
         * in the dims array for this index, it means we have seen this dimension already on this input row.
         */
        throw new ISE("Dimension[%s] occurred more than once in InputRow", dimension);
      } else {
        dims[desc.getIndex()] = dimsKey;
      }
    }
  }

  if (overflow != null) {
    // Merge overflow and non-overflow
    Object[] newDims = new Object[dims.length + overflow.size()];
    System.arraycopy(dims, 0, newDims, 0, dims.length);
    for (int i = 0; i < overflow.size(); ++i) {
      newDims[dims.length + i] = overflow.get(i);
    }
    dims = newDims;
  }

  long truncated = 0;
  if (row.getTimestamp() != null) {
    truncated = gran.bucketStart(row.getTimestamp()).getMillis();
  }

  return new TimeAndDims(Math.max(truncated, minTimestamp), dims, dimensionDescsList);
}
toTimeAndDims is invoked from the add() method, which in essence delegates to addToFacts:
public int add(InputRow row, boolean skipMaxRowsInMemoryCheck) throws IndexSizeExceededException {
  TimeAndDims key = toTimeAndDims(row);
  final int rv = addToFacts(
      metrics,
      deserializeComplexMetrics,
      reportParseExceptions,
      row,
      numEntries,
      key,
      in,
      rowSupplier,
      skipMaxRowsInMemoryCheck
  );
  updateMaxIngestedTime(row.getTimestamp());
  return rv;
}
The addToFacts method is the real entry point where aggregation begins:
protected abstract Integer addToFacts(
    InputRow row,
    IncrementalIndexRow key,
    ThreadLocal<InputRow> rowContainer,
    Supplier<InputRow> rowSupplier,
    boolean skipMaxRowsInMemoryCheck
)
Producing Aggregators from AggregatorFactory
First, from metrics (an AggregatorFactory array), an Aggregator array is produced:
aggs = new Aggregator[metrics.length];
factorizeAggs(metrics, aggs, rowContainer, row);
doAggregate(metrics, aggs, rowContainer, row, reportParseExceptions);

final int rowIndex = indexIncrement.getAndIncrement();
concurrentSet(rowIndex, aggs);

// Last ditch sanity checks
if (numEntries.get() >= maxRowCount
    && facts.getPriorIndex(key) == TimeAndDims.EMPTY_ROW_INDEX
    && !skipMaxRowsInMemoryCheck) {
  throw new IndexSizeExceededException("Maximum number of rows [%d] reached", maxRowCount);
}

final int prev = facts.putIfAbsent(key, rowIndex);
if (TimeAndDims.EMPTY_ROW_INDEX == prev) {
  numEntries.incrementAndGet();
} else {
  // We lost a race: another thread already registered a row for this key.
  // Take that row's aggregators and fold the current row into them.
  aggs = concurrentGet(prev);
  doAggregate(metrics, aggs, rowContainer, row, reportParseExceptions);
  // Free up the misfire
  concurrentRemove(rowIndex);
  // This is expected to occur ~80% of the time in the worst scenarios
}
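The putIfAbsent pattern above, where a thread that loses the race folds its row into the winner's aggregators, can be sketched with a ConcurrentHashMap. This is a simplified stand-in for the facts table and the Aggregator array (the key is a plain string, the "aggregator" is just a row counter, and all names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the rollup path in addToFacts: the first row for a key installs
// an "aggregator" (here a LongAdder counting rows); later rows for the same
// key find the existing one and aggregate into it instead.
public class RollupDemo {
  private final ConcurrentMap<String, LongAdder> facts = new ConcurrentHashMap<>();

  public void add(String key) {
    LongAdder fresh = new LongAdder();
    LongAdder prev = facts.putIfAbsent(key, fresh);
    // If prev is null we won the race and our adder was installed;
    // otherwise we "lost" and must aggregate into the existing one.
    LongAdder agg = (prev == null) ? fresh : prev;
    agg.increment(); // the equivalent of doAggregate(...)
  }

  public long count(String key) {
    LongAdder adder = facts.get(key);
    return adder == null ? 0 : adder.sum();
  }

  public static void main(String[] args) {
    RollupDemo index = new RollupDemo();
    index.add("2020-01-01|US");
    index.add("2020-01-01|US"); // same key: rolled up, not a new row
    index.add("2020-01-01|FR");
    System.out.println(index.count("2020-01-01|US")); // 2
    System.out.println(index.count("2020-01-01|FR")); // 1
  }
}
```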
The Aggregator for each metric is created in factorizeAggs(metrics, aggs, rowContainer, row). Each aggregator type implements aggregate() differently. For example, CountAggregator:
public void aggregate()
{
  ++count;
}
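For comparison, a metric other than count, say a long sum, would implement aggregate() by reading the current row's value from a column selector. A simplified sketch, where a plain LongSupplier stands in for Druid's value selector and the class name is illustrative:

```java
import java.util.function.LongSupplier;

// Simplified sketch of a sum aggregator in the style of Druid's Aggregator
// interface; LongSupplier stands in for a column value selector.
public class LongSumAggregatorDemo {
  private final LongSupplier selector;
  private long sum = 0;

  public LongSumAggregatorDemo(LongSupplier selector) {
    this.selector = selector;
  }

  // Called once per row folded into this aggregator.
  public void aggregate() {
    sum += selector.getAsLong();
  }

  public long getLong() {
    return sum;
  }

  public static void main(String[] args) {
    long[] rowValues = {3, 5, 7};
    final int[] cursor = {0};
    // The selector always reads the value at the current cursor position,
    // mimicking how a real selector tracks the row being ingested.
    LongSumAggregatorDemo agg = new LongSumAggregatorDemo(() -> rowValues[cursor[0]]);
    for (cursor[0] = 0; cursor[0] < rowValues.length; cursor[0]++) {
      agg.aggregate();
    }
    System.out.println(agg.getLong()); // 15
  }
}
```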
Complex aggregations must supply their own implementations. An aggregator is created for the first row of each key, and subsequent rows with the same key are merged into it. The merging happens in doAggregate, which invokes each aggregator's aggregate() method:
for (int i = 0; i < aggs.length; i++) {
  final Aggregator agg = aggs[i]; // the aggregator created for this metric
  synchronized (agg) {
    try {
      agg.aggregate(); // the actual aggregation happens here
    } catch (ParseException e) {
      // "aggregate" can throw ParseExceptions if a selector expects something but gets something else.
      if (reportParseExceptions) {
        throw new ParseException(e, "Encountered parse error for aggregator[%s]", metrics[i].getName());
      } else {
        log.debug(e, "Encountered parse error, skipping aggregator[%s].", metrics[i].getName());
      }
    }
  }
}
Persisting a segment
In IndexMergerV9's persist method, merge is called:
return merge(
    Collections.singletonList(
        new IncrementalIndexAdapter(
            dataInterval,
            index,
            indexSpec.getBitmapSerdeFactory().getBitmapFactory()
        )
    ),
    false, // rollup, no need to rollup again
    index.getMetricAggs(), // AggregatorFactory[]
    outDir,
    indexSpec,
    progress,
    segmentWriteOutMediumFactory
);