Page History

...

But with this partial aggregates, it is not possible to remove the values of the first event without knowing its content. In Odysseus we adapted the approach of Partial aggregates. Partial aggregates are built only from events that overlap in their time interval. For this, new events must be combined with different existing partial aggregates. The following figure illustrates an example.

Image Added

Fig. 2 illustrates an example. To make the gure more readable, this example shows a count aggregation. The AGGRE-GATE AGGREGATE operator keeps a state of all current partial aggregates.

In this example there are two partial aggregates with the count 4 and the count 2 in the upper part of the figure. If a new event arrives at the system (here with the validity from t1 to t4), the new event has to be compared with all contained events. The lower area shows the result of the processing. The event with count the 4 is not treated as it does no overlap with the new event. From t1 to t2 a new event with the count 1 is inserted, analogous from t3 to t4. Within the range from t2 to t3, an overlapping occurs. The new partial aggregate is derived from the old one with the new event, leading to a new partial aggregate with the count 3.

The final question is: When can any of the contained event be evaluated and sent to the next operator? This depends on how time progress is determined. When the stream is ordered by start time stamp, the partial aggregate with the count 4 will not be used anymore because every following event must have a time stamp higher that t1 and the partial aggregate can be removed from the state and sent to the next operator.

A special case is the treatment of groups, e.g., if not all elements should be counted, but the elements with a special attribute value (e.g. the count of bids for an auction). In this case, the AGGREGATE operator keeps an own state for each group and provides special handling for out of order events.

This operator is used to aggregate and group the input stream.

Parameter

group_by: An optional list of attributes over which the grouping should occur.
aggregations: A list if elements where each element contains up to four string:
- The name of the aggregate function, e.g. MAX
- The input attribute over which the aggregation should be done
- The name of the output attribute for this aggregation
- The optional type of the output. If not given, DOUBLE will be used.
dumpAtValueCount: This parameter is used in the interval approach. Here a result is generated, each time an aggregation cannot be changed anymore. This leads to fewer results with a larger validity. With this parameter the production rate can be raised. A value of 1 means, that for every new element the aggregation operator receives new output elements are generated. This leads to more results with a shorter validity.
outputPA: This parameter allow to dump partial aggregates instead of evaluted values. The partial aggregates can be send to other aggregation operators and do a final aggregation (e.g. in case of distribution). The input schema of an aggregate operator that read partial aggregates must state a datatype that is a partial aggregated (see example below). Remark: Aggregate has one input and requires ordered input. To combine different parital aggregations e.g. a union operator is needed to reorder the input elements.
drainAtDone: Boolean, default true: If done is called, all not already written elements will be written.
drainAtClose: Boolean, default false: If close is called, all not already written elements will be written.
FastGrouping: Use hash code instead of compare to create group. Potentially unsafe!

...

Space shortcuts

Page tree

Versions Compared

Old Version 12

New Version 13

Key