Hitachi Vantara Pentaho Documentation

Work with Aggregation Tables

In traditional OLAP systems, pre-aggregating data is a common way to improve performance when the fact table contains a very large number of rows (a million or more). An aggregate table is essentially a pre-computed summary of the data in the fact table. Analyzer for MongoDB supports this technique by allowing collections containing grouped and summarized data to be defined as aggregate collections within the schema.

Example of a Simple Aggregate Collection

Consider the Foodmart collection, which includes a fairly large number of attributes and three unsummarized measure values.

{
  "_id" : ObjectId("527ab2396a65f6f4b153cee5"),
  "product_subcategory" : "Nuts",
  "product_category" : "Specialty",
  "product_department" : "Produce",
  "product_family" : "Food",
  "product_id" : 195,
  "product_class_id" : 1,
  "the_year" : 1997,
  "the_month" : "June",
  "the_day" : "Sunday",
  "quarter" : "Q2",
  "country" : "USA",
  "state_province" : "WA",
  "city" : "Sedro Woolley",
  "marital_status" : "S",
  "gender" : "M",
  "store_sales" : 1.77,
  "store_cost" : 0.5664,
  "unit_sales" : 1
}

Perhaps analysts frequently want to see Unit Sales values broken down by month. To improve the performance of these queries, generate a collection that groups by month_of_year, quarter, and the_year, with summarized unit_sales values.

{ "unit_sales" : 21628, "fact_count" : 7034, "month_of_year" : 1, "quarter" : "Q1", "the_year" : 1997 }
{ "unit_sales" : 20957, "fact_count" : 6844, "month_of_year" : 2, "quarter" : "Q1", "the_year" : 1997 }
{ "unit_sales" : 23706, "fact_count" : 7710, "month_of_year" : 3, "quarter" : "Q1", "the_year" : 1997 }
{ "unit_sales" : 20179, "fact_count" : 6590, "month_of_year" : 4, "quarter" : "Q2", "the_year" : 1997 }
…

This aggregate collection can then be used in place of the full fact table wherever possible. Note that this collection includes a special fact_count value, which is simply the count of all fact table rows that were rolled up for that slice of the data. This value makes it possible to compute derived values such as averages from the aggregate collection. With MongoDB versions 2.5.2 and above, an aggregate collection like the one above can be created in a single aggregation pipeline query using the $out operator.

[{ $project : { month_of_year : 1, quarter : 1, the_year : 1, unit_sales : 1 } },
 { $group :
     { _id : { month_of_year : "$month_of_year", quarter : "$quarter", the_year : "$the_year" },
       unit_sales : { $sum : "$unit_sales" }, fact_count : { $sum : 1 } } },
 { $project : { _id : 0, month_of_year : "$_id.month_of_year", quarter : "$_id.quarter",
     the_year : "$_id.the_year", unit_sales : "$unit_sales", fact_count : "$fact_count" } },
 { $out : "agg_time_unit_sales" }]
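Because unit_sales is pre-summed, an average over the aggregate collection must divide by the summed fact_count, not by the number of aggregate documents. A minimal sketch in plain Python (illustrative only, using the sample aggregate documents shown above; this is not Analyzer's code):

```python
# Sample aggregate documents, shaped like the rollup shown earlier.
agg_docs = [
    {"unit_sales": 21628, "fact_count": 7034, "month_of_year": 1,
     "quarter": "Q1", "the_year": 1997},
    {"unit_sales": 20957, "fact_count": 6844, "month_of_year": 2,
     "quarter": "Q1", "the_year": 1997},
]

# Average unit sales per fact row across these months: divide the sum of
# the pre-summed measure by the sum of fact counts -- never by
# len(agg_docs), which would weight every month equally regardless of
# how many fact rows it rolled up.
total_sales = sum(d["unit_sales"] for d in agg_docs)
total_facts = sum(d["fact_count"] for d in agg_docs)
avg_unit_sales = total_sales / total_facts
```

The same pattern works for any average-style measure resolved against an aggregate collection.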

Include an Aggregate Collection in the Physical Schema

As described in the Physical Schema section, any MongoDB collection used within the model needs to be defined as a table, with all attributes specified as CalculatedColumnDefs. So the aggregate collection above must be included in the Physical Schema as a <Table> element.

<Table name='agg_time_unit_sales'>
  <ColumnDefs>
    <CalculatedColumnDef name='the_year' type='String'>
      <ExpressionView>
        <SQL dialect='generic'>
          $the_year
        </SQL>
      </ExpressionView>
    </CalculatedColumnDef>
    <CalculatedColumnDef name='quarter' type='String'>
      <ExpressionView>
        <SQL dialect='generic'>
          $quarter
        </SQL>
      </ExpressionView>
    </CalculatedColumnDef>
    <CalculatedColumnDef name='month_of_year' type='String'>
      <ExpressionView>
        <SQL dialect='generic'>
          $month_of_year
        </SQL>
      </ExpressionView>
    </CalculatedColumnDef>
    <CalculatedColumnDef name='unit_sales' type='String'>
      <ExpressionView>
        <SQL dialect='generic'>
          $unit_sales
        </SQL>
      </ExpressionView>
    </CalculatedColumnDef>
    <CalculatedColumnDef name='fact_count' type='String'>
      <ExpressionView>
        <SQL dialect='generic'>
          $fact_count
        </SQL>
      </ExpressionView>
    </CalculatedColumnDef>
  </ColumnDefs>
</Table>

Add an Aggregate Measure Group

Once the aggregate table has been defined in the Physical Schema, it can be added to your cube as a new measure group. A <MeasureGroup> defined with type='aggregate' indicates that the measures and dimensions contained within that group can be resolved by way of the aggregate collection.

<MeasureGroup table='agg_time_unit_sales' type='aggregate'>
  <Measures>
    <MeasureRef name='Fact Count' aggColumn='fact_count'/>
    <MeasureRef name='Unit Sales' aggColumn='unit_sales'/>
  </Measures>
  <DimensionLinks>
    <CopyLink dimension='Time' attribute='Month'>
      <Column aggColumn='the_year' table='sales' name='the_year'/>
      <Column aggColumn='quarter' table='sales' name='quarter'/>
      <Column aggColumn='month_of_year' table='sales' name='month_of_year'/>
    </CopyLink>
  </DimensionLinks>
</MeasureGroup>

If your cube defines multiple aggregate measure groups, Analyzer for MongoDB selects the optimal one by determining which ones contain all of the necessary attributes and measures. When more than one collection meets these criteria, the collection with the fewest documents is selected.
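The selection rule can be illustrated with a short, purely hypothetical Python sketch (collection names, columns, and document counts are invented for this example; this is not Analyzer's actual code):

```python
# Hypothetical metadata for candidate collections: the columns each one
# carries and how many documents it holds.
candidates = [
    {"name": "agg_time_unit_sales",
     "columns": {"the_year", "quarter", "month_of_year",
                 "unit_sales", "fact_count"},
     "doc_count": 24},
    {"name": "agg_year_unit_sales",
     "columns": {"the_year", "unit_sales", "fact_count"},
     "doc_count": 2},
    {"name": "sales",  # the full fact collection always qualifies
     "columns": {"the_year", "quarter", "month_of_year", "gender",
                 "unit_sales", "store_sales", "store_cost"},
     "doc_count": 86837},
]

def pick_collection(needed, candidates):
    """Keep candidates that cover every needed column, then take the
    one with the fewest documents."""
    usable = [c for c in candidates if needed <= c["columns"]]
    return min(usable, key=lambda c: c["doc_count"])["name"]
```

A year-only query would resolve to the tiny agg_year_unit_sales collection, a quarterly query to agg_time_unit_sales, and anything involving gender would fall back to the fact collection.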

Work with Distinct-Count Measures

There is an important exception for distinct-count measures: they cannot in general be rolled up over arbitrary dimensions. To see why, consider the case of a supermarket chain which has two stores in the same city. Suppose that Store A has 1000 visits from 800 distinct customers in the month of July, while Store B has 1500 visits from 900 distinct customers. Clearly the two stores had a total of 2500 customer visits between them, but how many distinct customers? We can say that there were at least 900, and maybe as many as 1700, but assuming that some customers visited both stores, the real total will be somewhere in between. "Distinct customers" is an example of a distinct-count measure, and cannot be deduced by rolling up subtotals. You have to go back to the raw data in the fact table.

There is a special case where it is acceptable to roll up distinct count measures. Suppose that we know that in July, this city's stores (Store A and B combined) have visits from 600 distinct female customers and 700 distinct male customers. Can we say that the number of distinct customers in July is 1300? Yes we can, because we know that the sets of male and female customers cannot possibly overlap. In technical terms, gender is functionally dependent on customer id.

The rule for rolling up distinct measures can be stated as follows:

" A distinct count measure over key k can be computed by rolling up more granular subtotals only if the attributes which are being rolled up are functionally dependent on k."

Even with this special case, it is difficult to create enough aggregate tables to satisfy every possible query. When evaluating a distinct-count measure, Mondrian can only use an aggregate table if it has the same logical/level granularity as the cell being requested, or can be rolled up to that granularity only by dropping functionally dependent attributes. If there is no aggregate table of the desired granularity, Mondrian goes instead against the fact table.
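One way to picture that eligibility check is as a short, purely illustrative Python sketch (the function and its inputs are invented for this example and do not reflect Mondrian's internal API):

```python
def can_serve_distinct_count(agg_attrs, requested_attrs, fdep_attrs):
    """True if an aggregate at granularity `agg_attrs` can answer a
    distinct-count cell at `requested_attrs`. Extra attributes may be
    dropped only when they are functionally dependent on the count key
    (`fdep_attrs`); anything else would force an unsafe rollup."""
    if not requested_attrs <= agg_attrs:
        return False  # aggregate is missing a requested attribute
    extra = agg_attrs - requested_attrs
    return extra <= fdep_attrs  # safe only if every extra attribute is dependent
```

For a distinct-customers measure, an aggregate at (month, gender) granularity can answer a month-level cell, but an aggregate at (month, store) granularity cannot, and the query goes to the fact table instead.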

This has implications for aggregate design. If your application makes extensive use of distinct-count measures, you will need to create an aggregate table for each granularity where it is used. That could be a lot of aggregate tables!

That said, Mondrian will roll up measures in an aggregate table that contains one or more distinct-count measures if none of the distinct-count measures are requested. In that respect, an aggregate table containing distinct-count measures is just like any other aggregate table as long as the distinct-count measures are not needed. And once in memory, distinct-count measures are cached like other measures and can be used for future queries.

When building an aggregate table that will contain a distinct-count measure, the measure must be rolled up to a logical dimension level, which is to say that the aggregate table must be a collapsed dimension aggregate. If it is rolled up only to the dimension's foreign key, there is no guarantee that the foreign key is at the same granularity as the lowest logical level, which is what is used by MDX requests. So for an aggregate table that only rolls the distinct-count measure to the foreign key granularity, a request of that distinct-count measure may result in further rollup and, therefore, an error.

The general rule when building aggregate tables involving distinct-count measures is that no foreign keys may remain in the aggregate table: each base table foreign key must either be dropped (a lost dimension aggregate) or replaced with levels (a collapsed dimension aggregate). In fact, this rule, though not required, is useful to follow when creating any aggregate table; there is no value in maintaining foreign keys in aggregate tables. They should be replaced by collapsing to levels, unless the extra storage consumed by such aggregate tables is too much for your database system.