This step groups rows from a source, based on a specified field or collection of fields. A new row is generated for each group. It can also generate one or more aggregate values for the groups. Common uses are calculating the average sales per product and counting the number of an item you have in stock.
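As a rough illustration of the idea (not PDI code), the following Python sketch groups hypothetical sales rows by product and computes two aggregates per group, an average and a count. The sample data and the names `group_average_and_count`, `avg_sales`, and `count` are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample rows: (product, sale_amount)
rows = [
    ("apple", 10.0),
    ("apple", 20.0),
    ("pear", 5.0),
    ("pear", 7.0),
    ("pear", 9.0),
]

def group_average_and_count(rows):
    """Return one summary entry per product with its average sale and row count."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for product, amount in rows:
        totals[product] += amount
        counts[product] += 1
    # One output entry per group, mirroring "a new row is generated for each group".
    return {
        product: {"avg_sales": totals[product] / counts[product], "count": counts[product]}
        for product in totals
    }

summary = group_average_and_count(rows)
for product, aggregates in summary.items():
    print(product, aggregates)
```

Each key in `summary` corresponds to one output row of the step, and each entry in the inner dictionary corresponds to one aggregate field.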
The Group By step is designed for sorted inputs. If your input is not sorted, only consecutive rows with the same values in the group fields are grouped together, so the same group can appear more than once in the output. If you sort the data outside of PDI, the case sensitivity of the data in the fields may produce unexpected grouping results.
You can use the Memory Group By step to handle non-sorted input.
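The effect of unsorted input can be sketched in Python, whose `itertools.groupby` also groups only consecutive equal keys. This is an analogy, not how PDI is implemented: the function names `count_consecutive` and `count_in_memory` and the sample keys are hypothetical, and the case-sensitivity example shows just one plausible way an external, case-insensitive sort could split a group.

```python
from collections import Counter
from itertools import groupby

# Hypothetical group-field values arriving in unsorted order.
keys = ["a", "a", "b", "a"]

def count_consecutive(keys):
    """Count rows per run of consecutive equal keys (sorted-input style grouping)."""
    return [(key, len(list(run))) for key, run in groupby(keys)]

def count_in_memory(keys):
    """Count rows per key regardless of order (in-memory style grouping)."""
    return dict(Counter(keys))

unsorted_result = count_consecutive(keys)        # "a" appears as two separate groups
sorted_result = count_consecutive(sorted(keys))  # one group per key
memory_result = count_in_memory(keys)            # one group per key, no sort needed

# Assumed scenario: an external case-insensitive sort leaves "a" and "A"
# interleaved, so a case-sensitive comparison splits them into separate runs.
mixed_case = sorted(["a", "A", "a"], key=str.lower)
mixed_case_result = count_consecutive(mixed_case)

print(unsorted_result)
print(sorted_result)
print(memory_result)
print(mixed_case_result)
```

Grouping by consecutive runs needs to hold only the current group in memory, which is why it depends on sorted input. Collecting every key in memory removes that requirement at the cost of holding all groups at once.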
Select an engine
You can run the Group By step on the Pentaho engine or on the Spark engine. Depending on your selected engine, the transformation will run differently. Select one of the following options to view how to set up the Group By step for your selected engine.
- Using the Group By step on the Pentaho engine: Learn how to set up this step when using the Pentaho engine.
- Using the Group By step on the Spark engine: Learn how to set up this step when using the Spark engine.
For instructions on selecting an engine for your transformation, see Run configurations.