Pentaho Documentation

Process Rows

The class implementing StepInterface is responsible for the actual row processing when the transformation runs. 

The implementing class can rely on the base class for most of its behavior and needs to implement only three important methods itself. These methods correspond to the step life cycle during transformation execution: initialization, row processing, and clean-up. 


Figure: Step plugin life cycle (step_plugin_lifecycle.png)

During initialization, PDI calls the init() method of each step once. After all steps have initialized, PDI calls processRow() repeatedly until the step signals that it is done processing all rows. After the step has finished processing rows, PDI calls dispose().

Each of these methods takes a StepMetaInterface object and a StepDataInterface object as parameters. Both objects can be safely cast down to the specific implementation classes of the step. 

Aside from the methods it needs to implement, there is one additional and very important rule: the class must not declare any fields. All variables must be kept as part of the class implementing StepDataInterface. In practice this is not a problem, since the object implementing StepDataInterface is passed in to all relevant methods, and its fields are used in place of fields on the step class. The reason for this rule is the need to decouple step variables from instances of StepInterface. This enables PDI to implement different threading models to execute a transformation.
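The rule can be illustrated with a minimal sketch. The types below (DemoStepData, CounterStepData, CounterStep) are hypothetical stand-ins invented for this example, not the real PDI interfaces, and the processRow() signature is simplified. The point is only that the step class holds no state of its own, so one step instance can work against independent data objects.

```java
// Hypothetical stand-in for StepDataInterface; not the real PDI type.
interface DemoStepData {}

// All per-execution state lives on the data class.
class CounterStepData implements DemoStepData {
    long rowsSeen = 0;
}

// The step class declares no fields; it only operates on the data object passed in.
class CounterStep {
    // Simplified signature: the row is passed in directly for illustration.
    boolean processRow(DemoStepData sdi, Object[] row) {
        CounterStepData data = (CounterStepData) sdi; // downcast to the step's own data class
        if (row == null) {
            return false; // no more rows
        }
        data.rowsSeen++;
        return true;
    }
}
```

Because the counter lives on the data object, calling the same CounterStep with two different CounterStepData instances keeps the two counts separate.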

Step Initialization

The init() method is called when a transformation is preparing to start execution. 

public boolean init(StepMetaInterface smi, StepDataInterface sdi)

Every step is given the opportunity to do one-time initialization tasks, such as opening files or establishing database connections. For any steps derived from BaseStep, it is mandatory that super.init() is called to ensure correct behavior. The method returns true if the step initialized correctly and false if there was an initialization error. PDI aborts the execution of a transformation if any step returns false upon initialization.
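A minimal sketch of this contract follows. SketchBaseStep, GreetingMeta, GreetingData, and GreetingStep are hypothetical names made up for the example; the base class here only imitates the shape of BaseStep, and an in-memory StringReader takes the place of a real file or connection.

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for BaseStep; the real class does much more during init().
class SketchBaseStep {
    boolean init(Object smi, Object sdi) {
        return true;
    }
}

// Plays the StepMetaInterface role: the step's configuration.
class GreetingMeta {
    String template;
}

// Plays the StepDataInterface role: the step's runtime state.
class GreetingData {
    Reader source;
}

class GreetingStep extends SketchBaseStep {
    @Override
    boolean init(Object smi, Object sdi) {
        GreetingMeta meta = (GreetingMeta) smi;
        GreetingData data = (GreetingData) sdi;

        // Let the base class initialize first; give up if it fails.
        if (!super.init(smi, sdi)) {
            return false;
        }
        // Report an initialization error by returning false.
        if (meta.template == null) {
            return false;
        }
        // One-time resource setup, stored on the data object rather than the step.
        data.source = new StringReader(meta.template);
        return true;
    }
}
```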

Row Processing

Once the transformation starts, it enters a tight loop, calling processRow() on each step until the method returns false. In most cases, each step reads a single row from the input stream, alters the row structure and fields, and passes the row on to the next step. Some steps, such as input, grouping, and sorting steps, read rows in batches, or can hold on to the read rows to perform other processing before passing them on to the next step.

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException

A PDI step queries for incoming input rows by calling getRow(), which is a blocking call that returns a row object or null in case there is no more input. If there is an input row, the step does the necessary row processing and calls putRow() to pass the row on to the next step. If there are no more rows, the step calls setOutputDone() and returns false.

The method must conform to these rules:

  • If the step is done processing all rows, the method calls setOutputDone() and returns false.
  • If the step is not done processing all rows, the method returns true. PDI calls processRow() again in this case.
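The two rules above can be sketched as follows. LoopBaseStep is a hypothetical stand-in for BaseStep written for this example: its getRow() reads from an in-memory queue instead of blocking on a previous step, and its putRow() takes only the row, which is a simplification of the real call. UpperCaseStep is likewise an invented example step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for BaseStep, backed by in-memory collections.
class LoopBaseStep {
    final Deque<Object[]> input = new ArrayDeque<>();
    final List<Object[]> output = new ArrayList<>();
    boolean outputDone = false;

    Object[] getRow() { return input.poll(); }       // returns null when input is exhausted
    void putRow(Object[] row) { output.add(row); }   // simplified: the real call also takes row metadata
    void setOutputDone() { outputDone = true; }
}

class UpperCaseStep extends LoopBaseStep {
    boolean processRow(Object smi, Object sdi) {
        Object[] row = getRow();
        if (row == null) {
            // Done with all rows: signal it and stop the loop.
            setOutputDone();
            return false;
        }
        // Process one row and pass it on.
        Object[] out = row.clone();
        out[0] = String.valueOf(row[0]).toUpperCase(Locale.ROOT);
        putRow(out);
        // Not done yet: ask to be called again.
        return true;
    }
}
```

A caller imitating the transformation loop would simply repeat `while (step.processRow(meta, data)) { }` until the method returns false.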

The sample step plugin project shows an implementation of processRow() that is commonly used in data processing steps.

In contrast, input steps do not usually expect any incoming rows from previous steps. They are designed to execute processRow() exactly once, fetching data from the outside world and putting it into the row stream by calling putRow() repeatedly until done. Examining existing PDI steps is a good guide for designing your processRow() method. 
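The input-step pattern might look like the sketch below. SourceBaseStep, LineSourceData, and LineSourceStep are hypothetical names for this example, and a list of strings held on the data object stands in for the outside data source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BaseStep, reduced to the two calls this sketch needs.
class SourceBaseStep {
    final List<Object[]> output = new ArrayList<>();
    boolean outputDone = false;

    void putRow(Object[] row) { output.add(row); }   // simplified: the real call also takes row metadata
    void setOutputDone() { outputDone = true; }
}

// Plays the StepDataInterface role: holds the external data the step will emit.
class LineSourceData {
    List<String> lines = new ArrayList<>();
}

class LineSourceStep extends SourceBaseStep {
    boolean processRow(Object smi, Object sdi) {
        LineSourceData data = (LineSourceData) sdi;
        // No getRow() call: this step produces rows instead of consuming them.
        for (String line : data.lines) {
            putRow(new Object[] { line });
        }
        // Everything was emitted in a single invocation, so finish immediately.
        setOutputDone();
        return false;
    }
}
```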

The row structure object is used during the first invocation of processRow() to determine the indexes of fields on which the step operates. The BaseStep class already provides a convenient first flag to help implement special processing on the first invocation of processRow(). Since the row structure is the same for all input rows, steps cache field index information in variables on their StepDataInterface object.
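A minimal sketch of this caching pattern follows. IndexBaseStep, LengthMeta, LengthData, and LengthStep are hypothetical names for this example; a plain list of field names stands in for the row structure object, and the first flag is declared on the stand-in base class to mirror the one BaseStep provides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for BaseStep; inputFieldNames imitates the row structure object.
class IndexBaseStep {
    boolean first = true;
    List<String> inputFieldNames = new ArrayList<>();
    final Deque<Object[]> input = new ArrayDeque<>();
    final List<Object[]> output = new ArrayList<>();

    Object[] getRow() { return input.poll(); }
    void putRow(Object[] row) { output.add(row); }
    void setOutputDone() { }
}

class LengthMeta { String fieldName; }      // configuration: which field to measure
class LengthData { int fieldIndex = -1; }   // runtime state: the cached field index

class LengthStep extends IndexBaseStep {
    boolean processRow(Object smi, Object sdi) {
        LengthMeta meta = (LengthMeta) smi;
        LengthData data = (LengthData) sdi;

        Object[] row = getRow();
        if (row == null) {
            setOutputDone();
            return false;
        }
        if (first) {
            first = false;
            // Look the field up once and cache its index on the data object.
            data.fieldIndex = inputFieldNames.indexOf(meta.fieldName);
            if (data.fieldIndex < 0) {
                throw new IllegalStateException("Field not found: " + meta.fieldName);
            }
        }
        // Later rows reuse the cached index; append the string length as a new field.
        Object[] out = Arrays.copyOf(row, row.length + 1);
        out[row.length] = String.valueOf(row[data.fieldIndex]).length();
        putRow(out);
        return true;
    }
}
```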

Step Clean-Up

Once the transformation is complete, PDI calls dispose() on all steps. 

public void dispose(StepMetaInterface smi, StepDataInterface sdi)

Steps are required to deallocate resources allocated during init() or subsequent row processing. Your implementation should clear all fields of the StepDataInterface object, and ensure that all open files or connections are properly closed. For any steps derived from BaseStep, it is mandatory that super.dispose() is called to ensure correct deallocation.
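The clean-up described here might be sketched as follows. DisposeBaseStep, FileReadData, and FileReadStep are hypothetical names for this example, and the base class only records that it was called. Calling super.dispose() last, inside a finally block, is one reasonable ordering rather than a documented requirement.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical stand-in for BaseStep; it only records that dispose() reached it.
class DisposeBaseStep {
    boolean baseDisposed = false;

    void dispose(Object smi, Object sdi) {
        baseDisposed = true;
    }
}

// Plays the StepDataInterface role: resources and cached values held during the run.
class FileReadData {
    Reader reader;
    int[] cachedIndexes;
}

class FileReadStep extends DisposeBaseStep {
    @Override
    void dispose(Object smi, Object sdi) {
        FileReadData data = (FileReadData) sdi;
        try {
            // Close anything opened in init() or during row processing.
            if (data.reader != null) {
                data.reader.close();
            }
        } catch (IOException e) {
            // A real step would log the failure; the sketch ignores it.
        } finally {
            // Clear the data object's fields, then let the base class clean up.
            data.reader = null;
            data.cachedIndexes = null;
            super.dispose(smi, sdi);
        }
    }
}
```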