
Pentaho Data Integration performance tips

To substantially increase performance in Pentaho Repository transactions, we recommend upgrading to the latest version of Pentaho Data Integration (PDI). Besides upgrading, here are some tips and tricks to improve PDI performance. Most tips involve streamlining jobs and transformations. The following tips may help you to identify and correct performance-related issues associated with PDI transformations:

Turn off compatibility mode (JavaScript step)
Rewriting JavaScript to use a format that is not compatible with previous versions is, in most instances, easy to do and makes scripts easier to work with and to read. By default, old JavaScript programs run in compatibility mode, which means the step processes data as it did in a previous version. You may see a small performance drop because of the overhead of forcing compatibility. If you want to make use of the new architecture, disable compatibility mode and change the code as shown below:
  • intField.getInteger() --> intField
  • numberField.getNumber() --> numberField
  • dateField.getDate() --> dateField
  • bigNumberField.getBigNumber() --> bigNumberField
  • and so on...
Instead of Java methods, use the built-in library. Notice that the resulting program code is more intuitive. For example:
  • checking for null is now: field.isNull() --> field==null
  • Converting string to date: field.Clone().str2dat() --> str2date(field)
  • and so on...
If you convert your code as shown above, you may get significant performance benefits; a small before-and-after sketch follows this entry.
Note: It is no longer possible to modify data in-place using the value methods. This was a design decision to ensure that no data with the wrong type ends up in the output rows of the step. Instead of modifying fields in place, create new fields using the table at the bottom of the Modified JavaScript step.
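
To make the change concrete, here is a minimal before-and-after sketch of a Modified JavaScript script; the field names priceField, customerField, and orderDateField are hypothetical:

  // Old script written for compatibility mode (value methods):
  var vat = priceField.getNumber() * 0.21;
  var missing = customerField.isNull();
  var orderDate = orderDateField.Clone().str2dat();

  // The same logic with compatibility mode turned off (plain values and built-in functions):
  var vat = priceField * 0.21;
  var missing = (customerField == null);
  var orderDate = str2date(orderDateField);

  // Add vat, missing, and orderDate as new output fields in the Fields table
  // at the bottom of the step so they appear in the output rows.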

Combine steps (JavaScript step)
One large JavaScript step runs faster than three consecutive smaller steps. Combining processes in one larger step helps to reduce overhead.

Avoid the JavaScript step or write a custom plugin
Remember that while JavaScript is the fastest scripting language for Java, it is still a scripting language. If you do the same amount of work in a native step or plugin, you avoid the overhead of the JavaScript scripting engine, which has been known to result in significant performance gains. It is also the primary reason why the Calculator step was created: to avoid the use of JavaScript for simple calculations.

Create a copy of a field (JavaScript step)
No JavaScript is required for this; a Select Values step does the trick. You can specify the same field twice: once without a rename and once (or more) with a rename. Another trick is to use B=NVL(A,A) in a Calculator step, where B is forced to be a copy of A. An explicit "create copy of field A" function has also been added to the Calculator step.

Data conversion (JavaScript step)
Consider performing conversions between data types (dates, numeric data, and so on) in a Select Values step. You can do this on the Metadata tab of the step.

Variable creation (JavaScript step)
If you have variables that can be declared once at the beginning of the transformation, put them in a separate script and mark that script as a startup script (right-click the script name in the tab). JavaScript object creation is time consuming, so if you can avoid creating a new object for every row you are transforming, you will see a performance boost for the step, as in the sketch below.
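
As a minimal sketch of the startup-script approach (the field name dateStringField and the format string are hypothetical), declare an expensive object once in the startup script and reuse it in the main script:

  // Startup script (marked as a startup script; runs once, before the first row):
  var dateFormatter = new java.text.SimpleDateFormat("yyyy-MM-dd");

  // Main transformation script (runs for every row) reuses the formatter
  // instead of constructing a new SimpleDateFormat object per row:
  var parsedDate = dateFormatter.parse(dateStringField);

  // Add parsedDate as a new output field of type Date in the Fields table.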

Launch several copies of a step
There are two important reasons why launching multiple copies of a step may result in better performance:
  1. The step uses a lot of CPU resources and you have multiple processor cores in your computer. Example: a JavaScript step
  2. Network latency: launching multiple copies of a step can reduce the average latency per row. For example, with a network latency of 5 ms and one database round trip per row, a single step copy is limited to about 200 rows per second (1000 ms / 5 ms), even if the database is running smoothly; five copies raise that ceiling to roughly 1,000 rows per second. You can try to reduce the round trips with caching, but if that is not possible, run multiple copies instead. Example: a database lookup or Table Output step

Manage thread priorities
This option, found in the Transformation settings dialog box under the Misc tab, improves performance by reducing the locking overhead in certain situations. It is enabled by default for new transformations created in recent versions, but it may be turned off in older transformations.

If possible, don't remove fields in Select Values
Don't remove fields in a Select Values step unless you must. It is a CPU-intensive task because the engine needs to reconstruct the complete row. It is almost always faster to add fields to a row than to delete fields from it.

Watch your use of Get Variables
The Get Variables step may cause bottlenecks if you use it in a high-volume stream (accepting input). To solve the problem, take the Get Variables step out of the transformation (right-click, detach), then re-insert it using a Join Rows (cartesian product) step. In the Join Rows step, make sure to specify the main step from which to read: set it to the step that originally provided the Get Variables step with data.

Use the new text file input steps
The new CSV Input and Fixed Input steps provide optimal performance. If you have a fixed-width (field/row) input file, you can even read data in parallel with multiple step copies. These steps have been rewritten using non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.

When appropriate, use lazy conversion
If you read data from a text file and write it back to a text file, use lazy conversion to speed up the process. The principle behind lazy conversion is that it delays data conversion in the hope that the conversion turns out to be unnecessary (reading from a file and writing it back is the typical case). Beyond avoiding conversions, lazy conversion also keeps the data in "binary" storage form, which helps the internal Kettle engine perform faster data serialization (sort, clustering, and so on). The Lazy conversion option is available in the CSV Input and Fixed input text-file reading steps.

Use Join Rows
You need to specify the main step from which to read. This prevents the step from performing unnecessary spooling to disk. If you are joining with a set of data that fits into memory, make sure that the cache size (in rows of data) is large enough; this prevents slow spooling to disk.

Review the big picture: database, commit size, row set size, and other factors
Consider how the whole environment influences performance. There can be limiting factors in the transformation itself and limiting factors that result from other applications and from PDI. Performance depends on your database, your tables, indexes, the JDBC driver, your hardware, the speed of the LAN connection to the database, the row size of the data, and the transformation itself. Test performance using different commit sizes and different numbers of rows in the row sets in your transformation settings, and consider changing buffer sizes in your JDBC drivers or database.

Step Performance Monitoring
You can track the performance of individual steps in a transformation. Step Performance Monitoring is an important tool that allows you to identify the slowest step in your transformation.

Limit in-memory log output

PDI logs data about transformations and jobs according to default parameters that control how many lines are allowed in the log and how long the oldest line stays in memory before it is released. The more lines that are recorded and the longer they are kept, the more heap space they consume. If you are experiencing memory shortages or slow performance with your PDI content, you can address the problem by limiting in-memory logging.

In Spoon, the following parameters control logging:

  • KETTLE_MAX_LOG_SIZE_IN_LINES

    The maximum number of log lines that are kept internally by Kettle. Setting this to 0 (the default) forces PDI to keep all log lines.

  • KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES

    The maximum age (in minutes) that a log line is kept internally by PDI. Setting this to 0 (the default) keeps all log lines indefinitely.

  • KETTLE_MAX_JOB_TRACKER_SIZE

    The maximum number of job trackers kept in memory. The default value is 1000.

  • KETTLE_MAX_JOB_ENTRIES_LOGGED

    The maximum number of job entry results kept in memory for logging purposes. The default value is 1000.

  • KETTLE_MAX_LOGGING_REGISTRY_SIZE

    The maximum number of logging registry entries kept in memory for logging purposes. The default value is 1000.

The equivalent parameters to the first two variables, which can be set on each KTR or KJB individually using Kitchen or Pan, are:

  • maxloglines
  • maxlogtimeout
Set these values to the lowest non-zero values that your operations can tolerate. If you are using logging for any purpose, you must balance tolerable performance against the logging functionality you need.
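
As an illustrative sketch (the values and file path are placeholders, not recommendations), the global variables can be set in kettle.properties, and, depending on your PDI version, the per-run equivalents can be passed to Pan or Kitchen on the command line:

  # kettle.properties (applies to everything run from this PDI installation)
  KETTLE_MAX_LOG_SIZE_IN_LINES=5000
  KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES=240

  # Per-run override when executing a single transformation with Pan
  ./pan.sh -file=/path/to/transformation.ktr -maxloglines=5000 -maxlogtimeout=240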

HBase Output performance considerations

The HBase Output step's Configure connection tab provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (defined in the hbase-default.xml file) is 2 MB. When the field is left blank, the buffer is 2 MB, auto flush is enabled, and Put operations are executed immediately: each row is transmitted to HBase as soon as it arrives at the step. Entering a number (even if it is the same as the default) for the size of the write buffer disables auto flush, and incoming rows are transferred only once the buffer is full.
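
For reference, the 2 MB default corresponds to the hbase.client.write.buffer property in hbase-default.xml (the value is in bytes); this is standard HBase client configuration rather than a PDI setting:

  <property>
    <name>hbase.client.write.buffer</name>
    <value>2097152</value>
  </property>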

There is also a checkbox for Disable write to WAL that disables writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. However, the tradeoff for error-recovery is speed.

The Create/Edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters for specifying what kind of compression to use, and whether or not to use Bloom filters to speed up lookups. The options for compression are: NONE, GZ and LZO; the options for Bloom filters are: NONE, ROW, and ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called "NewTable" should be created with GZ compression and ROWCOL Bloom filters:

NewTable@GZ@ROWCOL
Important: Due to licensing constraints, HBase does not ship with LZO compression libraries; these must be manually installed on each node if you want to use LZO compression.

HBase Input performance considerations

Specifying fields in the HBase Input step's Configure query tab will result in scans that return just those columns. Since HBase is a sparse column-oriented database, this requires HBase to check whether each row contains a specific column. More lookups equate to reduced speed, although the use of Bloom filters (if enabled on the table in question) mitigates this to a certain extent. If, on the other hand, the fields table in the Configure query tab is left blank, the scan returns all columns that exist in each row (not only those that have been defined in the mapping); the HBase Input step then emits only those columns that are defined in the mapping being used. Because all columns are returned, HBase does not have to do any lookups. However, if the table in question contains many columns and is dense, this will result in more data being transferred over the network.