Pentaho Documentation

Transformation Step Reference

Overview

This page indexes the documentation for the supported Pentaho Data Integration transformation steps.

Steps: A - F

Name Category  Description
Abort Flow Abort a transformation
Add a checksum Transform Add a checksum column for each input row
Add constants Transform Add one or more constants to the input rows
Add sequence Transform Get the next value from a sequence
Add value fields changing sequence Transform Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.
Add XML Transform Encode several fields into an XML fragment
Analytic Query Statistics Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST)
Append streams Flow Append 2 streams in an ordered way
ARFF Output Data Mining Writes data in ARFF format to a file
Automatic Documentation Output Output This step automatically generates documentation based on input in the form of a list of transformations and jobs
Avro input Input Decode binary or JSON Avro data from a file or a field
Block this step until steps finish Flow Block this step until selected steps finish.
Blocking Step Flow This step blocks until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.
Calculator Transform Create new fields by performing simple calculations
Call DB Procedure Lookup Get back information by calling a database procedure.
Call Endpoint BA Server  Calls API endpoints from the BA server within a PDI transformation.
Change file encoding Utility Change file encoding and create a new file
Cassandra input Big Data Read from a Cassandra column family
Cassandra output Big Data Write to a Cassandra column family
Check if a column exists Lookup Check if a column exists in a table on a specified connection.
Check if file is locked Lookup Check if a file is locked by another process
Check if webservice is available Lookup Check if a webservice is available
Clone row Utility Clone a row as many times as needed
Closure Generator Transform This step allows you to generate a closure table using parent-child relationships.
Combination lookup/update Data Warehouse Update a junk dimension in a data warehouse, or look up information in this dimension. The primary key of a junk dimension consists of all its fields.
Concat Fields Transform The Concat Fields step is used to concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.
Copy rows to result Job Use this step to write rows to the executing job. The information will then be passed to the next entry in this job.
CouchDB Input Big Data Retrieves all documents from a given view in a given design document from a given database
Credit card validator Validation Tells you: (1) whether a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and (2) which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), ...)
CSV file input Input Simple CSV file input
Data Grid Input Enter rows of static data in a grid, usually for testing, reference or demo purpose
Data Validator Validation Validates passing data based on a set of rules
Database join Lookup Execute a database query using stream values as parameters
Database lookup Lookup Look up values in a database using field values
De-serialize from file Input Read rows of data from a data cube.
Delay row Utility Output each input row after a delay
Delete Output Delete data in a database table based upon keys
Detect empty stream Flow This step outputs one empty row if the input stream is empty (i.e. when the input stream does not contain any rows)
Dimension lookup/update Data Warehouse Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.
Dummy (do nothing) Flow This step type doesn't do anything. It's useful however when testing things or in certain situations where you want to split streams.
Dynamic SQL row Lookup Execute a dynamic SQL statement built in a previous field
Edi to XML Utility Converts an EDIFACT message to XML to simplify data extraction (available in PDI 4.4, already present in CI trunk builds)
ElasticSearch Bulk Insert Bulk loading Performs bulk inserts into ElasticSearch
Email messages input Input Read POP3/IMAP server and retrieve messages
ESRI Shapefile Reader Input Reads shape file data from an ESRI shape file and linked DBF file
ETL Metadata Injection Flow This step allows you to inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.
Example plugin Transform This is an example for a plugin test step
Execute a process Utility Execute a process and return the result
Execute row SQL script Scripting Execute SQL script extracted from a field created in a previous step.
Execute SQL script Scripting Execute an SQL script, optionally parameterized using input rows
File exists Lookup Check if a file exists
Filter Rows Flow Filter rows using simple equations
Fixed file input Input Fixed file input
Formula Scripting Calculate a formula using Pentaho's libformula
Fuzzy match Lookup Find approximate matches to a string using matching algorithms. Reads a field from the main stream and outputs the approximate match from the lookup stream.
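The Credit card validator entry above mentions the LUHN10 (MOD-10) algorithm. As a minimal illustrative sketch (not the step's actual implementation), the check doubles every second digit from the right, subtracts 9 from any result above 9, and requires the sum to be divisible by 10:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the LUHN10 (MOD-10) check."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk right-to-left; double every second digit, subtracting 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 4111111111111111 is a well-known Luhn-valid test number.
print(luhn_valid("4111111111111111"))  # True
print(luhn_valid("4111111111111112"))  # False
```

The same check underlies the Generate random credit card numbers step, which produces numbers that pass this validation.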

 

Steps: G - L

Name Category  Description
Generate random credit card numbers Input Generate random valid (Luhn check) credit card numbers
Generate random value Input Generate random value
Generate Rows Input Generate a number of empty or equal rows.
Get data from XML Input Get data from XML file by using XPath. This step also allows you to parse XML defined in a previous field.
Get File Names Input Get file names from the operating system and send them to the next step.
Get files from result Job This step allows you to read filenames used or generated in a previous entry in a job.
Get Files Rows Count Input Get Files Rows Count
Get ID from slave server Transform Retrieves unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.
Get repository names Input Lists detailed information about transformations and/or jobs in a repository
Get rows from result Job This allows you to read rows from a previous entry in a job
Get Session Variables BA Server  Retrieves the value of a session variable
Get SubFolder names Input Read a parent folder and return all subfolders
Get System Info Input Get information from the system like system date, arguments, etc.
Get table names Input Get table names from database connection and send them to the next step
Get Variables Job Determine the values of certain (environment or Kettle) variables and put them in field values.
Google Analytics Input Fetches data from a Google Analytics account
Google Docs Input Input   
Greenplum Bulk Loader Deprecated Greenplum Bulk Loader
Greenplum Load Bulk loading  Greenplum Load 
Group by Statistics Builds aggregates in a group-by fashion. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are grouped correctly.
GZIP CSV Input Input Parallel GZIP CSV file input reader
Hadoop File Input Big Data Read data from a variety of different text-file types stored on a Hadoop cluster
Hadoop File Output Big Data Write data to a variety of different text-file types stored on a Hadoop cluster
HBase input Big Data Read from an HBase column family
HBase output Big Data Write to an HBase column family
HBase Row Decoder Big Data Decodes an incoming key and HBase result object according to a mapping
HL7 Input Input  Read data from HL7 data streams.
HTTP client Lookup Call a web service over HTTP by supplying a base URL and allowing parameters to be set dynamically
HTTP Post Lookup Call a web service request over HTTP by supplying a base URL and allowing parameters to be set dynamically
IBM Websphere MQ Consumer Input  Receive messages from any IBM Websphere MQ Server
IBM Websphere MQ Producer Output  Send messages to any IBM Websphere MQ Server
Identify last row in a stream Flow Last row will be marked
If field value is null Utility Sets a field value to a constant if it is null.
Infobright Loader Bulk loading Load data to an Infobright database table
Ingres VectorWise Bulk Loader Bulk loading This step interfaces with the Ingres VectorWise Bulk Loader "COPY TABLE" command.
Injector Inline Injector step that allows you to inject rows into the transformation through the Java API
Insert / Update Output Update or insert rows in a database based upon keys.
Java Filter Flow Filter rows using Java code
JMS Consumer Input  Receive messages from a JMS server
JMS Producer Output  Send messages to a JMS server
Job Executor Flow  This step executes a Pentaho Data Integration Job, passes parameters and rows. 
Join Rows (cartesian product)  Joins The output of this step is the cartesian product of the input streams. The number of rows is the multiplication of the number of rows in the input streams.
JSON Input Input Extract relevant portions out of JSON structures (file or incoming field) and output rows
JSON output Output Create a JSON block and output it in a field or a file.
Knowledge Flow Data Mining Executes a Knowledge Flow data mining process
LDAP Input Input Read data from LDAP host
LDAP Output Output Perform Insert, upsert, update, add or delete operations on records based on their DN (Distinguished Name).
LDIF Input Input Read data from LDIF files
Load file content in memory Input Load file content in memory
LucidDB Streaming Loader Deprecated Load data into LucidDB by using Remote Rows UDX.
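The Group by entry above notes that the step works only on sorted input; otherwise only consecutive duplicates are grouped. A minimal sketch of that streaming semantics, using Python's `itertools.groupby` (which merges only consecutive equal keys, the same caveat as the step) on hypothetical rows:

```python
from itertools import groupby

rows = [("a", 1), ("b", 2), ("a", 3)]  # NOT sorted on the grouping key

# Streaming group-by: only consecutive identical keys are merged,
# so the unsorted input yields a split "a" group.
streamed = [(k, sum(v for _, v in g)) for k, g in groupby(rows, key=lambda r: r[0])]
print(streamed)  # [('a', 1), ('b', 2), ('a', 3)]

# Sorting first (like placing a Sort rows step in front) fixes it.
sorted_rows = sorted(rows, key=lambda r: r[0])
correct = [(k, sum(v for _, v in g)) for k, g in groupby(sorted_rows, key=lambda r: r[0])]
print(correct)  # [('a', 4), ('b', 2)]
```

The Memory Group by step avoids this requirement by holding all groups in memory, at the cost of memory usage proportional to the number of distinct keys.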

Steps: M - R

Name Category  Description
Mail Utility Send eMail.
Mail Validator Validation Check if an email address is valid.
Mapping (sub-transformation) Mapping Run a mapping (sub-transformation), use MappingInput and MappingOutput to specify the fields interface
Mapping input specification Mapping Specify the input interface of a mapping
Mapping output specification Mapping Specify the output interface of a mapping
MapReduce Input Big Data Key Value pairs enter here from Hadoop MapReduce
MapReduce Output Big Data Key Value pairs exit here and are pushed into Hadoop MapReduce
MaxMind GeoIP Lookup Lookup Lookup an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization.
Memory Group by Statistics Builds aggregates in a group by fashion. This step doesn't require sorted input.
Merge Join Joins Joins two streams on a given key and outputs a joined set. The input streams must be sorted on the join key
Merge Rows (diff) Joins Merge two streams of rows, sorted on a certain key. The two streams are compared and the equals, changed, deleted and new rows are flagged.
Metadata structure of stream Utility This is a step to read the metadata of the incoming stream.
Microsoft Access Input  Input Read data from a Microsoft Access file
Microsoft Access Output Output Stores records into an MS-Access database table.
Microsoft Excel Input Input Read data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS).
Microsoft Excel Output Output Stores records into an Excel (XLS) document with formatting information.
Microsoft Excel Writer Output Writes or appends data to an Excel file
Modified Java Script Value Scripting This step allows the execution of JavaScript programs (and much more)
Mondrian Input Input Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian)
MonetDB Agile Mart Agile   
MonetDB Bulk Loader Bulk loading Load data into MonetDB by using their bulk load command in streaming mode.
MongoDB Input Big Data Reads all entries from a MongoDB collection in the specified database.
MongoDB Output Big Data Write to a MongoDB collection.
Multiway Merge Join Joins Multiway Merge Join
MySQL Bulk Loader Bulk loading MySQL bulk loader step, loading data over a named pipe (not available on MS Windows)
Null if... Utility Sets a field value to null if it is equal to a constant value
Number range Transform Create ranges based on numeric field
OLAP Input Input Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j
OpenERP Object Delete Delete  Deletes data from the OpenERP server using the XMLRPC interface with the 'unlink' function. 
OpenERP Object Input Input Retrieves data from the OpenERP server using the XMLRPC interface with the 'read' function.
OpenERP Object Output Output  Updates data on the OpenERP server using the XMLRPC interface and the 'import' function 
Oracle Bulk Loader Bulk loading Use Oracle Bulk Loader to load data
Output steps metrics Statistics Return metrics for one or several steps
Palo Cell Input Input  Retrieves all cell data from a Palo cube 
Palo Cell Output Output  Updates cell data in a Palo cube 
Palo Dimension Input Input  Returns elements from a dimension in a Palo database 
Palo Dimension Output Output  Creates/updates dimension elements and element consolidations in a Palo database 
Pentaho Reporting Output Output Executes an existing report (PRPT)
PostgreSQL Bulk Loader Bulk loading PostgreSQL Bulk Loader
Prioritize streams Flow Prioritize streams in an ordered way.
Process files Utility Process one file per row (copy, move, or delete). This step only accepts a filename as input.
Properties Output Output Write data to properties file
Property Input Input Read data (key, value) from properties files.
R script executor Statistics Executes an R script within a PDI transformation
Regex Evaluation Scripting Regular expression Evaluation. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups.
Replace in string Transform Replace all occurrences of a word in a string with another word.
Reservoir Sampling Statistics Samples a fixed number of rows from the incoming stream
REST Client Lookup Consume RESTful services. REpresentational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources that can be identified by their URLs
Row denormaliser Transform Denormalises rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields
Row flattener Transform Flattens consecutive rows based on the order in which they appear in the input stream
Row Normaliser Transform De-normalised information can be normalised using this step type.
RSS Input Input Read RSS feeds
RSS Output Output Read RSS stream.
Rule Executor Scripting Execute a rule against each row (using Drools)
Rule Accumulator Scripting Execute a rule against a set of rows (using Drools)
Run SSH commands Utility Run SSH commands and return the result.
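The Regex Evaluation entry above describes extracting new fields from an existing field with capturing groups. A minimal sketch of that idea, using a hypothetical "city-code" pattern and Python's `re` module (illustrative only, not the step's implementation):

```python
import re

# Hypothetical pattern: split "City-12345" strings into two new fields,
# the way Regex Evaluation maps capturing groups onto output fields.
pattern = re.compile(r"^(?P<city>[A-Za-z ]+)-(?P<code>\d{5})$")

def evaluate(value: str):
    """Return (matched?, city, code); unmatched rows yield None fields."""
    m = pattern.match(value)
    if m is None:
        return (False, None, None)
    return (True, m.group("city"), m.group("code"))

print(evaluate("Orlando-32801"))  # (True, 'Orlando', '32801')
print(evaluate("bad input"))      # (False, None, None)
```

In the step itself, the match flag typically feeds a downstream Filter Rows or Switch / Case step to route unmatched rows.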

Steps: S - Z

Name Category  Description
S3 CSV Input Input S3 CSV Input
S3 File Output Output Exports data to a text file on an Amazon Simple Storage Service (S3)
Salesforce Delete Output Delete records in Salesforce module.
Salesforce Input Input Reads information from SalesForce 
Salesforce Insert Output Insert records in Salesforce module.
Salesforce Update Output Update records in Salesforce module.
Salesforce Upsert Output Insert or update records in Salesforce module.
Sample rows Statistics Filter rows based on the line number.
SAP Input Input Read data from SAP ERP, optionally with parameters
SAS Input Input This step reads files in sas7bdat (SAS) native format
Script Experimental  
Secret key generator Experimental Generate secret keys for algorithms such as DES, AES, TripleDES.
Select values  Transform Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision.
Send message to Syslog Utility Send message to Syslog server
Serialize to file Output Write rows of data to a data cube
Set field value Transform Replace value of a field with another value field
Set field value to a constant Transform Replace value of a field to a constant
Set files in result Job This step allows you to set filenames in the result of this transformation. Subsequent job entries can then use this information.
Set Session Variables BA Server  Allows you to set the value of session variable
Set Variables Job Set environment variables based on a single input row.
SFTP Put Experimental  
Simple Mapping Mapping  Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation).
Single Threader Flow Executes a transformation snippet in a single thread. You need a standard mapping or a transformation with an Injector step where data from the parent transformation will arrive in blocks.
Socket reader Inline Socket reader. A socket client that connects to a server (Socket Writer step).
Socket writer Inline Socket writer. A socket server that can send rows of data to a socket reader.
Sort rows Transform Sort rows based upon field values (ascending or descending)
Sorted Merge  Joins Sorted Merge
Split field to rows Transform Splits a single string field by delimiter and creates a new row for each split term
Split Fields Transform When you want to split a single field into more than one, use this step type.
Splunk Input Transform Reads data from Splunk.
Splunk Output Transform Writes data to Splunk.
SQL File Output Output Output SQL INSERT statements to file
Stream lookup Lookup Look up values coming from another stream in the transformation.
SSTable Output Big Data Writes to a filesystem directory as a Cassandra SSTable
String operations Transform Apply certain operations like trimming, padding and others to string value.
Strings cut Transform Strings cut (substring).
Switch / Case Flow Switch a row to a certain target step based on the case value in a field.
Symmetric Cryptography Experimental Encrypt or decrypt a string using symmetric encryption. Available algorithms are DES, AES, TripleDES.
Synchronize after merge Output This step performs insert/update/delete in one go based on the value of a field.
Table Agile Mart Agile   
Table Compare Utility This step compares the data from two tables (provided they have the same layout). It finds differences between the data in the two tables and logs them.
Table exists Lookup Check if a table exists on a specified connection
Table input Input Read information from a database table.
Table output  Output Write information to a database table
Teradata Fastload Bulk Loader Bulk loading The Teradata Fastload Bulk loader
Teradata TPT Insert Upsert Bulk Loader Bulk loading Bulk loading via TPT using the tbuild command.
Text file input Input Read data from a text file in several formats. This data can then be passed on to the next step(s)...
Text file output Output Write rows to a text file.
Transformation Executor Flow This step executes a Pentaho Data Integration transformation, passing parameters and rows.
Unique rows Transform Remove duplicate rows and keep only unique occurrences. This works only on sorted input. If the input is not sorted, only consecutive duplicates are removed.
Unique rows (HashSet) Transform Remove duplicate rows and keep only unique occurrences by using a HashSet.
Univariate Statistics Statistics This step computes some simple stats based on a single input field
Update Output Update data in a database table based upon keys
User Defined Java Class Scripting This step allows you to program a step using Java code
User Defined Java Expression Scripting Calculate the result of a Java Expression using Janino
Value Mapper Transform Maps values of a certain field from one value to another
Vertica Bulk Loader Bulk loading Bulk loads data into a Vertica table using their high performance COPY feature 
Web services lookup Lookup Look up information using web services (WSDL)
Write to log  Utility Write data to log
XBase input Input Reads records from an XBase type of database file (DBF)
XML Input Stream (StAX) Input This step is capable of processing very large and complex XML files very fast.
XML Join Joins Joins a stream of XML-Tags into a target XML string
XML Output Output Write data to an XML file
XSD Validator Validation Validate XML source (files or streams) against XML Schema Definition.
XSL Transformation Transform Transform XML stream using XSL (eXtensible Stylesheet Language).
Yaml Input Input Read a YAML source (file or stream), parse it, convert it to rows, and write these to one or more outputs.
Zip File Utility  Creates a standard ZIP archive from the data stream fields 
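The Split field to rows entry above turns one input row into one output row per delimiter-separated term. A minimal sketch of that behaviour on hypothetical rows (field names and delimiter are illustrative, not the step's defaults):

```python
def split_field_to_rows(rows, field, delimiter=";"):
    """For each input row, emit one output row per delimiter-separated
    term in the chosen field, copying all other fields unchanged."""
    for row in rows:
        for term in row[field].split(delimiter):
            out = dict(row)   # copy the row so each output is independent
            out[field] = term
            yield out

rows = [{"id": 1, "tags": "red;green"}, {"id": 2, "tags": "blue"}]
print(list(split_field_to_rows(rows, "tags")))
# [{'id': 1, 'tags': 'red'}, {'id': 1, 'tags': 'green'}, {'id': 2, 'tags': 'blue'}]
```

The Row Normaliser and Row flattener steps cover the related cases of pivoting repeated columns into rows and collapsing consecutive rows into columns.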

Replacements for Deprecated and Removed Steps

The following steps are deprecated or have been removed. Here are suggested replacements for them.

Removed or Deprecated Step Replacement Step
Aggregate Rows Step Group By Step
Tip:  In the Group By step, leave the Group Field section blank so that the Group By step will aggregate over all rows. 
Get Previous Row Fields Step Analytic Query Step
Tip:  In the Analytic Functions section of the Analytic Query step, select the LAG "N" rows BACKWARD in get Subject option in the Type field.
Google Analytics Input Step Google Analytics Step
Greenplum Bulk Loader Step Greenplum Load Step
LucidDB Bulk Loader Step Table Output Step
Tip: Note that LucidDB is EOL (end of life).
Streaming XML Input Step XML Input Stream Step or Get data from XML Step
XML Input Step XML Input Stream Step or Get data from XML Step