Pentaho Documentation

Transformation step reference

Steps extend and expand the functionality of PDI transformations. This page contains the list of supported steps.

Steps: A - F

AbortFlowAbort a transformation.
Add a ChecksumTransformAdd a checksum column for each input row.
Add constantsTransformAdd one or more constants to the input rows.
Add sequenceTransformGet the next value from a sequence.
Add value fields changing sequenceTransformAdd a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.
Add XMLTransformEncode several fields into an XML fragment.
AMQP ConsumerStreamingPull streaming data from an AMQP broker or clients through an AMQP transformation.
AMQP ProducerStreamingPublish messages in near-real-time to an AMQP broker.
Analytic queryStatisticsExecute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST).
Annotate streamFlowRefine your data for the Streamlined Data Refinery by creating measures, link dimensions, or attributes on stream fields.
Append streamsFlowAppend two streams in an ordered way.
ARFF outputData MiningWrite data in ARFF format to a file.
Automatic Documentation OutputOutputGenerate documentation automatically based on input in the form of a list of transformations and jobs.
Avro InputBig DataDecode binary or JSON Avro data and extract fields from the structure it defines, either from flat files or incoming fields.
Avro input (deprecated)DeprecatedReplaced by Avro Input.
Avro OutputBig DataSerialize data from the PDI data stream into Avro binary or JSON format, then write it to a file.
Block this step until steps finishFlowBlock this step until selected steps finish.
Blocking stepFlowBlock flow until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.
CalculatorTransformCreate new fields by performing simple calculations.
Call DB ProcedureLookupGet back information by calling a database procedure.
Call Endpoint Pentaho Server Call API endpoints from the Pentaho Server within a PDI transformation.
Cassandra InputBig DataRead from a Cassandra column family.
Cassandra OutputBig DataWrite to a Cassandra column family.
Catalog InputCatalogReads the payload of a resource registered with the Lumada Data Catalog and outputs the payload in the form of rows to be used by a transformation.
Catalog OutputCatalogCreates a new Lumada Data Catalog resource or updates an existing resource and saves the metadata associated with this resource.
Change file encodingUtilityChange file encoding and create a new file.
Check if a column existsLookupCheck if a column exists in a table on a specified connection.
Check if file is lockedLookupCheck if a file is locked by another process.
Check if webservice is availableLookupCheck if a webservice is available.
Clone rowUtilityClone a row as many times as needed.
Closure GeneratorTransformGenerate a closure table using parent-child relationships.
Combination lookup/updateData WarehouseUpdate a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all the fields.
Concat FieldsTransformConcatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.
Copybook InputInputReads binary data files that are mapped by a fixed-length COBOL copybook definition file.
Copy rows to resultJobWrite rows to the executing job. The information will then be passed to the next entry in this job.
CouchDB InputBig DataRetrieve all documents from a given view in a given design document from a given database.
Credit card validatorValidationDetermine whether a credit card number is valid (using the LUHN10 (MOD-10) algorithm) and which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), and so on).
CSV File InputInputRead from a simple CSV file input.
Data GridInputEnter rows of static data in a grid, usually for testing, reference, or demo purposes.
Data ValidatorValidationValidate data passing through against a set of rules.
Database joinLookupExecute a database query using stream values as parameters.
Database lookupLookupLook up values in a database using field values.
De-serialize from fileInputRead rows of data from a data cube.
Delay rowUtilityOutput each input row after a delay.
DeleteOutputPermanently removes a row from a database.
Detect empty streamFlowOutput one empty row if the input stream is empty (i.e., when the input stream does not contain any rows).
Dimension lookup/updateData WarehouseUpdate a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.
Dummy (do nothing)FlowDoes not do anything. It is useful, however, when testing things or in certain situations where you want to split streams.
Dynamic SQL rowLookupExecute a dynamic SQL statement built in a previous field.
Edi to XMLUtilityConvert an Edifact message to XML to simplify data extraction.
ElasticSearch Bulk InsertBulk loadingPerform bulk inserts into ElasticSearch.
Email messages inputInputRead POP3/IMAP server and retrieve messages.
ESRI Shapefile ReaderInputRead shape file data from an ESRI shape file and linked DBF file.
ETL metadata injectionFlowInject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.
Example step (deprecated)DeprecatedAn example of a plugin test step.
Execute a processUtilityExecute a process and return the result.
Execute Row SQL ScriptScriptingExecute an SQL statement or file for every input row.
Execute SQL ScriptScriptingExecute an SQL script, optionally parameterized using input rows.
File exists (Step)LookupCheck if a file exists.
Filter RowsFlowFilter rows using simple equations.
Fixed file inputInputRead from a fixed file input.
FormulaScriptingCalculate a formula using Pentaho's libformula.
Fuzzy matchLookupFind approximate matches to a string using matching algorithms. Read a field from the main stream and output the approximate matching value from the lookup stream.
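The Credit card validator above relies on the LUHN10 (MOD-10) check digit algorithm. A minimal Python sketch of that check (illustrative only, not PDI's actual implementation):

```python
def luhn_valid(number: str) -> bool:
    """Check a numeric string with the LUHN10 (MOD-10) algorithm."""
    digits = [int(d) for d in number if d.isdigit()]
    # Starting from the second-to-last digit, double every second digit;
    # subtract 9 when the doubled value exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    # The number is valid when the digit sum is a multiple of 10.
    return sum(digits) % 10 == 0
```

For example, `luhn_valid("4111111111111111")` returns `True` for that well-known Visa test number, while changing the last digit makes it `False`.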

Steps: G - L

Name Category Description
Generate random credit card numbersInputGenerate random valid (Luhn check) credit card numbers.
Generate random valueInputGenerate random value.
Generate RowsInputGenerate a number of empty or equal rows.
Get data from XMLInputGet data from XML file by using XPath. This step also allows you to parse XML defined in a previous field.
Get File NamesInputGet file names from the operating system and send them to the next step.
Get files from resultJobRead filenames used or generated in a previous entry in a job.
Get Files Rows CountInputCount the rows in one or more files.
Get ID from slave serverTransformRetrieve unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.
Get records from streamStreamingReturn records that were previously generated by another transformation in a job.
Get repository namesInputList detailed information about transformations and/or jobs in a repository.
Get rows from resultJobRead rows from a previous entry in a job.
Get Session VariablesPentaho Server Retrieve the value of a session variable.
Get SubFolder namesInputRead a parent folder and return all subfolders.
Get System InfoInputGet information from the system like system date, arguments, etc.
Get table namesInputGet table names from database connection and send them to the next step.
Get VariablesJobDetermine the values of certain (environment or Kettle) variables and put them in field values.
Google AnalyticsInputFetch data from a Google Analytics account.
Greenplum Bulk Loader (deprecated)DeprecatedBulk load Greenplum data. Replacement step is Greenplum Load.
Greenplum LoadBulk loading Bulk load Greenplum data.
Group ByStatisticsBuild aggregates in a group-by fashion. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly.
GZIP CSV InputInputRead in parallel from a GZIP CSV file.
Hadoop File InputBig DataRead data from a variety of different text-file types stored on a Hadoop cluster.
Hadoop File OutputBig DataWrite data to a variety of different text-file types stored on a Hadoop cluster.
HBase InputBig DataRead from an HBase column family.
HBase OutputBig DataWrite to an HBase column family.
HBase row decoderBig DataDecodes an incoming key and HBase result object to a mapping.
HL7 InputInput Read data from HL7 data streams.
HTTP clientLookupCall a web service over HTTP by supplying a base URL, optionally with parameters set dynamically.
HTTP PostLookupCall a web service request over HTTP by supplying a base URL, optionally with parameters set dynamically.
IBM WebSphere MQ Consumer (deprecated)DeprecatedReceive messages from any IBM WebSphere MQ Server.
IBM WebSphere MQ Producer (deprecated)DeprecatedSend messages to any IBM WebSphere MQ Server.
Identify last row in a streamFlowMark the last row.
If field value is nullUtilitySet a field value to a constant if it is null.
Infobright LoaderBulk loadingLoad data to an Infobright database table.
Ingres VectorWise Bulk LoaderBulk loadingInterface with the Ingres VectorWise Bulk Loader "COPY TABLE" command.
InjectorInlineInject rows into the transformation through the Java API.
Insert / UpdateOutputUpdate or insert rows in a database based upon keys.
Java FilterFlowFilter rows using Java code.
JMS ConsumerStreamingReceive messages from a JMS server.
JMS consumer (deprecated)DeprecatedReplaced by JMS Consumer.
JMS ProducerStreamingSend messages to a JMS server.
JMS producer (deprecated)DeprecatedReplaced by JMS Producer.
Job ExecutorFlow Run a PDI job, passing parameters and rows.
Join Rows (cartesian product) JoinsOutput the cartesian product of the input streams. The number of output rows is the product of the row counts of the input streams.
JSON InputInputExtract relevant portions out of JSON structures (file or incoming field) and output rows.
JSON outputOutputCreate JSON block and output it in a field to a file.
Kafka ConsumerStreamingRun a sub-transformation that executes according to message batch size or duration, letting you process a continuous stream of records in near-real-time.
Kafka ProducerStreamingPublish messages in near-real-time across worker nodes where multiple, subscribed members have access.
Kinesis ConsumerStreamingExtract data from a specific stream located within the Amazon Kinesis Data Streams service.
Kinesis ProducerStreamingPush data to an existing region and stream located within the Amazon Kinesis Data Streams service.
Knowledge FlowData MiningExecutes a Knowledge Flow data mining process.
LDAP InputInputRead data from LDAP host.
LDAP OutputOutputPerform insert, upsert, update, add, or delete operations on records based on their DN (Distinguished Name).
LDIF InputInputRead data from LDIF files.
Load file content in memoryInputLoad file content in memory.
LucidDB streaming loader (deprecated)DeprecatedLoad data into LucidDB by using Remote Rows UDX.
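The sorted-input caveat on Group By means aggregation happens over consecutive runs of equal keys: equal keys that are not adjacent form separate groups. A minimal Python sketch of that behavior (the `group_sums` helper and the key/value row layout are hypothetical, chosen only to illustrate the semantics):

```python
from itertools import groupby

def group_sums(rows):
    """Sum the value per key over consecutive runs, the way a
    sorted-input group-by does: equal keys must be adjacent to
    land in a single group."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(rows, key=lambda row: row[0])]

rows = [("a", 1), ("b", 2), ("a", 3)]
group_sums(rows)          # "a" is split: [("a", 1), ("b", 2), ("a", 3)]
group_sums(sorted(rows))  # one group per key: [("a", 4), ("b", 2)]
```

Sorting first makes equal keys adjacent, which is exactly what the Sort rows step provides ahead of Group By.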

Steps: M - R

Name Category Description
MailUtilitySend e-mail.
Mail ValidatorValidationCheck if an email address is valid.
MappingMappingRun a mapping (sub-transformation), use MappingInput and MappingOutput to specify the fields interface.
Mapping Input Specification MappingSpecify the input interface of a mapping.
Mapping Output SpecificationMappingSpecify the output interface of a mapping.
MapReduce InputBig DataEnter Key Value pairs from Hadoop MapReduce.
MapReduce OutputBig DataOutput key-value pairs and push them into Hadoop MapReduce.
MaxMind GeoIP LookupLookupLookup an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization.
Memory Group ByStatisticsBuild aggregates in a group by fashion. This step doesn't require sorted input.
Merge JoinJoinsJoin two streams on a given key and outputs a joined set. The input streams must be sorted on the join key.
Merge Rows (diff)JoinsMerge two streams of rows, sorted on a certain key. The two streams are compared and the equals, changed, deleted and new rows are flagged.
Metadata structure of streamUtilityRead the metadata of the incoming stream.
Microsoft Access Input InputRead data from a Microsoft Access file.
Microsoft Access OutputOutputStore records into an MS-Access database table.
Microsoft Excel InputInputRead data from Excel and OpenOffice Workbooks (XLS, XLSX, ODS).
Microsoft Excel OutputOutputStore records into an Excel (XLS) document with formatting information.
Microsoft Excel WriterOutputWrite or append data to an Excel file.
Modified Java Script ValueScriptingRun JavaScript programs (and much more).
Mondrian InputInputExecute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian).
MonetDB Agile MartAgile
MonetDB Bulk LoaderBulk loadingLoad data into MonetDB by using their bulk load command in streaming mode.
MongoDB InputBig DataRead all entries from a MongoDB collection in the specified database.
MongoDB OutputBig DataWrite to a MongoDB collection.
MQTT ConsumerStreamingPull streaming data from an MQTT broker or clients through an MQTT transformation.
MQTT ProducerStreamingPublish messages in near-real-time to an MQTT broker.
Multiway Merge JoinJoinsJoin multiple streams. This step supports INNER and FULL OUTER joins.
MySQL Bulk LoaderBulk loadingLoad data over a named pipe (not available on MS Windows).
Null if...UtilitySet a field value to null if it is equal to a constant value.
Number rangeTransformCreate ranges based on numeric field.
OLAP InputInputExecute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j.
OpenERP object delete (deprecated)Deprecated Delete data from the OpenERP server using the XMLRPC interface with the 'unlink' function.
OpenERP object input (deprecated)DeprecatedRetrieve data from the OpenERP server using the XMLRPC interface with the 'read' function.
OpenERP object output (deprecated)DeprecatedUpdate data on the OpenERP server using the XMLRPC interface and the 'import' function.
Oracle Bulk LoaderBulk loadingUse Oracle Bulk Loader to load data.
ORC InputBig DataRead fields data from ORC files into a PDI data stream.
ORC OutputBig DataSerialize data from the PDI data stream into an ORC file format and writes it to a file.
Output steps metricsStatisticsReturn metrics for one or several steps.
Palo cell input (deprecated)DeprecatedRetrieve all cell data from a Palo cube.
Palo cell output (deprecated)DeprecatedUpdate cell data in a Palo cube.
Palo dim input (deprecated)DeprecatedReturn elements from a dimension in a Palo database.
Palo dim output (deprecated)DeprecatedCreate/update dimension elements and element consolidations in a Palo database.
Parquet InputBig DataDecode Parquet data formats and extracts fields from the structure it defines.
Parquet OutputBig DataMap fields within data files and choose where you want to process those files.
Pentaho Reporting OutputOutputExecute an existing report file (.prpt).
PostgreSQL Bulk LoaderBulk loadingBulk load PostgreSQL data.
Prioritize streamsFlowPrioritize streams in an ordered way.
Process filesUtilityProcess one file per row (copy, move, or delete). This step only accepts a filename as input.
Properties OutputOutputWrite data to properties file.
Property InputInputRead data (key, value) from properties files.
Python ExecutorScriptingMap upstream data from a PDI input step or execute a Python script to generate data. When you send all rows, Python stores the dataset in a variable that is available when your script runs.
Query HCPInputUses the Metadata Query Engine (MQE) to query your Hitachi Content Platform (HCP) repository for objects, their URLs, and system metadata properties.
R script executorStatisticsExecute an R script within a PDI transformation.
Read metadata from HCPInputIdentifies an HCP object by its URL path then specifies a target annotation name to read.
Read metadata from CopybookInputReads a binary fixed-length copybook definition file and outputs the file and column descriptor information as fields to PDI rows.
Read metadataCatalogReads metadata from the Lumada Data Catalog (LDC).
Regex EvaluationScriptingEvaluate regular expressions. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups.
Replace in StringTransformReplace all occurrences of a word in a string with another word.
Reservoir SamplingStatisticsSample a fixed number of rows from the incoming stream.
REST ClientLookupConsume RESTful services. REpresentational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which web services are viewed as resources and can be identified by their URLs.
Row DenormaliserTransformDenormalise rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.
Row FlattenerTransformFlatten consecutive rows based on the order in which they appear in the input stream.
Row NormaliserTransformNormalise de-normalised information.
RSS InputInputRead RSS feeds.
RSS OutputOutputWrite RSS streams to a file.
Rule ExecutorScriptingExecute a rule against each row (using Drools).
Rule AccumulatorScriptingExecute a rule against a set of rows (using Drools).
Run SSH commandsUtilityRun SSH commands and return the result.
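The Reservoir Sampling step above draws a fixed-size uniform sample from a stream of unknown length. The classic technique is Algorithm R, sketched here in Python as an illustration (not PDI's actual code):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k rows from an iterable of
    unknown length, using Algorithm R."""
    reservoir = []
    for i, row in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i survives with probability k/(i+1): pick a random
            # slot in [0, i] and replace only if it falls inside the
            # reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because only the k-row reservoir is held in memory, the stream itself can be arbitrarily long, which is why the step suits continuous inputs.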

Steps: S - Z

Name Category Description
S3 CSV InputInputRead from an S3 CSV file.
S3 File OutputOutputExport data to a text file on an Amazon Simple Storage Service (S3).
Salesforce DeleteOutputDelete records in a Salesforce module.
Salesforce InputInputRead information from Salesforce.
Salesforce InsertOutputInsert records in a Salesforce module.
Salesforce UpdateOutputUpdate records in a Salesforce module.
Salesforce UpsertOutputInsert or update records in a Salesforce module.
Sample rowsStatisticsFilter rows based on the line number.
SAP input (deprecated)DeprecatedRead data from SAP ERP, optionally with parameters.
SAS InputInputRead files in the sas7bdat (SAS) native format.
ScriptExperimentalCalculate values by scripting in Ruby, Python, Groovy, JavaScript, and other scripting languages.
Select ValuesTransformSelect or remove fields in a row. Optionally, set the field meta-data: type, length and precision.
Send message to SyslogUtilitySend message to Syslog server.
Serialize to fileOutputWrite rows of data to a data cube.
Set Field ValueTransformReplace value of a field with another value field.
Set Field Value to a ConstantTransformReplace value of a field to a constant.
Set files in resultJobSet filenames in the result of this transformation. Subsequent job entries can then use this information.
Set Session VariablesPentaho Server Set the value of a session variable.
Set VariablesJobSet environment variables based on a single input row.
SFTP PutExperimentalUpload a file or a stream file to a remote host via SFTP.
Shared dimensionFlowRefine your data for the Streamlined Data Refinery through the creation of dimensions which can be shared.
Simple Mapping (sub-transformation)Mapping Turn a repetitive, re-usable part of a transformation (a sequence of steps) into a mapping (sub-transformation).
Single ThreaderFlowExecute a sequence of steps in a single thread.
Socket readerInlineRead a socket. A socket client that connects to a server (Socket Writer step).
Socket writerInlineWrite a socket. A socket server that can send rows of data to a socket reader.
Sort rowsTransformSort rows based upon field values (ascending or descending).
Sorted Merge JoinsMerge rows coming from multiple input steps, provided these rows are sorted on the given key fields.
Split field to rowsTransformSplit a single string field by delimiter and creates a new row for each split term.
Split FieldsTransformSplit a single field into more than one.
Splunk InputTransformRead data from Splunk.
Splunk OutputTransformWrite data to Splunk.
SQL File OutputOutputOutput SQL INSERT statements to a file.
Stream lookupLookupLook up values coming from another stream in the transformation.
SSTable OutputBig DataWrite to a filesystem directory as a Cassandra SSTable.
String OperationsTransformApply operations such as trimming and padding to string values.
Strings cutTransformCut out a snippet of a string.
Switch / CaseFlowSwitch a row to a certain target step based on the case value in a field.
Synchronize after mergeOutputPerform insert/update/delete in one go based on the value of a field.
Table Agile MartAgile
Table CompareUtility Compare the data from two tables (provided they have the same layout). It finds differences between the data in the two tables and logs them.
Table existsLookupCheck if a table exists on a specified connection.
Table InputInputRead information from a database table.
Table OutputOutputWrite information to a database table.
Teradata Fastload Bulk LoaderBulk loadingBulk load Teradata Fastload data.
Teradata TPT Insert Upsert Bulk LoaderBulk loadingBulk load via TPT using the tbuild command.
Text File InputInputRead data from a text file in several formats. This data can then be passed to your next step(s).
Text file input (deprecated)DeprecatedReplaced by Text File Input.
Text File OutputOutputWrite rows to a text file.
Text file output (deprecated)DeprecatedReplaced by Text File Output.
Transformation ExecutorFlow Run a PDI transformation, sets parameters, and passes rows.
Unique RowsTransformRemove duplicate rows and leave only unique occurrences. This works only on sorted input. If the input is not sorted, only consecutive duplicate rows are handled correctly.
Unique Rows (HashSet)TransformRemove duplicate rows and leave only unique occurrences by using a HashSet.
Univariate StatisticsStatisticsCompute simple statistics based on a single input field.
UpdateOutputUpdate data in a database table based upon keys.
User Defined Java ClassScriptingProgram a step using Java code.
User Defined Java ExpressionScriptingCalculate the result of a Java Expression using Janino.
Value MapperTransformMap values of a certain field from one value to another.
Vertica Bulk LoaderBulk loadingBulk load data into a Vertica table using their high performance COPY feature.
Web services lookupLookupLook up information using web services (WSDL).
Write metadata to HCP objectsOutputWrite custom metadata fields to a Hitachi Content Platform object.
Write metadataCatalogWrites metadata to the Lumada Data Catalog (LDC).
Write to log UtilityWrite data to log.
XBase inputInputRead records from an XBase type of database file (DBF).
XML Input Stream (StAX)InputProcess very large and complex XML files very fast.
XML JoinJoinsJoin a stream of XML-Tags into a target XML string.
XML OutputOutputWrite data to an XML file.
XSD ValidatorValidationValidate XML source (files or streams) against XML Schema Definition.
XSL TransformationTransformTransform XML stream using XSL (eXtensible Stylesheet Language).
Yaml InputInputRead YAML sources (files or streams), parse them, convert them to rows, and write these to one or more outputs.
Zip FileUtility Create a standard ZIP archive from the data stream fields.
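The two Unique Rows variants above trade memory for a sorting requirement: the sorted variant compares only consecutive rows, while the HashSet variant remembers every row it has seen. An illustrative Python sketch of the difference (the helper names are hypothetical):

```python
def unique_sorted(rows):
    """Drop duplicates assuming equal rows are consecutive
    (i.e., the input is sorted). Constant memory."""
    out, prev = [], object()  # sentinel never equals a real row
    for row in rows:
        if row != prev:
            out.append(row)
        prev = row
    return out

def unique_hashset(rows):
    """Drop duplicates in any order by remembering every row seen.
    Memory grows with the number of distinct rows."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

rows = ["a", "b", "a"]
unique_sorted(rows)   # only consecutive duplicates collapse: ["a", "b", "a"]
unique_hashset(rows)  # all duplicates collapse: ["a", "b"]
```

This is why the plain Unique Rows step is usually preceded by a Sort rows step, whereas the HashSet variant is not.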