
Embedding Pentaho Data Integration

You can get the accompanying sample project from the kettle-sdk-embedding-samples folder of the sample code package. The sample project is bundled with a minimal set of dependencies. In a real-world implementation, projects require the complete set of PDI dependencies that include all .jar files from data-integration/lib.

For each embedding scenario, there is a sample class that can be executed as a stand-alone Java application. You can execute the classes manually or run the Ant targets provided in build/build.xml to run the sample classes.

Running Transformations

The org.pentaho.di.sdk.samples.embedding.RunningTransformations class is an example of how to run a PDI transformation from Java code in a stand-alone application. This class sets the parameters and executes the transformation in etl/parametrized_transformation.ktr. The transformation can be run from the .ktr file using runTransformationFromFileSystem() or from a PDI repository using runTransformationFromRepository(). Important considerations:

  • Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
  • Prepare the transformation: The definition of a PDI transformation is represented by a TransMeta object. You can load this object from a .ktr file, a PDI repository, or you can generate it dynamically. To query the declared parameters of the transformation definition use listParameters(). To set the assigned values use setParameterValue().
  • Execute the transformation: An executable Trans object is derived from the TransMeta object that is passed to the constructor. The Trans object starts and then executes asynchronously. To ensure that all steps of the Trans object have completed, call waitUntilFinished().
  • Evaluate the transformation outcome: After the Trans object completes, you can access the result using getResult(). The Result object can be queried for success by evaluating getNrErrors(). This method returns zero (0) on success and a non-zero value when there are errors. To get more information, retrieve the transformation log lines.

Running Jobs

The org.pentaho.di.sdk.samples.embedding.RunningJobs class is an example of how to run a PDI job from Java code in a stand-alone application. This class sets the parameters and executes the job in etl/parametrized_job.kjb. The job can be run from the .kjb file using runJobFromFileSystem() or from a repository using runJobFromRepository(). Important considerations:

  • Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
  • Prepare the job: The definition of a PDI job is represented by a JobMeta object. You can load this object from a .kjb file, a PDI repository, or you can generate it dynamically. To query the declared parameters of the job definition use listParameters(). To set the assigned values use setParameterValue().
  • Execute the job: An executable Job object is derived from the JobMeta object that is passed in to the constructor. The Job object starts, and then executes in a separate thread. To wait for the job to complete, call waitUntilFinished().
  • Evaluate the job outcome: After the Job completes, you can access the result by calling getResult() on the Job. The returned Result object can then be queried for success by calling its own getResult() method, which returns true on success and false on failure. To get more information, retrieve the job log lines.

Building Transformations Dynamically

The org.pentaho.di.sdk.samples.embedding.GeneratingTransformations class is an example of a dynamic transformation. This class generates a transformation definition and saves it to a .ktr file. Important considerations:

  • Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
  • Create and configure a transformation definition object: A transformation definition is represented by a TransMeta object. Create this object using the default constructor. The transformation definition includes the name, the declared parameters, and the required database connections. 
  • Populate the TransMeta object with steps: The data flow of a transformation is defined by steps that are connected by hops.
    1. Create the step by instantiating its class directly and configure it using its get and set methods. Transformation steps reside in sub-packages of org.pentaho.di.trans.steps. For example, to use the Get File Names step, create an instance of org.pentaho.di.trans.steps.getfilenames.GetFileNamesMeta and use its get and set methods to configure it.
    2. Obtain the step id string. Each PDI step has an id that can be retrieved from the PDI plugin registry. A simple way to retrieve the step id is to call PluginRegistry.getInstance().getPluginId(StepPluginType.class, theStepMetaObject)
    3. Create an instance of org.pentaho.di.trans.step.StepMeta, passing the step id string, the name, and the configured step object to the constructor. An instance of StepMeta encapsulates the step properties, as well as controls the placement of the step on the PDI client (Spoon) canvas and connections to hops. Once the StepMeta object has been created, call setDrawn(true) and setLocation(x,y) to make sure the step appears correctly on the PDI client canvas. Finally, add the step to the transformation, by calling addStep() on the transformation definition object.
    4. Once steps have been added to the transformation definition, they need to be connected by hops. To create a hop, create an instance of org.pentaho.di.trans.TransHopMeta, passing in the From and To steps as arguments to the constructor. Add the hop to the transformation definition by calling addTransHop().
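
Steps 3 and 4 above can be sketched for the simplest case, a linear chain of steps. The following is a minimal, standard-library-only sketch and not part of the PDI API: the LinearLayout class, its method names, and the origin and spacing values are hypothetical. It only computes the coordinates you might pass to setLocation(x,y) and the From/To pairs for which you would create a TransHopMeta and call addTransHop().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a PDI class): plans canvas positions and hops for a linear chain. */
final class LinearLayout {
    private LinearLayout() {}

    /** Position of the step at the given index when steps sit in one row, left to right. */
    static int[] locationOf(int index, int originX, int originY, int spacing) {
        return new int[] { originX + index * spacing, originY };
    }

    /** Consecutive (from, to) step-name pairs: one pair per hop needed to chain the steps in order. */
    static List<String[]> hopPairs(List<String> stepNames) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < stepNames.size(); i++) {
            pairs.add(new String[] { stepNames.get(i), stepNames.get(i + 1) });
        }
        return pairs;
    }
}
```

In an actual generator, each computed position would be applied to the corresponding StepMeta, and each pair would be resolved to its two StepMeta objects before the hop is created.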

After all steps have been added and connected by hops, the transformation definition object can be serialized by calling getXML() and writing the returned XML to a .ktr file, which can then be opened in the PDI client for inspection. The sample class org.pentaho.di.sdk.samples.embedding.GeneratingTransformations generates a transformation in this way.

Building Jobs Dynamically

The org.pentaho.di.sdk.samples.embedding.GeneratingJobs class is an example of a dynamic job. This class generates a job definition and saves it to a .kjb file. Important considerations:

  • Always make the first call to KettleEnvironment.init() whenever you are working with the PDI APIs.
  • Create and configure a job definition object: A job definition is represented by a JobMeta object. Create this object using the default constructor. The job definition includes the name, the declared parameters, and the required database connections. 
  • Populate the JobMeta object with job entries: The control flow of a job is defined by job entries that are connected by hops.
    1. Create the job entry by instantiating its class directly and configure it using its get and set methods. The job entries reside in sub-packages of org.pentaho.di.job.entries. For example, to use the File Exists job entry, create an instance of org.pentaho.di.job.entries.fileexists.JobEntryFileExists and use setFilename() to configure it. The Start job entry is implemented by org.pentaho.di.job.entries.special.JobEntrySpecial.
    2. Create an instance of org.pentaho.di.job.entry.JobEntryCopy by passing the job entry created in the previous step to the constructor. An instance of JobEntryCopy encapsulates the properties of a job entry, as well as controls the placement of the job entry on the PDI client canvas and connections to hops. Once created, call setDrawn(true) and setLocation(x,y) to make sure the job entry appears correctly on the PDI client canvas. Finally, add the job entry to the job by calling addJobEntry() on the job definition object. It is possible to place the same job entry in several places on the canvas by creating multiple instances of JobEntryCopy and passing in the same job entry instance. 
    3. Once job entries have been added to the job definition, they need to be connected by hops. To create a hop, create an instance of org.pentaho.di.job.JobHopMeta, passing in the From and To job entries as arguments to the constructor. Configure the hop type explicitly. For a conditional hop, call setConditional() and then setEvaluation(true) for a green hop (followed on success) or setEvaluation(false) for a red hop (followed on failure). For an unconditional hop, call setUnconditional(). Add the hop to the job definition by calling addJobHop().

After all job entries have been added and connected by hops, the job definition object can be serialized by calling getXML() and writing the returned XML to a .kjb file, which can then be opened in the PDI client for inspection. The sample class org.pentaho.di.sdk.samples.embedding.GeneratingJobs generates a job in this way.

Obtaining Logging Information

When you need more information about how transformations and jobs execute, you can view PDI log lines and text.

PDI collects log lines in a central place. The org.pentaho.di.core.logging.KettleLogStore class manages all log lines and provides methods for retrieving the log text for specific entities. To retrieve log text or log lines, supply the log channel id generated by PDI during runtime. You can obtain the log channel id by calling getLogChannelId(), which is part of LoggingObjectInterface. Jobs, transformations, job entries, and transformation steps all implement this interface. 

For example, assuming the job variable is an instance of a running or completed job, the following code shows how you retrieve the job's log lines:

LoggingBuffer appender = KettleLogStore.getAppender();
String logText = appender.getBuffer(job.getLogChannelId(), false).toString();

The main methods in the sample classes org.pentaho.di.sdk.samples.embedding.RunningJobs and org.pentaho.di.sdk.samples.embedding.RunningTransformations retrieve log information from the executed job or transformation in this manner.

Exposing a Transformation or Job as a Web Service

To run a PDI job or transformation as part of a web service, write a servlet that maps the incoming request parameters to the parameters of the transformation or job and executes it as part of the request cycle.
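
The parameter-mapping part of such a servlet can be sketched without any servlet or PDI dependency. The following is a minimal sketch under stated assumptions: RequestParameterMapper and its map() method are hypothetical names, the declared names are assumed to come from listParameters(), and the request map is assumed to have the shape returned by ServletRequest.getParameterMap(). Only declared parameters are copied, so a caller cannot assign values to names the transformation or job does not declare.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a PDI class): selects request values for declared parameters. */
final class RequestParameterMapper {
    private RequestParameterMapper() {}

    /**
     * For each declared parameter, takes the first request value if it is present and non-empty.
     * Request parameters that are not declared are ignored.
     */
    static Map<String, String> map(String[] declaredParameters,
                                   Map<String, String[]> requestParameters) {
        Map<String, String> assignments = new LinkedHashMap<>();
        for (String name : declaredParameters) {
            String[] values = requestParameters.get(name);
            if (values != null && values.length > 0
                    && values[0] != null && !values[0].isEmpty()) {
                assignments.put(name, values[0]);
            }
        }
        return assignments;
    }
}
```

Each entry of the returned map would then be passed to setParameterValue() on the TransMeta or JobMeta object before execution.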

Instead of writing a servlet, you can use the Carte server or the Pentaho Server directly by building a transformation that writes its output to the HTTP response of the Carte server. This is achieved by using the Pass Output to Servlet feature of the Text output, XML output, JSON output, or scripting steps. For an example, run the sample transformation, /data-integration/samples/transformations/Servlet Data Example.ktr, on Carte.

Using Non-Native Plugins

To use non-native plugins with an embedded Pentaho Server, you must configure the server to find where the plugins reside. How you configure the server depends on whether your plugin is a folder with associated files or a single JAR file.

If your plugins are folders with associated files, register the folders by setting the KETTLE_PLUGIN_BASE_FOLDERS system property just before the call to KettleEnvironment.init(), as shown in the following example, which registers the “plugins” and “plugins2” folders:

System.setProperty("KETTLE_PLUGIN_BASE_FOLDERS", "C:\\pentaho\\data-integration\\plugins,c:\\plugins2");
KettleEnvironment.init();
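
When the folder list comes from configuration rather than a literal, it may help to build the property value in one place. The following is a minimal sketch using only the Java standard library: the PluginFolders class and its register() method are hypothetical, and the comma-separated format is taken from the example above. It rejects paths that contain a comma, since such a path would be split into two entries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a PDI class): builds and sets KETTLE_PLUGIN_BASE_FOLDERS. */
final class PluginFolders {
    static final String PROPERTY = "KETTLE_PLUGIN_BASE_FOLDERS";

    private PluginFolders() {}

    /** Joins the folders with commas, sets the system property, and returns the value. */
    static String register(List<String> folders) {
        List<String> cleaned = new ArrayList<>();
        for (String folder : folders) {
            String trimmed = folder.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank entries, e.g. from a trailing separator in a config file
            }
            if (trimmed.contains(",")) {
                throw new IllegalArgumentException("Folder path contains a comma: " + trimmed);
            }
            cleaned.add(trimmed);
        }
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException("No plugin folders given");
        }
        String value = String.join(",", cleaned);
        System.setProperty(PROPERTY, value);
        return value;
    }
}
```

As in the example above, the helper would have to be called just before KettleEnvironment.init().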

If your plugin is a single JAR file, annotate the classes for the plugin and include them in the class path, then set the KETTLE_PLUGIN_CLASSES system property to register the fully-qualified class names just before the call to KettleEnvironment.init(), as shown in the following example for a “jsonoutput” plugin:

System.setProperty("KETTLE_PLUGIN_CLASSES","org.pentaho.di.trans.steps.jsonoutput.JsonOutputMeta");
KettleEnvironment.init();
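
Because the plugin classes must be on the class path, a check before registration can surface a missing JAR early. The following is a minimal sketch using only the Java standard library: the PluginClasses class and its register() method are hypothetical, and it assumes that KETTLE_PLUGIN_CLASSES accepts a comma-separated list of class names in the same way that KETTLE_PLUGIN_BASE_FOLDERS accepts a list of folders. The check loads each class without initializing it.

```java
import java.util.List;

/** Hypothetical helper (not a PDI class): verifies and sets KETTLE_PLUGIN_CLASSES. */
final class PluginClasses {
    static final String PROPERTY = "KETTLE_PLUGIN_CLASSES";

    private PluginClasses() {}

    /** Checks that every class can be loaded, then sets the system property and returns the value. */
    static String register(List<String> classNames) {
        if (classNames.isEmpty()) {
            throw new IllegalArgumentException("No plugin classes given");
        }
        ClassLoader loader = PluginClasses.class.getClassLoader();
        for (String name : classNames) {
            try {
                // 'false' means the class is located and loaded, but not initialized
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                throw new IllegalArgumentException("Plugin class not on the class path: " + name, e);
            }
        }
        // Assumption: multiple class names are separated by commas
        String value = String.join(",", classNames);
        System.setProperty(PROPERTY, value);
        return value;
    }
}
```

As in the example above, the helper would have to be called just before KettleEnvironment.init().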

Refer to the Extend Pentaho Data Integration article for more information on creating plugins.

If you have custom job entries or custom transformation steps, you must use one of the above two methods to configure the locations where the embedded server will search for your custom job entries or custom transformation steps.