
Hitachi Vantara Lumada and Pentaho Documentation

PDI transformation tutorial

The Data Integration perspective of Spoon allows you to create two basic file types: transformations and jobs. Transformations describe data flows for ETL, such as reading from a source, transforming data, and loading it into a target location. Jobs coordinate ETL activities, such as defining the flow and dependencies that determine the order in which transformations run, or preparing for execution by checking conditions such as "Is my source file available?" or "Does a table exist in my database?"

This exercise steps you through building your first transformation with Pentaho Data Integration, introducing common concepts along the way. In the exercise scenario, you load a flat file (.csv) of sales data into a database so that mailing lists can be generated. Several of the customer records are missing postal codes (zip codes) that must be resolved before loading into the database. The logic looks like this:

Example Transformation flowchart

Retrieving data from a flat file

To retrieve data from a flat file, connect to a repository, then follow the instructions below.

Procedure

  1. Select File > New > Transformation in the upper left corner of the Spoon window to create a new transformation.

    New Transformation dialog
  2. Under the Design tab, expand the Input node; then, select and drag a Text File Input step onto the canvas.

  3. Double-click the Text File input step. The Text file input window appears. This window allows you to set the properties for this step.

    Text File Input File tab
  4. In the Step Name field, type Read Sales Data.

    This renames the Text file input step to Read Sales Data.
  5. Click Browse to locate the source file, sales_data.csv, in the ...\design-tools\data-integration\samples\transformations\files folder. (The Browse button appears near the top right side of the window near the File or Directory field.) Click Open.

    The path to the source file appears in the File or directory field.
  6. Click Add.

    The path to the file appears under Selected Files.
  7. Perform the following steps to look at the contents of the sample file:

    1. Click the Content tab, then set the Format field to Unix.

    2. Click the File tab again, then click Show file content near the bottom of the window.

    3. The Nr of lines to view window appears. Click the OK button to accept the default.

    4. The Content of first file window displays the file. Examine the file to see how the input file is delimited, what enclosure character is used, and whether or not a header row is present. In the sample, the input file is comma-delimited, the enclosure character is a quotation mark ("), and there is a single header row containing field names.

    5. Click the Close button to close the window.

  8. Perform the following steps to provide information about the content:

    1. Click the Content tab. The fields under the Content tab allow you to define how your data is formatted.

    2. Make sure that the Separator is set to comma (,) and that the Enclosure is set to quotation mark ("). Enable Header because the file contains one header row.

      Text File Input Content tab
    3. Click the Fields tab and click Get Fields to retrieve the input fields from your source file. When the Nr of lines to sample window appears, enter 0 in the field then click OK.

    4. If the Scan Result window displays, click Close to close the window.

    5. To verify that the data is being read correctly, click the Content tab, then click Preview Rows.

    6. In the Enter preview size window, click OK. The Examine preview data window appears. Review the data, then click Close.

    7. Click OK to save the information that you entered in the step.

    8. Select File > Save to save the transformation.

    9. The Transformation Properties window appears. In the Transformation Name field, type Getting Started Transformation.

      The Transformation Properties window appears because you are connected to a repository. If you were not connected to a repository, the standard save window would appear.

      Transformation Properties dialog
    10. Click the folder icon in the Directory field.

    11. Expand the Home directory and double-click the folder in which you want to save the transformation. Your transformation is saved in the Pentaho Repository.

    12. Click OK to close the Transformation Properties window.
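The read settings configured above (comma separator, quotation-mark enclosure, one header row of field names) can be sketched in plain Python for readers who want to see the equivalent parsing logic. This is only an illustration: the inline sample records and field names stand in for the real sales_data.csv and are not part of the product.

```python
import csv
import io

# Inline sample standing in for sales_data.csv: comma-delimited,
# quote-enclosed, with a single header row of field names.
sample = io.StringIO(
    'CUSTOMERNAME,CITY,STATE,POSTALCODE\n'
    '"Acme, Inc.",Boston,MA,02101\n'
    '"Beta Corp",Chicago,IL,\n'
)

# delimiter and quotechar mirror the Separator and Enclosure settings;
# DictReader consumes the header row and uses it to name the fields.
reader = csv.DictReader(sample, delimiter=',', quotechar='"')
rows = list(reader)
```

Because the enclosure character is honored, a value such as "Acme, Inc." keeps its embedded comma, and a trailing empty field (a missing postal code) parses as an empty string rather than shifting the columns.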

Filter records with missing postal codes

After completing Retrieve Data from a Flat File, you are ready to add the next step to your transformation. The source file contains several records that are missing postal codes. Use the Filter Rows transformation step to separate out those records so that you can resolve them in a later exercise.

Procedure

  1. Add a Filter Rows step to your transformation. Under the Design tab, expand the Flow node, then select Filter Rows.

  2. Create a hop between the Read Sales Data step and the Filter Rows step. Hops are used to describe the flow of data in your transformation. To create the hop, click the Read Sales Data (Text File input) step, then hold down the SHIFT key and drag a line to the Filter Rows step.

    Create a hop
  3. Double-click the Filter Rows step. The Filter Rows window appears.

  4. In the Step Name field, type Filter Missing Zips.

  5. Under The condition, click <field>.

    The Fields window appears, listing the fields you can use in the condition.
  6. In the Fields window select POSTALCODE and click OK.

  7. Click the comparison operator (set to = by default) and select IS NOT NULL from the Functions: window that appears.

  8. Click OK to close the Functions: window.

  9. Click OK to exit the Filter Rows window.

    Note: You will return to this step later to configure the Send true data to step and Send false data to step settings after adding their target steps to your transformation.
  10. Select File > Save to save your transformation.
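The Filter rows condition configured above (POSTALCODE IS NOT NULL) splits the stream into a true branch and a false branch. A rough Python equivalent, using made-up records in place of the real sales data:

```python
# Illustrative records; None / empty string stand in for a NULL POSTALCODE.
rows = [
    {'CUSTOMERNAME': 'Acme, Inc.', 'POSTALCODE': '02101'},
    {'CUSTOMERNAME': 'Beta Corp', 'POSTALCODE': None},
]

def postalcode_is_not_null(row):
    """Mirror the IS NOT NULL comparison on the POSTALCODE field."""
    return row['POSTALCODE'] not in (None, '')

# These correspond to the "Send true data to step" and
# "Send false data to step" targets you configure later.
true_branch = [r for r in rows if postalcode_is_not_null(r)]
false_branch = [r for r in rows if not postalcode_is_not_null(r)]
```

Records in the true branch go on to the database load; records in the false branch are held back so their postal codes can be resolved first.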

Loading your data into a relational database

After completing Filter Records with Missing Postal Codes, you are ready to take all records exiting the Filter rows step where the POSTALCODE was not null (the true condition), and load them into a database table.

Procedure

  1. Under the Design tab, expand the contents of the Output node.

  2. Click and drag a Table Output step into your transformation. Create a hop between the Filter Missing Zips and Table Output steps. In the dialog box that appears, select Result is TRUE.

  3. Double-click the Table Output step to open its edit properties dialog box.

  4. Rename your Table Output step to Write to Database.

  5. Click New next to the Connection field. You must create a connection to the database.

    The Database Connection dialog box appears.
  6. Provide the settings for connecting to the database.

    Field: Setting
    Connection Name: Sample Data
    Connection Type: Hypersonic
    Host Name: localhost
    Database Name: sampledata
    Port Number: 9001
    User Name: pentaho_admin
    Password: password (If "password" does not work, please check with your system administrator.)
  7. Click Test to make sure your entries are correct. A success message appears. Click OK.

    Note: If you get an error when testing your connection, ensure that you have provided the correct settings as described in the table and that the sample database is running. See Start and Stop the Pentaho Server for information about how to start the Pentaho Server.
  8. Click OK to exit the Database Connections window.

  9. Type SALES_DATA in the Target Table text field.

  10. Since this table does not exist in the target database, you will need to use the software to generate the Data Definition Language (DDL) needed to create the table, and then execute it. DDL statements are the SQL commands that define the structures in a database, such as CREATE TABLE.

    1. In the Table Output window, enable the Truncate Table property.

    2. Click the SQL button at the bottom of the Table output dialog box to generate the DDL for creating your target table.

    3. The Simple SQL editor window appears with the SQL statements needed to create the table.

    4. Click Execute to execute the SQL statement.

    5. The Results of the SQL statements window appears. Examine the results, then click OK to close the Results of the SQL statements window.

    6. Click Close in the Simple SQL editor window to close it.

    7. Click OK to close the Table output window.

  11. Save your transformation.
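The Table output steps above amount to: create the target table from generated DDL, truncate it, and insert the incoming rows. The sketch below illustrates that sequence using Python's built-in sqlite3 module in place of the Hypersonic sample database; the two-column table is an illustrative subset of the real SALES_DATA layout, not the generated DDL itself.

```python
import sqlite3

# In-memory database standing in for the sampledata Hypersonic database.
conn = sqlite3.connect(':memory:')

# Generated DDL defines the target table (illustrative subset of columns).
conn.execute('CREATE TABLE SALES_DATA (CUSTOMERNAME TEXT, POSTALCODE TEXT)')

# The Truncate Table option empties the table before each load.
conn.execute('DELETE FROM SALES_DATA')

# The step then inserts each row flowing in from the true branch.
rows = [('Acme, Inc.', '02101'), ('Beta Corp', '60601')]
conn.executemany('INSERT INTO SALES_DATA VALUES (?, ?)', rows)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM SALES_DATA').fetchone()[0]
```

Truncating before the load makes the transformation re-runnable: executing it twice leaves one copy of the data rather than two.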

Retrieving data from your lookup file

After Loading Your Data into a Relational Database, you are ready to retrieve data from your lookup file. You have been provided a second text file containing a list of cities, states, and postal codes that you will now use to look up the postal codes for all of the records where they were missing (the false branch of your Filter rows step). First, you will use a Text file input step to read from the source file, then you will use a Stream lookup step to bring the resolved postal codes into the stream.

Procedure

  1. Add a new Text File Input step to your transformation.

    In this step you will retrieve the records from your lookup file. Do not add a hop yet.
  2. Open the Text File Input step window, then enter Read Postal Codes in the Step name property.

  3. Click Browse to locate the source file, Zipssortedbycitystate.csv, located at ...\design-tools\data-integration\samples\transformations\files.

  4. Click Add.

    The path to the file appears under Selected files.
  5. To look at the contents of the sample file:

    1. Click the Content tab, then set the Format field to Unix.

    2. Click the File tab again, then click Show file content near the bottom of the window.

    3. The Nr of lines to view window appears. Click the OK button to accept the default.

    4. The Content of first file window displays the file.

      Examine the file to see how the input file is delimited, what enclosure character is used, and whether or not a header row is present. In the example, the input file is comma (,) delimited, the enclosure character is a quotation mark ("), and it contains a single header row containing field names.
    5. Click the Close button to close the window.

  6. In the Content tab, change the Separator character to a comma (,) and confirm that the Enclosure setting is a quotation mark (").

    Make sure the Header option is selected.
  7. Under the Fields tab, click Get Fields to retrieve the data from your .csv file.

  8. The Nr of lines to sample window appears. Enter 0 in the field, then click OK.

  9. If the Scan Result window displays, click Close to close it.

  10. Click Preview rows to make sure your entries are correct.

    1. When prompted to enter the preview size, click OK.

    2. Review the information in the window, then click Close.

  11. Click OK to exit the Text File input window.

  12. Save the transformation.

Resolving missing zip code information

After Retrieving Data from Your Lookup File, you can begin to resolve the missing zip codes.

Procedure

  1. Add a Stream Lookup step to your transformation by clicking the Design tab, expanding the Lookup folder, then choosing Stream Lookup.

  2. Draw a hop from the Filter Missing Zips to the Stream lookup step. In the dialog box that appears, select Result is FALSE.

  3. Create a hop from the Read Postal Codes step to the Stream lookup step.

    Missing Zip Codes workflow example
  4. Double-click on the Stream lookup step to open the Stream Value Lookup window.

  5. Rename Stream Lookup to Lookup Missing Zips.

  6. From the Lookup step drop-down box, select Read Postal Codes as the lookup step.

  7. Do the following:

    1. Define the CITY and STATE fields in the key(s) to look up the value(s) table.

    2. In row #1, click the drop down in the Field column and select CITY.

    3. Then, click in the LookupField column and select CITY.

    4. In row #2, click the drop down field in the Field column and select STATE.

    5. Then click in the LookupField column and select STATE.

      Stream Value Lookup Example
  8. Click Get Lookup Fields.

  9. POSTALCODE is the only field you want to retrieve. To delete the CITY and STATE lines, right-click each line and select Delete Selected Lines.

  10. In the New Name field, give POSTALCODE a new name of ZIP_RESOLVED and make sure the Type is set to String.

  11. Enable Use sorted list (i.s.o. hashtable).

    Stream Value Lookup Example

  12. Click OK to close the Stream Value Lookup edit properties dialog box.

  13. Save your transformation.

  14. To preview the data:

    1. In the canvas, select the Lookup Missing Zips step, then right-click. From the menu that appears, select Preview.

    2. In the Transformation debug dialog window, click Quick Launch to preview the data flowing through this step.

    3. The Examine preview data window appears.

      Notice that the new field, ZIP_RESOLVED, has been added to the stream containing your resolved postal codes.
    4. Click Close to close the window.

    5. If the Select the preview step window appears, click Close.

      The execution results near the bottom of the Spoon window display updated metrics in the Step Metrics tab.
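Conceptually, the Stream lookup step configured above builds a lookup keyed on CITY and STATE from the Read Postal Codes stream and appends the retrieved POSTALCODE to each incoming record as ZIP_RESOLVED. A rough Python sketch with made-up data (the city/zip pairs are illustrative, not taken from the sample files):

```python
# Lookup stream (Read Postal Codes), keyed on (CITY, STATE).
postal_lookup = {
    ('Boston', 'MA'): '02101',
    ('Chicago', 'IL'): '60601',
}

# False branch of Filter Missing Zips: records lacking a postal code.
missing = [
    {'CUSTOMERNAME': 'Beta Corp', 'CITY': 'Chicago', 'STATE': 'IL'},
]

# Retrieve POSTALCODE and add it to the stream as ZIP_RESOLVED.
for row in missing:
    key = (row['CITY'], row['STATE'])
    row['ZIP_RESOLVED'] = postal_lookup.get(key)
```

The Use sorted list (i.s.o. hashtable) option changes how PDI stores and searches this lookup data internally; the result of the lookup is the same either way.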

Completing your transformation

After you resolve the missing zip code information, the last task is to clean up the field layout on your lookup stream so that it matches the format and layout of the other stream going to the Write to Database step. Create a Select values step to rename fields on the stream, remove unnecessary fields, and more.

Procedure

  1. Add a Select Values step to your transformation by expanding the Transform folder and choosing Select Values.

  2. Create a hop from the Lookup Missing Zips to the Select Values step.

  3. Double-click the Select Values step to open its properties dialog box.

  4. Rename the Select Values step to Prepare Field Layout.

  5. Click Get fields to select to retrieve all fields and begin modifying the stream layout.

  6. In the Fields list, find the # column and click the number for the ZIP_RESOLVED field.

    Use CTRL+UP to move ZIP_RESOLVED just below the POSTALCODE field (the one that still contains null values).
  7. Select the old POSTALCODE field in the list (line 20) and delete it.

    Rename values dialog

  8. The original POSTALCODE field was formatted as a 9-character string. You must modify your new field to match that format. Click the Meta-Data tab.

  9. In the first row of the Fields to alter the meta-data for table, click in the Fieldname column and select ZIP_RESOLVED.

  10. Do the following:

    1. Type POSTALCODE in the Rename to column.

    2. Select String in the Type column, and type 9 in the Length column.

    3. Click OK to exit the edit properties dialog box.

  11. Draw a hop from the Prepare Field Layout (Select values) step to the Write to Database (Table output) step.

  12. When prompted, select the Main output of the step option.

  13. Save your transformation.

    Renaming fields workflow example
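The Select values configuration above boils down to: drop the old null-valued POSTALCODE, rename ZIP_RESOLVED to POSTALCODE, and coerce it to a 9-character string. A minimal Python sketch of that per-row logic, using an illustrative record:

```python
def prepare_field_layout(row):
    """Rough equivalent of the Prepare Field Layout (Select values) step:
    drop the old POSTALCODE, rename ZIP_RESOLVED to POSTALCODE, and
    coerce the value to a string of at most 9 characters."""
    out = dict(row)
    out.pop('POSTALCODE', None)              # remove the null-valued field
    zip_resolved = out.pop('ZIP_RESOLVED')   # rename to POSTALCODE
    out['POSTALCODE'] = str(zip_resolved)[:9]
    return out

# Illustrative record from the lookup stream.
row = {'CUSTOMERNAME': 'Beta Corp', 'POSTALCODE': None,
       'ZIP_RESOLVED': '60601'}
prepared = prepare_field_layout(row)
```

After this step, both streams feeding the Write to Database step carry the same field names and types, which is what lets the Table output step accept rows from either branch.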

Run your transformation

Data Integration provides a number of deployment options. Running a Transformation explains these and other options available for execution. This final part of the exercise focuses exclusively on the Local run option.

Procedure

  1. In the PDI client window, select Action > Run.

    The Run Options window appears.
  2. Keep the default Pentaho local option for this exercise.

    It will use the native Pentaho engine and run the transformation on your local machine. See Run configurations if you are interested in setting up configurations that use another engine, such as Spark, to run a transformation.
  3. Click Run.

    The transformation executes. Upon running the transformation, the Execution Results panel opens below the canvas.

Results

The Execution Results section of the window contains several different tabs that help you to see how the transformation executed, pinpoint errors, and monitor performance.
  • The Step Metrics tab provides statistics for each step in your transformation, including how many records were read or written, how many caused errors, processing speed (rows per second), and more. This tab also indicates whether an error occurred in a transformation step. This tutorial should run without errors, but if a mistake occurs, the steps that caused the transformation to fail are highlighted in red. In the example below, the Lookup Missing Zips step caused an error.

    Error message display
  • The Logging tab displays the logging details for the most recent execution of the transformation and allows you to drill deeper to determine where errors occur. Error lines are highlighted in red. In the example below, the Lookup Missing Zips step caused an error because it attempted to look up values on a field called POSTALCODE2, which did not exist in the lookup stream.

    Transformation logging display
  • The Execution History tab provides you access to the Step Metrics and log information from previous executions of the transformation. This feature works only if you have configured your transformation to log to a database through the Logging tab of the Transformation Settings dialog box. For more information on configuring logging or viewing the execution history, see Analyzing Your Transformation Results.
  • The Performance Graph allows you to analyze the performance of steps based on a variety of metrics including how many records were read, written, caused an error, processing speed (rows per second) and more. Like the Execution History, this feature requires you to configure your transformation to log to a database through the Logging tab of the Transformation Settings dialog box.
  • The Metrics tab allows you to see a Gantt chart after the transformation or job has run. This shows you information such as how long it takes to connect to a database, how much time is spent executing a SQL query, or how long it takes to load a transformation.

    Step metrics tab
  • The Preview Data tab displays a preview of the data.