Skip to main content

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

ETL metadata injection

Parent article

Metadata injection inserts data from various sources into a transformation at runtime. This insertion reduces repetitive ETL tasks. For example, you might have a simple transformation to load transaction data values from a supplier, filter specific values, and output them to a file. If you have more than one supplier, you would need to run this simple transformation for each supplier. Yet, with metadata injection, you can expand this simple repetitive transformation by inserting metadata from another transformation that contains the ETL Metadata Injection step. This step coordinates the data values from the various inputs through the metadata you define. This process reduces the need for you to adjust and run the repetitive transformation for each specific input.

The repetitive transformation is known as the template transformation. The template transformation is called by the ETL Metadata Injection step. You will create a transformation to prepare what common values you want to use as metadata and inject these specific values through the ETL Metadata Injection step as shown in the following diagram:

ETL Metadata Injection Process

We recommend the following basic procedure for using this step to inject metadata:

  1. Optimize your data for injection, such as preparing folder structures and inputs.
  2. Develop transformations for the repetitive process (the template transformation), for metadata injection through the ETL Metadata Injection step, and for handling multiple inputs.

The metadata is injected into the template transformation through any step that supports metadata injection. See Steps supporting metadata injection for which steps support metadata injection.

General

Enter the following information in the transformation step fields:

FieldDescription
Step NameSpecifies the unique name of the ETL Metadata Injection step on the canvas. You can customize the name or leave it as the default.
Transformation

Specify your template transformation by entering in its path. Click Browse to display and enter the path details using the Using the virtual file system browser in PDI.

If you select a transformation that has the same root path as the current transformation, the variable ${Internal.Entry.Current.Directory} will automatically be inserted in place of the common root path. For example, if the current transformation's path is /home/admin/transformation.ktr and you select a transformation in the folder /home/admin/path/sub.ktr than the path will automatically be converted to ${Internal.Entry.Current.Directory}/path/sub.ktr.

If you are working with a repository, specify the name of the template transformation in your repository. If you are not working with a repository, specify the XML file name of the template transformation on your system.

Template transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Options

The ETL Metadata Injection step features two tabs with fields. Each tab is described below.

Inject metadata tab

ETL Metadata Injection Step Metadata Tab

The columns of the table in this tab specifies what fields in the template transformation are injected with metadata. The following table describes these columns:

ColumnDescription
Target injection step keyLists the available fields in each step of the template transformation that can be injected with metadata.
Target descriptionDescribes how the target fields relate to their target steps.
Source stepLists the step associated with the fields to be injected into the target fields as metadata.
Source fieldLists the fields to be injected into the target fields as metadata.

Specify the source field

To specify the source field as metadata to be injected, perform the following steps:

Procedure

  1. In the Target injection step key column, double-click the field for which you want to specify a source field.

    The Source field dialog box opens.
  2. Select a source field and click OK.

  3. (Optional) Select Use constant value to specify a constant value for the injected metadata through one of the following actions:

    • Manually entering a value.
    • Using an internal variable to set the value: ${Internal.Step.Unique.Count}, for example.
    • Using a combination of manually specified values and parameter values: ${FILE_PREFIX}_${FILE_DATE}.txt, for example.

Injecting metadata into the ETL Metadata Injection step

For injecting metadata into the ETL Metadata Injection step itself, the following exceptions apply:

  • To inject a method for how to specify a field (such as by FILENAME, REPOSITORY_BY_NAME, or REPOSITORY_BY_REFERENCE), set a TRANS_SPECIFICATION_METHOD constant to the field of an input step. You can then map the field as a source to the TRANS_SPECIFICATION_METHOD constant in the ETL Metadata Injection step.
  • The target field for the ETL Metadata Injection step inserting the metadata into the original injection is defined by [GROUP NAME].[FIELD NAME]. For example, if the GROUP NAME is 'OUTPUT_FIELDS' and the FIELD NAME is 'OUTPUT_FIELDNAME', you would set the target field to 'OUTPUT_FIELDS.OUTPUT_FIELDNAME'.

Options tab

ETL Metadata Injection Step Options Tab

Enter the following optional settings:

OptionDescription
Step to read from (optional)(Optional) Select a step in your template transformation to pass data directly to a step following the ETL Metadata Injection step in your current transformation.
Field nameIf Step to read from is selected, enter the name of the field passed directly from the step in the template transformation.
TypeIf Step to read from is selected, select the type of the field passed directly from the step in the template transformation.
LengthIf Step to read from is selected, enter the length of the field passed directly from the step in the template transformation.
PrecisionIf Step to read from is selected, enter the precision of the field passed directly from the step in the template transformation.
Optional target file (KTR after injection)For initial transformation development or debugging, specify an optional file for creating and saving a transformation of your template after metadata injection occurs. The resulting transformation will be your template transformation with the metadata already injected as constant values.
Streaming source stepSelect a source step in your current transformation to directly pass data to the Streaming target step in the template transformation.
Streaming target stepSelect the target step in your template transformation to receive data directly from the Streaming source step.
Run resulting transformationSelect to inject metadata and run the template transformation. If this option is not selected, metadata injection occurs, but the template transformation does not run.

Example

In this example, you have a template transformation to load transaction data values from a supplier’s spreadsheet, filter specific values to examine, and output them to a text file. The template transformation is injected with metadata from spreadsheets created by you.

In this example, we show you how to use metadata injection through the following steps:

  • Step 1: Create Sample Data: The folder structure and spreadsheet files used to show how data can be set up for metadata injection.
  • Step 2: Develop Your Transformations: The transformations for the main process (the template transformation), metadata injection through the ETL Metadata Injection step, and dealing with multiple supplier files.
  • Step 3: Run and Examine Results: The single source output text file and logging.

This example assumes a basic understanding of working with transformations and steps.

Create sample data

Data files are frequently uploaded from various sources. For this step, we model a situation where two suppliers have uploaded spreadsheets into a data/in folder.

When using metadata injection, you usually want to focus on a subset of data values common to all your input files. For this example, we derive metadata for the following values:
  • Transaction date
  • Transaction invoice number
  • Net value of the transaction
  • Currency used in the transaction

The metadata for these values and the output target text file are created and maintained in the metadata folder.

Procedure

  1. Create a new folder named Pentaho_Metadata_Injection_Example on your computer, and then create the following folder structure inside it.

    ETL Metadata Injection Step Example folder structure
  2. Create a spreadsheet for supplier #1 in Microsoft Excel, and save it as data_format_a.xlsx in the data/in/supplier1 folder.

    Tx Date and TimeCheckInvPriceNetVATGrossVATPCur
    01/26/2016 07:24:001234594061.1362.6911.9174.619EUR
    01/26/2016 07:31:001234694071.1354.6310.3865.0119EUR
  3. Create a spreadsheet for supplier #2 in Microsoft Excel, and save it as data_format_b.xlsx in the data/in/supplier2 folder.

    TransferDataTimeCheckInvCurVATGrossVATPNet
    01/26/2016 07:23:008009013711EUR8.9556.061947.11
    01/26/2016 07:30:008009273712EUR0.945.89194.95
    01/26/2016 07:32:008009903713EUR15.3896.361980.98
  4. Create two tabs in a spreadsheet of metadata fields for suppliers, and save it as metadata_suppliers.xlsx in the metadata folder.

    • The supplier1 tab.
      NameTypeLengthPrecisionTrim typeRepeatFormatCurrencyDecimalGroupingTarget field
      Tx Date and TimeDate-1-1noneNyyyy/MM/dd HH:mm:sstransaction_date
      CheckInteger-1-1noneN
      InvInteger-1-1noneNtransaction_invoice
      PriceBigNumber82noneN
      NetBigNumber82noneNtransaction_net
      VATBigNumber82noneN
      GrossBigNumber82noneN
      VATPBigNumber21noneN
      CurString3-1noneNtransaction_cur
    • The supplier2 tab.
    NameTypeLengthPrecisionTrim typeRepeatFormatCurrencyDecimalGroupingTarget field
    TransferDateTimeDatenoneNtransaction_date
    CheckNumbernoneN
    InvNumbernoneNtransaction_invoice
    CurStringnoneNtransaction_cur
    VATNumbernoneN
    GrossNumbernoneN
    VATPNumbernoneN
    NetNumbernoneNtransaction_net
  5. Create a spreadsheet of metadata fields for targets, and save it as metadata_target.xlsx in the metadata folder.

    NameTypeFormatLengthPrecisionCurrencyDecimalGroupTrim typeNull
    supplierStringboth
    source_filenameStringnone
    source_rowInteger#none
    transaction_dateDateyyyy-MM-ddnone
    transaction_invoiceBigNumber#-1-1none
    transaction_netBigNumber#.##-1-1none
    transaction_curString3both

Develop your transactions

You need to develop a transformation for the main repetitive process, which is often known as the template transform. For this example, you need a transformation (process_supplier_file) to process the transactions in each supplier’s file. Then, the metadata needs to be injected from a transformation (inject_supplier_metadata) developed with the ETL Metadata Injection step. The ETL Metadata Injection step calls the template transformation. Since this example is for inserting data from multiple files, the metadata injection transformation needs to be called from another transformation (process_all_suppliers) per each supplier file.

All total, three transformations need to be developed:

  • Template Transformation: The main repetitive transformation for processing the data per each supplier’s spreadsheet.
  • Metadata Injection Transformation: The transformation defining the structure of the metadata and how it is injected into the main transformation.
  • Transformation for All Suppliers: The transformation going through all the suppliers’ spreadsheets, calling the metadata injection transformation per each supplier, and logging the entire process (for possible troubleshooting, if needed).

Template transformation

With metadata injection, you develop your repetitive, template transformation as you would normally. The main difference is how the settings for each step pertains to the metadata injection, instead of data values of a single specific source.

Procedure

  1. Open a new transformation and save it as process_supplier_file.ktr to the transformations folder.

  2. Drag a Microsoft Excel Input step onto the canvas.

  3. Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the Additional output fields tab.

    FieldValue
    Full filename fieldprocessed_filename
    Sheet row nr fieldsource_row
    Short filename fieldsource_filename
  4. Drag a Select values step onto the canvas and connect a hop from Microsoft Excel Input to Select values.

  5. Drag a Get Variables step onto the canvas and connect a hop from Select values to Get Variables, while selecting Main output of step. Also fill out the following values in the Fields table.

    FieldValue
    Namesupplier
    Variable${supplier}
    TypeString
    Trim typenone
  6. Drag a Text file output step onto the canvas and connect a hop from Get Variables to Text file output. Also fill out the following values in the File and Content tabs.

    FieldValue
    Filename (in File tab)${Internal.Entry.Current.Directory}/../data/out/processed_data
    Include date in filename? (in File tab)enable
    Add filenames to result (in File tab)disable
    Append (in Content tab)enable
  7. Save your process_supplier_file.ktr file.

Metadata injection transformation

For this example, our metadata values are maintained in separate spreadsheet files. You need to create a transformation to extract in these values, prepare them for the injection, and then insert them into the template transformation through the ETL Metadata Injection step, as shown in the following figure:

ETL Metadata Injection Step Sample Transformation
Extract the Metadata
For this example, you need to define the input files, access the metadata, and structure the output based on this metadata.

Procedure

  1. Open a new transformation and save it as inject_supplier_metadata.ktr to the transformations folder.

  2. Drag a Get row from result step to the canvas, and fill out the following values in the Fields table.

    #FieldnameTypeLength
    1supplierString500
    2folderNameString500
  3. Drag a Microsoft Excel Input step onto the canvas, name the step Metadata Suppliers.

  4. Browse to Add the ${Internal.Entry.Current.Directory}/../metadata/metadata_suppliers.xlsx file in the Files tab.

    After adding the file, you can verify its path by clicking Show filenames...
  5. Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the !Fields tab table.

    #NameTypeLengthPrecisionTrim typeRepeat
    1NameString-1-1noneN
    2TypeString-1-1noneN
    3LengthInteger-1-1noneN
    4PrecisionInteger-1-1noneN
    5Trim typeString-1-1noneN
    6RepeatString-1-1noneN
    7FormatString-1-1noneN
    8CurrencyString-1-1noneN
    9DecimalString-1-1noneN
    10GroupingString-1-1noneN
    11Target fieldString-1-1noneN
  6. In the Additional output fields tab, set Sheetname field to metadata_supplier.

  7. Drag another Microsoft Excel Input step onto the canvas, name the step Metadata Target.

  8. Add the ${Internal.Entry.Current.Directory}/../metadata/metadata_target.xlsx file in the Files tab.

    After adding the file, you can verify its path by clicking Show filenames..
  9. Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the !Fields tab table.

    #NameTypeLengthPrecisionTrim typeRepeat
    1NameString-1-1noneN
    2TypeString-1-1noneN
    3FormatString-1-1noneN
    4LengthInteger-1-1noneN
    5PrecisionInteger-1-1noneN
    6CurrencyString-1-1noneN
    7DecimalString-1-1noneN
    8GroupString-1-1noneN
    9Trim typeString-1-1noneN
    10NullString-1-1noneN
  10. In the Additional output fields tab, set Sheetname field to metadata_supplier.

Prepare the metadata

With the supplier files specified and the metadata accessed, you need to prepare this information to be injected into the main transformation.

Procedure

  1. Drag a Join Rows step to the canvas, and connect a hop from the Get rows from result step and Join Rows. Also connect a hop from the Metadata Suppliers step to Join Rows.

  2. Set The condition in Join Rows to metadata_supplier = supplier.

  3. Drag an Add constants step to the canvas, and connect a hop from Get rows from result to Add constants. When the Warning dialog box appears, click Copy.

  4. Fill out the Fields table of Add constants with the following values.

    #NameTypeValueSet empty string?
    1wildcard_includeString.*N
    2wildcard_excludeString N
  5. Drag a Get file names step to the canvas, and connect a hop from Add constants to Get file names. Also fill out the following values in the File tab.

    FieldValue
    Filename is defined in a field?enable
    Get filename from fieldfolderName
    Get wildcard from field (RegExp)wildcard_include
    Exclude wildcard fieldwildcard_exclude
Inject the metadata

With the metadata prepared, you need to associate it with the main transformation for it to be inserted at runtime.

Procedure

  1. Drag the ETL Metadata Injection step to the canvas, and connect three hops:

    • From Join Rows to ETL Metadata Injection
    • From Get File Names to ETL Metadata Injection
    • From Metadata Target to ETL Metadata Injection
  2. In the File tab of the step properties, set ${Internal.Entry.Current.Directory}/process_supplier_file.ktr to Use a file for the transformation template.

  3. Click Validate and Refresh.

  4. In the Inject Metadata tab, set the Source step and Source field for the following fields.

    Target injection step, keySource stepSource field
    Microsoft Excel Input > FIELDS > NAMEJoin RowsName
    Microsoft Excel Input > FIELDS > LENGTHJoin RowsLength
    Microsoft Excel Input > FIELDS > PRECISIONJoin RowsPrecision
    Microsoft Excel Input > FIELDS > FORMATJoin RowsFormat
    Microsoft Excel Input > FIELDS > CURRENCYJoin RowsCurrency
    Microsoft Excel Input > FIELDS > DECIMALJoin RowsDecimal
    Microsoft Excel Input > FIELDS > GROUPJoin RowsGrouping
    Microsoft Excel Input > FIELDS > REPEATJoin RowsRepeat
    Microsoft Excel Input > FIELDS > TYPEJoin RowsType
    Microsoft Excel Input > FILENAME_LINES > FILENAMEGet File Namesfilename
    Select values > METAS > META_NAMEJoin RowsName
    Select values > METAS > RENAMEJoin RowsTarget field
    Text file output > OUTPUT_FIELDS > OUTPUT_FIELDNAMEMetadata TargetName
    Text file output > OUTPUT_FIELDS > OUTPUT_FORMATMetadata TargetFormat
    Text file output > OUTPUT_FIELDS > OUTPUT_LENGTHMetadata TargetLength
    Text file output > OUTPUT_FIELDS > OUTPUT_PRECISIONMetadata TargetPrecision
    Text file output > OUTPUT_FIELDS > OUTPUT_CURRENCYMetadata TargetCurrency
    Text file output > OUTPUT_FIELDS > OUTPUT_DECIMALMetadata TargetDecimal
    Text file output > OUTPUT_FIELDS > OUTPUT_GROUPMetadata TargetGroup
    Text file output > OUTPUT_FIELDS > OUTPUT_NULLMetadata TargetNull
    Text file output > OUTPUT_FIELDS > OUTPUT_TYPEMetadata TargetType
    Text file output > OUTPUT_FIELDS > OUTPUT_TRIMMetadata TargetTrim type
  5. Save your inject_supplier_metadata.ktr file.

Transformation for all suppliers

Since we have multiple input sources, we need a transformation to run through each source and inject the metadata. Each input source is specified through a variable in a Transformation Executor step, which calls to the metadata injection transformation.

Procedure

  1. Open a new transformation and save it as process_all_suppliers.ktr to the transformations folder.

  2. Drag a Get Subfolder names step to the canvas, and add the ${Internal.Entry.Current.Directory}/../data/ in folder in the Folder tab.

  3. Drag a Select values step to the canvas, and connect a hop from Get Subfolder names to Select values.

  4. Fill out the following values in the Fields table of the Select & Alter tab.

    #FieldnameRename to
    1folderName
    2short_folderNamesupplier
  5. Drag a Text file output step to the canvas.

  6. Fill out the following values in the File tab.

    FieldValue
    Filename${Internal.Entry.Current.Directory}/../logging/log
    Include date in filename?enable
    Include time in filename?enable
  7. Drag a Transformation Executor step to the canvas, connect a hop from Select values to Transformation Executor, then select Main output of step in the context menu.

  8. Also, connect a hop from Transformation Executor to Text file output, then select This output will contain the execution results in the context menu.

  9. Fill out the following values in the Parameters and Execution results tabs.

    FieldValue
    File name${Internal.Entry.Current.Directory}/inject_supplier_metadata.ktr
    Variable / Parameter name (in Parameters tab)supplier
    Field to use (in Parameters tab)supplier
    The target step for the execution results (in the Execution Results tab)Text file output
  10. Save your process_all_suppliers.ktr file.

Run and examine results

You run the entire process for all the supplier file by running the process_all_suppliers transformation, which runs the inject_supplier_metadata transformation for each supplier input file. The inject_supplier_metadata transformation then runs the template process_supplier_file transformation.

These transformations create a single source text output file in the data/out folder. The logs generated by the process_all_suppliers transformation are in the logging folder.

Procedure

  1. Run the process_all_suppliers transformation.

  2. Examine the processed_data_{today’s date}.txt file in the data/out folder and the log_{timestamp}.txt file in the logging folder.

Reference links

Below are links to articles and videos about using the ETL Metadata Injection step in PDI.

Articles

The following articles provide more information about the ETL Metadata Injection step.

Video

The following video provides more information about the ETL Metadata Injection step.

https://youtu.be/EjzgzOanq1o