ETL Metadata Injection
Overview
The ETL Metadata Injection step is capable of injecting step metadata into a template transformation.
The ETL Metadata Injection step inserts metadata into a template transformation. Instead of statically entering ETL metadata in a step dialog, you pass it at run-time. This step enables you to solve repetitive ETL workloads like loading of text files, data migration, and so on.
The following steps support metadata injection:
Step | Version Introduced | Fields Supporting Metadata Injection |
---|---|---|
Concat Fields | 5.1 | All fields |
CSV File Input | 4.1 | See CSV File Input for a list of supported fields |
Data Grid | 5.1 | All fields |
Fixed File Input | 4.1 | See Fixed File Input for a list of supported fields |
Get Data from XML | 5.0 | See Get Data from XML for a list of supported fields |
Google Analytics | 6.1 | All fields |
Group By | 5.0 | All fields |
GZIP CSV Input | 5.1 | All fields |
JSON Output | 5.2 | All fields |
Microsoft Access Input | 5.0 | See Microsoft Access Input for a list of supported fields |
Microsoft Excel Input | 4.1 | See Microsoft Excel Input for a list of supported fields |
Microsoft Excel Output | 5.1 | All fields (as of 6.1) |
Microsoft Excel Writer | 5.3 | See Microsoft Excel Writer for a list of supported fields |
Pentaho Reporting Output | 5.0 | All fields |
PostgreSQL Bulk Loader | 5.1 | All fields |
Row Denormaliser | 4.2 | See Row Denormaliser for a list of supported fields |
Row Normaliser | 4.2 | See Row Normaliser for a list of supported fields |
S3 CSV Input | 6.1 | All fields |
S3 File Output | 6.1 | All fields |
Select Values | 4.1 | All fields |
Sort Rows | 5.0 | All fields |
Split Field | 5.0 | All fields (as of 6.1) |
Stream Lookup | 6.1 | All fields |
Table Input | 5.2 | See Table Input for a list of supported fields |
Table Output | 5.1 | See Table Output for a list of supported fields |
Text File Input | 5.0 | All fields |
Text File Output | 5.2 | All fields |
User Defined Java Expression | 5.2 | All fields |
Value Mapper | 6.1 | All fields |
XML Output | 6.1 | All fields |
Options
Option | Description |
---|---|
Transformation template | In this section of the dialog, you can specify the transformation to use as a template. When you have specified a transformation, you can use the Validate and Refresh button. The Edit button will open the specified template in a new tab in Spoon. |
Template step to read from (optional) | If you specify a step from the template here, then the output of the ETL Metadata Injection step will be the output from the source step. |
Optional target file (KTR after injection) | For debugging or transformation generation, you can save the resulting transformation filename, after metadata injection, to a file. If you want, you can specify a file name, result.ktr for example. |
Don't execute resulting transformation | If you prefer to not execute the resulting transformation (after metadata injection), enable this option. |
Field mapping | You can select any row in the metadata tree table with your mouse, which pops up a source step and field selection dialog. |
Data Streaming
Since version 5.1, this step is capable of streaming data from one transformation into another.
To pass data from your template transformation (after injection, during execution) to your current transformation, specify Template step to read from. You can also specify the expected output fields easily design the steps which come after the ETL Metadata Injection step.
To pass data from a source step into the template transformation (again, after injection) you can specify Streaming source step and Streaming target step in the template transformation.
Example
Metadata injection inserts data from various sources into your transformation at runtime. This insertion reduces repetitive ETL tasks for various input and out files.
For example, you might have a simple transformation to load transaction data values from a supplier’s spreadsheet, filter out specific values to examine, and output them to a text file. You can expand this simple repetitive “template” transformation with metadata injection to load data values from multiple suppliers’ spreadsheets in various folders, filter out common specific transaction values to examine, and output all of it to a single source text file.
In this example, we show you how to use metadata injection through these steps:
- Step 1: Create Sample Data – the folder structure and spreadsheet files used to show how data can be set up for metadata injection
- Step 2: Develop Your Transformations – the transformations for the main process (the template transformation), metadata injection through the ETL Metadata Injection step, and dealing with multiple supplier files
- Step 3: Run and Examine Results – the single source output text file and logging
This example assumes a basic understanding of working with transformations and steps.
Step 1: Create Sample Data
Data files are frequently uploaded from various sources. For this step, we model a situation where two suppliers have uploaded spreadsheets into a data/in folder.
When using metadata injection, you usually want to focus on a subset of data values common to all your input files. For this example, we derive metadata for the following values:
- Transaction date
- Transaction invoice number
- Net value of the transaction
- Currency used in the transaction
The metadata for these values and the output target text file are created and maintained in the metadata folder.
- Create a new folder named Pentaho_Metadata_Injection_Example on your computer, and then create the following folder structure inside it.
- Create a spreadsheet for supplier #1 in Microsoft Excel, and save it as data_format_a.xlsx in the data/in/supplier1 folder.
Tx Date and Time | Check | Inv | Price | Net | VAT | Gross | VATP | Cur |
---|---|---|---|---|---|---|---|---|
01/26/2016 07:24:00 | 12345 | 9406 | 1.13 | 62.69 | 11.91 | 74.6 | 19 | EUR |
01/26/2016 07:31:00 | 12346 | 9407 | 1.13 | 54.63 | 10.38 | 65.01 | 19 | EUR |
- Create a spreadsheet for supplier #2 in Microsoft Excel, and save it as data_format_b.xlsx in the data/in/supplier2 folder.
TransferDataTime | Check | Inv | Cur | VAT | Gross | VATP | Net |
---|---|---|---|---|---|---|---|
01/26/2016 07:23:00 | 800901 | 3711 | EUR | 8.95 | 56.06 | 19 | 47.11 |
01/26/2016 07:30:00 | 800927 | 3712 | EUR | 0.94 | 5.89 | 19 | 4.95 |
01/26/2016 07:32:00 | 800990 | 3713 | EUR | 15.38 | 96.36 | 19 | 80.98 |
- Create two tabs in a spreadsheet of metadata fields for suppliers, and save it as metadata_suppliers.xlsx in the metadata folder.
- The s upplier1 tab.
Name | Type | Length | Precision | Trim type | Repeat | Format | Currency | Decimal | Grouping | Target field |
---|---|---|---|---|---|---|---|---|---|---|
Tx Date and Time | Date | -1 | -1 | none | N | yyyy/MM/dd HH:mm:ss | transaction_date | |||
Check | Integer | -1 | -1 | none | N | |||||
Inv | Integer | -1 | -1 | none | N | transaction_invoice | ||||
Price | BigNumber | 8 | 2 | none | N | , | . | |||
Net | BigNumber | 8 | 2 | none | N | , | . | transaction_net | ||
VAT | BigNumber | 8 | 2 | none | N | , | . | |||
Gross | BigNumber | 8 | 2 | none | N | , | . | |||
VATP | BigNumber | 2 | 1 | none | N | , | . | |||
Cur | String | 3 | -1 | none | N | transaction_cur |
- The supplie r2 tab.
Name | Type | Length | Precision | Trim type | Repeat | Format | Currency | Decimal | Grouping | Target field |
---|---|---|---|---|---|---|---|---|---|---|
TransferDateTime | Date | none | N | transaction_date | ||||||
Check | Number | none | N | |||||||
Inv | Number | none | N | transaction_invoice | ||||||
Cur | String | none | N | transaction_cur | ||||||
VAT | Number | none | N | |||||||
Gross | Number | none | N | |||||||
VATP | Number | none | N | |||||||
Net | Number | none | N | transaction_net |
- Create a spreadsheet of metadata fields for targets, and save it as metadata_target.xlsx in the metadata folder.
Name | Type | Format | Length | Precision | Currency | Decimal | Group | Trim type | Null |
---|---|---|---|---|---|---|---|---|---|
supplier | String | both | |||||||
source_filename | String | none | |||||||
source_row | Integer | # | none | ||||||
transaction_date | Date | yyyy-MM-dd | none | ||||||
transaction_invoice | BigNumber | # | -1 | -1 | none | ||||
transaction_net | BigNumber | #.## | -1 | -1 | none | ||||
transaction_cur | String | 3 | both |
Step 2: Develop Your Transactions
You need to develop a transformation for the main repetitive process, which is often known as the template transform. For this example, you need a transformation (process_supplier_file) to process the transactions in each supplier’s file. Then, the metadata needs to be injected from a transformation (inject_supplier_metadata) developed with the ETL Metadata Injection step. The ETL Metadata Injection step calls the template transformation. Since this example is for inserting data from multiple files, the metadata injection transformation needs to be called from another transformation (process_all_suppliers) per each supplier file.
All total, three transformations need to be developed:
- Template Transformation – The main repetitive transformation for processing the data per each supplier’s spreadsheet.
- Metadata Injection Transformation – The transformation defining the structure of the metadata and how it is injected into the main transformation.
- Transformation for All Suppliers – The transformation going through all the suppliers’ spreadsheets, calling the metadata injection transformation per each supplier, and logging the entire process (for possible troubleshooting, if needed).
Template Transformation
With metadata injection, you develop your repetitive, template transformation as you would normally. The main difference is how the settings for each step pertains to the metadata injection, instead of data values of a single specific source.
- Open a new transformation and save it as process_supplier_file.ktr to the transformations folder.
- Drag a Microsoft Excel Input step onto the canvas.
- Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the Additional output fields tab.
Field | Value |
---|---|
Full filename field | processed_filename |
Sheet row nr field | source_row |
Short filename field | source_filename |
- Drag a Select values step onto the canvas and connect a hop from Microsoft Excel Input to Select values.
- Drag a Get Variables step onto the canvas and connect a hop from Select values to Get Variables, while selecting Main output of step. Also fill out the following values in the Fields table.
Field | Value |
---|---|
Name | supplier |
Variable | ${supplier} |
Type | String |
Trim type | none |
- Drag a Text file output step onto the canvas and connect a hop from Get Variables to Text file output. Also fill out the following values in the File and Content tabs.
Field | Value |
---|---|
Filename (in File tab) | ${Internal.Entry.Current.Directory}/../data/out/processed_data |
Include date in filename? (in File tab) | enable |
Add filenames to result (in File tab) | disable |
Append (in Content tab) | enable |
- Save your process_supplier_file.ktr file.
Metadata Injection Transformation
For this example, our metadata values are maintained in separate spreadsheet files. You need to create a transformation to extract in these values, prepare them for the injection, and then insert them into the template transformation through the ETL Metadata Injection step, as shown in the following figure:
Extract the Metadata
For this example, you need to define the input files, access the metadata, and structure the output based on this metadata.
- Open a new transformation and save it as inject_supplier_metadata.ktr to the transformations folder.
- Drag a Get row from result step to the canvas, and fill out the following values in the Fields table.
# | Fieldname | Type | Length |
---|---|---|---|
1 | supplier | String | 500 |
2 | folderName | String | 500 |
- Drag a Microsoft Excel Input step onto the canvas, name the step Metadata Suppliers.
- Browse to Add the ${Internal.Entry.Current.Directory}/../metadata/metadata_suppliers.xlsx file in the Files tab. After adding the file, you can verify its path by clicking Show filenames...
- Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the !Fields tab table.
# | Name | Type | Length | Precision | Trim type | Repeat |
---|---|---|---|---|---|---|
1 | Name | String | -1 | -1 | none | N |
2 | Type | String | -1 | -1 | none | N |
3 | Length | Integer | -1 | -1 | none | N |
4 | Precision | Integer | -1 | -1 | none | N |
5 | Trim type | String | -1 | -1 | none | N |
6 | Repeat | String | -1 | -1 | none | N |
7 | Format | String | -1 | -1 | none | N |
8 | Currency | String | -1 | -1 | none | N |
9 | Decimal | String | -1 | -1 | none | N |
10 | Grouping | String | -1 | -1 | none | N |
11 | Target field | String | -1 | -1 | none | N |
- In the Additional output fields tab, set Sheetname field to metadata_supplier.
- Drag another Microsoft Excel Input step onto the canvas, name the step Metadata Target.
- Add the ${Internal.Entry.Current.Directory}/../metadata/metadata_target.xlsx file in the Files tab. After adding the file, you can verify its path by clicking Show filenames...
- Change Spread sheet type (engine) to Excel 2007 XLSX (Apache POI), and fill out the following values in the !Fields tab table.
# | Name | Type | Length | Precision | Trim type | Repeat |
---|---|---|---|---|---|---|
1 | Name | String | -1 | -1 | none | N |
2 | Type | String | -1 | -1 | none | N |
3 | Format | String | -1 | -1 | none | N |
4 | Length | Integer | -1 | -1 | none | N |
5 | Precision | Integer | -1 | -1 | none | N |
6 | Currency | String | -1 | -1 | none | N |
7 | Decimal | String | -1 | -1 | none | N |
8 | Group | String | -1 | -1 | none | N |
9 | Trim type | String | -1 | -1 | none | N |
10 | Null | String | -1 | -1 | none | N |
- In the Additional output fields tab, set Sheetname field to metadata_supplier.
Prepare the Metadata
With the supplier files specified and the metadata accessed, you need to prepare this information to be injected into the main transformation.
- Drag a Join Rows step to the canvas, and connect a hop from the Get rows from result step and Join Rows. Also connect a hop from the Metadata Suppliers step to Join Rows.
- Set The condition in Join Rows to metadata_supplier = supplier.
- Drag an Add constants step to the canvas, and connect a hop from Get rows from result to Add constants. When the Warning dialog appears, click Copy.
- Fill out the Fields table of Add constants with the following values.
# | Name | Type | Value | Set empty string? |
---|---|---|---|---|
1 | wildcard_include | String | .* | N |
2 | wildcard_exclude | String | N |
- Drag a Get file names step to the canvas, and connect a hop from Add constants to Get file names. Also fill out the following values in the File tab.
Field | Value |
---|---|
Filename is defined in a field? | enable |
Get filename from field | folderName |
Get wildcard from field (RegExp) | wildcard_include |
Exclude wildcard field | wildcard_exclude |
Inject the Metadata
With the metadata prepared, you need to associate it with the main transformation for it to be inserted at runtime.
- Drag the ETL Metadata Injection step to the canvas, and connect three hops:
- From Join Rows to ETL Metadata Injection
- From Get File Names to ETL Metadata Injection
- From Metadata Target to ETL Metadata Injection
- In the File tab of the step properties, set ${Internal.Entry.Current.Directory}/process_supplier_file.ktr to Use a file for the transformation template.
- Click Validate and Refresh.
- In the Inject Metadata tab, set the Source step and Source field for the following fields.
Target injection step, key | Source step | Source field | |
---|---|---|---|
Microsoft Excel Input > FIELDS > NAME | Join Rows | Name | |
Microsoft Excel Input > FIELDS > LENGTH | Join Rows | Length | |
Microsoft Excel Input > FIELDS > PRECISION | Join Rows | Precision | |
Microsoft Excel Input > FIELDS > FORMAT | Join Rows | Format | |
Microsoft Excel Input > FIELDS > CURRENCY | Join Rows | Currency | |
Microsoft Excel Input > FIELDS > DECIMAL | Join Rows | Decimal | |
Microsoft Excel Input > FIELDS > GROUP | Join Rows | Grouping | |
Microsoft Excel Input > FIELDS > REPEAT | Join Rows | Repeat | |
Microsoft Excel Input > FIELDS > TYPE | Join Rows | Type | |
Microsoft Excel Input > FILENAME_LINES > FILENAME | Get File Names | filename | |
Select values > METAS > META_NAME | Join Rows | Name | |
Select values > METAS > RENAME | Join Rows | Target field | |
Text file output > OUTPUT_FIELDS > OUTPUT_FIELDNAME | Metadata Target | Name | |
Text file output > OUTPUT_FIELDS > OUTPUT_FORMAT | Metadata Target | Format | |
Text file output > OUTPUT_FIELDS > OUTPUT_LENGTH | Metadata Target | Length | |
Text file output > OUTPUT_FIELDS > OUTPUT_PRECISION | Metadata Target | Precision | |
Text file output > OUTPUT_FIELDS > OUTPUT_CURRENCY | Metadata Target | Currency | |
Text file output > OUTPUT_FIELDS > OUTPUT_DECIMAL | Metadata Target | Decimal | |
Text file output > OUTPUT_FIELDS > OUTPUT_GROUP | Metadata Target | Group | |
Text file output > OUTPUT_FIELDS > OUTPUT_NULL | Metadata Target | Null | |
Text file output > OUTPUT_FIELDS > OUTPUT_TYPE | Metadata Target | Type | |
Text file output > OUTPUT_FIELDS > OUTPUT_TRIM | Metadata Target | Trim type |
- Save your inject_supplier_metadata.ktr file.
Transformation for All Suppliers
Since we have multiple input sources, we need a transformation to run through each source and inject the metadata. Each input source is specified through a variable in a Transformation Executor step, which calls to the metadata injection transformation.
- Open a new transformation and save it as process_all_suppliers.ktr to the transformations folder.
- Drag a Get Subfolder names step to the canvas, and add the ${Internal.Entry.Current.Directory}/../data/in folder in the Folder tab.
- Drag a Select values step to the canvas, and connect a hop from Get Subfolder names to Select values.
- Fill out the following values in the Fields table of the Select & Alter tab.
# | Fieldname | Rename to |
---|---|---|
1 | folderName | |
2 | short_folderName | supplier |
- Drag a Text file output step to the canvas.
- Fill out the following values in the File tab.
Field | Value |
---|---|
Filename | ${Internal.Entry.Current.Directory}/../logging/log |
Include date in filename? | enable |
Include time in filename? | enable |
- Drag a Transformation Executor step to the canvas, connect a hop from Select values to Transformation Executor, then select Main output of step in the context menu.
- Also, connect a hop from Transformation Executor to Text file output, then select This output will contain the execution results in the context menu.
- Fill out the following values in the Parameters and Execution results tabs.
Field | Value |
---|---|
File name | ${Internal.Entry.Current.Directory}/inject_supplier_metadata.ktr |
Variable / Parameter name (in Parameters tab) | supplier |
Field to use (in Parameters tab) | supplier |
The target step for the execution results (in the Execution Results tab) | Text file output |
- Save your process_all_suppliers.ktr file.
Step 3: Run and Examine Results
You run the entire process for all the supplier file by running the process_all_suppliers transformation, which runs the inject_supplier_metadata transformation for each supplier input file. The inject_supplier_metadata transformation then runs the template process_supplier_file transformation.
These transformations create a single source text output file in the data/out folder. The logs generated by the process_all_suppliers transformation are in the logging folder.
- Run the process_all_suppliers transformation.
- Examine the processed_data_{today’s date}.txt file in the data/out folder and the log_{timestamp}.txt file in the logging folder.