The Pentaho Metaverse provides metadata lineage capabilities for the Pentaho Universe. Pentaho Data Integration (PDI) is a major source of lineage information. The metaverse mines metadata and builds a connected relationship model among all the pieces it knows about. The end result is a graph model which allows for lineage (finding where/what contributed to something) and impact analysis (determining what would be affected downstream if something were changed). The metaverse leverages OSGi (Blueprint) for modularity and extensibility: if something is not supported out of the box, the metaverse can accept components via OSGi bundles which extend its capabilities.
Kettle supports transformations and jobs, each of which is composed of smaller bite-sized operations. A transformation is made up of steps and a job is made up of job entries. Conceptually, these can be thought of as analogs. Kettle provides hundreds of unique steps and job entries which each perform a specific task. As far as the metaverse is concerned, each one of these steps and job entries is a potential source of metadata with respect to lineage.
The metaverse is composed of analyzers, each responsible for mining lineage information from a specific "thing." Document analyzers know how to extract lineage information from documents. PDI produces two document types, transformations (KTR) and jobs (KJB), and for each there is a corresponding document analyzer. Each document analyzer examines the sub-components (the steps comprising a transformation or the job entries comprising a job) and assigns each sub-component a specific step analyzer or job entry analyzer, if one exists for that BaseStepMeta implementation.
The out-of-the-box set of analyzers is limited. When a step or job entry has no corresponding analyzer, a generic fallback analyzer is used. To contribute a new step or job entry analyzer to the system, implement the required interface(s) and register the implementation as a service via OSGi (Blueprint) so it becomes available to the system.
The preferred approach to adding a step analyzer is through OSGi. We realize that there are many legacy PDI plugins which provide steps and job entries. While it is not part of this project to convert these plugins to OSGi for the purpose of contributing analyzers, you can add step analyzers to an existing PDI plugin; see the section "Adding Analyzers from Existing PDI Plug-ins (non-OSGi)" below.
This example will outline the steps taken to create the sample-metaverse-bundle. It demonstrates how to create a new step analyzer for the Table Input Step.
Create a New Maven Project
The easiest way to get started is to use the karaf-bundle-archetype to create a new Maven project which generates a bundle artifact which works in a Karaf container.
Add maven dependencies to pentaho-metaverse-api and kettle jars in your pom.xml file.
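For example, the dependency section of your pom.xml might look like the following. The group and artifact IDs are the commonly published Pentaho coordinates, but the versions are placeholders; match them to your Pentaho release.

```xml
<!-- Versions are placeholders: set pentaho.version to match your installation -->
<dependencies>
  <dependency>
    <groupId>pentaho</groupId>
    <artifactId>pentaho-metaverse-api</artifactId>
    <version>${pentaho.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-core</artifactId>
    <version>${pentaho.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>pentaho-kettle</groupId>
    <artifactId>kettle-engine</artifactId>
    <version>${pentaho.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

The `provided` scope keeps these jars out of your bundle, since they are already available in the container at runtime.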
Create a Class which Implements IStepAnalyzer
At a minimum, you will need to create a Java class which implements the IStepAnalyzer interface (for a job entry analyzer, you would implement IJobEntryAnalyzer). The IStepAnalyzer interface only requires that you implement the analyze and getSupportedSteps methods. On its own, however, the interface is a black box and does little to make the developer's life easier. Step analyzers follow a common pattern:
- Model the step itself in the graph as a node.
- Link all stream fields which are inputs into the step to that node, if any.
- Determine the outputs of the step, if any, then create and link those nodes to the step node.
- Add links to the fields which the step actually uses, if any.
- Add links from the input fields to the output fields.
Virtually all implementations benefit from extending the common base class StepAnalyzer, which provides an implementation of all of those common tasks. Below is a simple implementation of a step analyzer for the Dummy step. There is nothing special about this step which warrants a custom step analyzer, but for the purpose of this document we will add a custom property to the step node. This is done in the customAnalyze method:
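A minimal sketch of such an analyzer is shown below. The class and package names are illustrative, and the method signatures are those of the pentaho-metaverse-api StepAnalyzer base class as best understood; verify them against the API version you build against.

```java
package com.example.metaverse; // hypothetical package

import java.util.HashSet;
import java.util.Set;

import org.pentaho.di.trans.step.BaseStepMeta;
import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;
import org.pentaho.metaverse.api.IMetaverseNode;
import org.pentaho.metaverse.api.MetaverseAnalyzerException;
import org.pentaho.metaverse.api.StepField;
import org.pentaho.metaverse.api.analyzer.kettle.step.StepAnalyzer;

public class DummyStepAnalyzer extends StepAnalyzer<DummyTransMeta> {

  @Override
  public Set<Class<? extends BaseStepMeta>> getSupportedSteps() {
    // Declare which step metadata class this analyzer handles
    Set<Class<? extends BaseStepMeta>> supported = new HashSet<>();
    supported.add( DummyTransMeta.class );
    return supported;
  }

  @Override
  protected Set<StepField> getUsedFields( DummyTransMeta meta ) {
    // The Dummy step does not actually use any input fields
    return null;
  }

  @Override
  protected void customAnalyze( DummyTransMeta meta, IMetaverseNode rootNode )
      throws MetaverseAnalyzerException {
    // Add a custom property to the step node in the graph
    rootNode.setProperty( "do_nothing", true );
  }
}
```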
Create the Blueprint Configuration
Blueprint provides a dependency injection framework for OSGi. The metaverse has two injection points: a reference list of all services registered in the container for the IStepAnalyzer interface, and another for the IJobEntryAnalyzer interface. When the container detects a new service providing an implementation of one of those interfaces, the metaverse adds it to its set of known analyzers. The next time a step whose metadata class matches the one your analyzer supports (DummyTransMeta in our example) is analyzed, your new step analyzer will be used and your overridden methods will be called.
Create a blueprint.xml file in src/main/resources/OSGI-INF/blueprint/ folder. (Create the folders, if needed.)
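A minimal blueprint.xml might look like the following. The bean id and class name are placeholders matching the sketch above; the service interface is the metaverse API's IStepAnalyzer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0">

  <!-- Instantiate the analyzer (class name is a placeholder) -->
  <bean id="dummyStepAnalyzer" class="com.example.metaverse.DummyStepAnalyzer"/>

  <!-- Publishing the service is what makes the metaverse discover the analyzer -->
  <service id="dummyStepAnalyzerService"
           interface="org.pentaho.metaverse.api.analyzer.kettle.step.IStepAnalyzer"
           ref="dummyStepAnalyzer"/>

</blueprint>
```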
Build and Test Your Bundle
1. Build your bundle with Maven and install it into your local Maven repository. From there, you can test it in the Pentaho DI server.
2. Start Pentaho Data Integration in debug mode. Once it has started, ssh into the running Karaf container. The ssh credentials are "karaf"/"karaf".
3. Install your bundle from the Maven repository.
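The steps above might look like the following from a terminal. The Maven coordinates are the placeholder ones from this example, and the Karaf ssh port varies by release; check your installation's Karaf configuration.

```shell
# 1. Build the bundle and install it into the local Maven repository
mvn clean install

# 2. With PDI running, ssh into its embedded Karaf container
#    (8102 is a common default for the PDI Karaf ssh port; verify yours)
ssh -p 8102 karaf@localhost        # password: karaf

# 3. From the Karaf console, install and start the bundle from the local repo
bundle:install -s mvn:com.example/sample-metaverse-bundle/1.0-SNAPSHOT
```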
See it in Action
It is assumed that you have set up your system for data lineage. If you have not already done so, see "Setup" for data lineage.
- Save a transformation which contains a step you want to explore with the analyzer. (In the sample, use the Table Input step.)
- Connect a remote debugger to PDI on port 5005. Set a breakpoint in your step analyzer's implementation.
- Execute your transformation from PDI.
- You should hit your breakpoint when the step you are exploring is assessed by the transformation analyzer. The execution will generate a GraphML file (along with an execution profile) for the transformation. You can find these files in data-integration/pentaho-lineage-output/<Date of execution>/original/path/to/the/file/. You can use a tool such as yEd to view the GraphML files.
Working with yEd can be difficult. We have created a configuration for yEd to help ease the pain of viewing these graphs which you can access here: https://github.com/pentaho/pentaho-engineering-samples/tree/master/Supplementary Files/yED Configuration Files. Read the README text file for help.
In yEd, you will need to apply a layout to view the graph properly. Otherwise, all of the nodes will overlap each other.
Different Types of Step Analyzers
In the process of implementing custom step analyzers, we discovered a few generic patterns based on the type of step.
- First, there are the traditional steps which just take some input fields, manipulate them in some fashion, and then output them.
- The second type are the input and output steps. These steps use an external resource (file, database, web service, etc) to read or write data.
- The last is a more specific form of the second: one which requires a logical connection to an external resource, typically a database or NoSQL data store.
These patterns are the basis for the three main base classes you might consider extending when implementing a custom step analyzer.
If the step you are writing a custom analyzer for is just a traditional step which manipulates data or fields to produce outputs different from its inputs, then you should extend StepAnalyzer. An example of this kind of step analyzer is StringsCutStepAnalyzer, for the Strings cut step. The graph model is much easier to understand when looking at an example.
Below is the basic graph model for the Strings cut step:
In this example, three fields are inputs into the step: 'FirstName', 'LastName', and 'Middle Name'. Four fields are derived as the outputs: 'FirstName', 'LastName', 'MI' (middle initial), and 'Middle Name'. The Strings cut step uses just the 'Middle Name' input field to create the 'MI' output field from its first character.
Looking at the graph, you can see that there are four 'derives' links corresponding to the four output fields. The 'Middle Name' input field results in two 'derives' links, one to the 'Middle Name' output field and one to the 'MI' output field. The base StepAnalyzer will create the inputs and outputs (fields and links) for you, but it is up to you to inform the base analyzer about the fields the step uses and which fields derive other fields.
Override the getUsedFields method to supply the fields used by the step. In the example above, the only input field used by the step is 'Middle Name'.
To supply the non-passthrough derives links and operation information, override the getChangeRecords method. In the above example, the non-passthrough derives link from the 'Middle Name' input field to the 'MI' output field is created from this override method.
By default, the base implementation determines whether a field is a passthrough field. If this logic isn't sufficient for your step, override the isPassthrough method, as StringsCutStepAnalyzer does. The default logic assumes that if there is an output field with a name identical to an input field's, then it is a passthrough.
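The three override points might be sketched as follows for a Strings-cut-like analyzer. The method names come from the pentaho-metaverse-api base class as best understood; the field names, helper calls, and operation description are illustrative, so verify them against the API you build with.

```java
// Partial sketch of a StringsCutStepAnalyzer (illustrative, not the shipped class)
import java.util.HashSet;
import java.util.Set;

import org.pentaho.di.trans.steps.stringcut.StringCutMeta;
import org.pentaho.metaverse.api.ChangeType;
import org.pentaho.metaverse.api.MetaverseAnalyzerException;
import org.pentaho.metaverse.api.StepField;
import org.pentaho.metaverse.api.analyzer.kettle.ComponentDerivationRecord;

public class StringsCutStepAnalyzer extends StepAnalyzer<StringCutMeta> {

  @Override
  protected Set<StepField> getUsedFields( StringCutMeta meta ) {
    // Only the 'Middle Name' input field is actually read by the step
    Set<StepField> usedFields = new HashSet<>();
    usedFields.addAll( createStepFields( "Middle Name", getInputs() ) );
    return usedFields;
  }

  @Override
  public Set<ComponentDerivationRecord> getChangeRecords( StringCutMeta meta )
      throws MetaverseAnalyzerException {
    // 'MI' is derived from 'Middle Name': a non-passthrough 'derives' link
    Set<ComponentDerivationRecord> changes = new HashSet<>();
    changes.add( new ComponentDerivationRecord( "Middle Name", "MI", ChangeType.DATA ) );
    return changes;
  }

  @Override
  protected boolean isPassthrough( StepField originalFieldName ) {
    // Default: an identically named output field means passthrough;
    // replace this logic if your step behaves differently
    return super.isPassthrough( originalFieldName );
  }
}
```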
External Resource (input/output)
If you are writing a custom analyzer for a step which reads or writes data from an external source like a file, extend ExternalResourceStepAnalyzer. An example analyzer that extends this is TextFileOutputStepAnalyzer.
Above is a typical file-based output step graph diagram (CSV would be very similar). This kind of step analyzer is different in that it creates resource nodes for fields and files which it touches (the yellow boxes). To accomplish this in a custom step analyzer, there are a few steps you must take. First, you must implement the abstract methods defined in ExternalResourceStepAnalyzer:
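The abstract methods might be implemented as follows for a file-based output step. This is a sketch modeled on the general shape of TextFileOutputStepAnalyzer; the constant and the createFileNode helper are from the metaverse API as best understood, so treat the exact names as assumptions.

```java
// Sketch: inside a class extending ExternalResourceStepAnalyzer<TextFileOutputMeta>
@Override
public IMetaverseNode createResourceNode( IExternalResourceInfo resource )
    throws MetaverseException {
  // Create a resource node for the file being written
  return createFileNode( resource.getName(), descriptor );
}

@Override
public String getResourceInputNodeType() {
  // An output step does not read fields from a resource
  return null;
}

@Override
public String getResourceOutputNodeType() {
  // The fields written to the file are "file field" nodes
  return DictionaryConst.NODE_TYPE_FILE_FIELD;
}

@Override
public boolean isInput() {
  return false;
}

@Override
public boolean isOutput() {
  return true;
}
```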
Next, you need to create a class which implements IStepExternalResourceConsumer. You will want to extend the base class BaseStepExternalResourceConsumer to help make your job a bit easier. External Resource Consumers are used in two places: once when the execution profiles are generated to determine what resources are read from/written to, and once by the step analyzers. In your blueprint.xml file, you will need to define the bean, publish the service, and inject the bean into your step analyzer:
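The blueprint.xml wiring for that might look like the following sketch. The class names and the externalResourceConsumer property name are illustrative; check the property name your analyzer base class actually exposes.

```xml
<!-- Define the external resource consumer bean -->
<bean id="textFileOutputERC" scope="singleton"
      class="com.example.metaverse.MyTextFileOutputExternalResourceConsumer"/>

<!-- Publish it as a service so it is used when execution profiles are generated -->
<service id="textFileOutputERCService"
         interface="org.pentaho.metaverse.api.analyzer.kettle.step.IStepExternalResourceConsumer"
         ref="textFileOutputERC"/>

<!-- Inject the consumer into the step analyzer bean -->
<bean id="myFileOutputStepAnalyzer" class="com.example.metaverse.MyFileOutputStepAnalyzer">
  <property name="externalResourceConsumer" ref="textFileOutputERC"/>
</bean>
```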
The custom logic in TextFileOutputStepAnalyzer lies in determining which fields the step uses, namely the fields which are actually written to the file.
Connection-Based External Resource
If the step you are writing a custom analyzer for is using a connection like a database connection, then you should extend ConnectionExternalResourceStepAnalyzer. An example of this type of analyzer is TableOutputStepAnalyzer. Connection-based analyzers are just a more specific type of step analyzer than the external resource step analyzers. It is an external resource analyzer which also has a connection analyzer and understands the concept of a table.
Any IStepAnalyzer can optionally support a property called connectionAnalyzer. A connection analyzer is a specific type of analyzer whose job is to build the relationships and nodes for an external connection. Examples include connection analyzers for traditional databases, NoSQL databases, HDFS, and so on. The metaverse exposes two IDatabaseConnectionAnalyzer services for reuse in external bundles (like the one outlined here). You can inject either stepDatabaseConnectionAnalyzer or jobEntryConnectionAnalyzer into your analyzer by obtaining a reference to the exposed service (see below). If you need a custom connectionAnalyzer, you can implement your own and use that in your bundle.
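Obtaining the exposed service in your own blueprint.xml might look like the following sketch. The component-name attribute is the standard Blueprint way to match a service by the publishing bundle's bean id; the interface package and the analyzer class name here are assumptions to verify against your metaverse API version.

```xml
<!-- Reference the connection analyzer service exposed by the metaverse -->
<reference id="stepDatabaseConnectionAnalyzerRef"
           interface="org.pentaho.metaverse.api.analyzer.kettle.IDatabaseConnectionAnalyzer"
           component-name="stepDatabaseConnectionAnalyzer"
           availability="mandatory"/>

<!-- Inject it into your connection-based step analyzer -->
<bean id="myTableStepAnalyzer" class="com.example.metaverse.MyTableStepAnalyzer">
  <property name="connectionAnalyzer" ref="stepDatabaseConnectionAnalyzerRef"/>
</bean>
```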
Adding Analyzers from Existing PDI Plug-ins (non-OSGi)
The main difference is how you register your analyzer with the rest of the metaverse analyzers. Since this isn't an OSGi bundle, the blueprint configuration is not an option. Instead, you will have to create a KettleLifecyclePlugin which instantiates your analyzer class and registers it with PentahoSystem.
There is an example of how this process works in a branch on my GitHub repo (a fork of the load-text-from-file-plugin originally written by Matt Burgess). Here is the branch; the following snippets reference specific files in it: https://github.com/rfellows/load-text-from-file-plugin/tree/BACKLOG-3147.
- The step analyzer class: https://github.com/rfellows/load-text-from-file-plugin/blob/BACKLOG-3147/src/main/java/org/pentaho/di/trans/steps/loadtextfromfile/LoadTextFromFileAnalyzer.java
- The external resource consumer class: https://github.com/rfellows/load-text-from-file-plugin/blob/BACKLOG-3147/src/main/java/org/pentaho/di/trans/steps/loadtextfromfile/LoadTextFromFileExternalResourceConsumer.java
- The life cycle listener: https://github.com/rfellows/load-text-from-file-plugin/blob/BACKLOG-3147/src/main/java/org/pentaho/di/trans/steps/loadtextfromfile/LoadTextPluginLifecycleListener.java
The life cycle listener is a new plug-in:
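A sketch of such a listener is shown below, modeled on LoadTextPluginLifecycleListener from the referenced branch. The annotation and interface are from the Kettle API; the registration calls assume PentahoSystem.registerObject, so compare against the actual file linked above.

```java
// Lifecycle plug-in that registers the analyzer with PentahoSystem at startup
@KettleLifecyclePlugin( id = "LoadTextPluginLifecycleListener" )
public class LoadTextPluginLifecycleListener implements KettleLifecycleListener {

  @Override
  public void onEnvironmentInit() throws LifecycleException {
    // Instantiate the analyzer and its external resource consumer
    LoadTextFromFileAnalyzer analyzer = new LoadTextFromFileAnalyzer();
    LoadTextFromFileExternalResourceConsumer consumer =
        new LoadTextFromFileExternalResourceConsumer();
    analyzer.setExternalResourceConsumer( consumer );

    // Register both so the metaverse can discover them without Blueprint
    PentahoSystem.registerObject( analyzer );
    PentahoSystem.registerObject( consumer );
  }

  @Override
  public void onEnvironmentShutdown() {
    // Nothing to clean up
  }
}
```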