
Pentaho Data Services


Prototyping a data model can be time-consuming, particularly when it involves setting up databases, creating the data model, setting up a data warehouse, and then negotiating access so that analysts can visualize the data and provide feedback.

One way to streamline this process is to make the output of a transformation step a Pentaho Data Service. The output of the transformation step is exposed by the data service so that the output data can be queried as if it were stored in a physical table, even though the results of the transformation are not stored in a physical database. Instead, results are published to the Pentaho Server as a virtual table.

Note: You must have a Pentaho Server and repository to publish the data service.

The virtual table is a JDBC-compliant data source that you and others can connect to and query with SQL, provided you can access the server and the transformation. The Pentaho Data Service can be connected to or queried by JDBC-compliant tools such as Pentaho Report Designer, Pentaho Interactive Reports, and CTools, as well as other compatible tools like RStudio, DBVisualizer, or SQuirreL.
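
For example, once connected, you could run an ordinary SQL query against the virtual table, such as the following (the salesSummary service name is hypothetical):

    SELECT * FROM salesSummary WHERE region = 'EMEA'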

The Pentaho Data Service is also useful in instances where building and maintaining a data warehouse is impractical or inefficient, especially when you need to quickly blend and visualize fast-moving or quickly evolving data sets on the fly. For example, if you want to compare your product prices with your competitors' prices, you can create a transformation that blends prices from your in-house data sources with competitor prices. Then, you can convert the output step in the transformation into a Pentaho Data Service that creates a virtual table for querying when you connect to the Pentaho Server. You or others can connect to and query the virtual table, as you would any other JDBC data source, to visualize the results in Analyzer or another tool.

The Pentaho Data Service also has a testing tool. This tool generates several logs and reports that you can use to refine the data service and determine where to apply specialized optimizations. You can also define parameters that others can use to pose customized queries. For example, you can create a data service that publishes a virtual “fact” table of a moderately sized research dataset to a Pentaho Server. You can test and add optimizations and parameters, such as gender or test type, so that the data service runs more quickly. Then, you can share connection and parameter information with a group of researchers, who can query the virtual table. Researchers can use Pentaho Interactive Reporting, a dashboard created with CTools, or an application of their choice, such as RStudio, to analyze and visualize the research dataset.

Pentaho Data Services support a subset of SQL. For more details on what is supported, see Pentaho Data Service SQL support reference and other development considerations. Also, see Components Reference for a complete list of supported traditional data sources.
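
Because each data service exposes a single virtual table, supported queries are typically simple SELECT statements against that one table. For instance, a query with filtering and aggregation might look like the following (service and column names hypothetical):

    SELECT region, COUNT(*) FROM ordersService WHERE status = 'OPEN' GROUP BY region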

Creating a regular or streaming Pentaho Data Service

You can create either a regular data service or a streaming data service. A streaming data service is commonly used when creating a streaming data dashboard with CTools.

Perform the following steps to create a data service, name it, then select the step that outputs the data you want exposed by the data service:

Procedure

  1. Create or open a transformation in the PDI client.

    It is helpful to review Pentaho Data Service SQL support reference and other development considerations as you create or review the transformation so that you understand which SQL commands are supported.
  2. Save the transformation to the Pentaho Server.

  3. Right-click the transformation step that outputs the data you want to make available as a data service, then select Data Services > New.

  4. Enter a unique name for the data service in the Service Name (Virtual Table Name) text box.

    The virtual table that the data service creates has the same name as the data service.
    Note: Ensure the name is unique; no other data service stored locally or published to the Pentaho Server should have the same name.
  5. Verify that the Output step is the step you selected to become the data service in Step 3. If you need to change it, select the correct step name from the list.

  6. If you are working with streaming data, select Streaming for the Data Service Type.

  7. Click OK to save the data service and exit the window.

    The data service badge is added to the step icon.

Next steps

A recommended but optional next step is to follow the instructions in Test a Pentaho Data Service. Testing can help you correct, refine, and optimize your data service so it runs more efficiently.

Data service badge

After you create a Pentaho Data Service from a step, a data service badge is added to that step. The badge designates whether the step has a related regular or streaming data service, as shown in the following images:

Regular and streaming data service badge types

These badges indicate which steps in your transformations are also data services and whether those services are regular or streaming.

Open or edit a Pentaho Data Service

To open a data service for viewing or editing, perform one of the following actions in the PDI client:
  • In the View tab of the Explore pane, click Data Services. Right-click the name of the data service you want to open or edit, then select Edit.
  • Right-click the step, indicated by the data service badge, that contains the data service, then select Data Services > Edit.

Delete a Pentaho Data Service

To delete a data service, perform one of the following actions in the PDI client:
  • In the View tab of the Explore pane, click Data Services. Right-click the name of the data service you want to delete, then select Delete.
  • Right-click the step, indicated by the data service badge, that contains the data service, then select Data Services > Delete.

Test a Pentaho Data Service

After creating your data service, test it to ensure that it runs properly and generates the data you need. Testing can uncover bottlenecks in your transformation and can help you discover places where optimization techniques can be applied.

Run a basic test

To run a basic test on a regular data service, perform the following steps:

Procedure

  1. Verify Data Service Type is set to Regular.

  2. Perform one of the following actions to open the Test Data Service window:

    • In the Data Service window, click Test Data Service.
    • In the View tab of the PDI client Explore pane, click Data Services. Right-click the name of the data service you want, then select Test.
    • Right-click the step attached to the data service as indicated by the data service badge, then select Data Services > Test.
  3. If needed, adjust the following optional settings:

    • Log Level: Sets the amount of detail shown in the logs the test generates. Log results appear in the Query Results, Optimized Queries, Service Transformation Logging, Generated Transformation Logging, Service Metrics, and SQL Trans Metrics tabs. These tabs are detailed in Examine test results.
    • Max Rows: Sets the maximum number of rows you want to see in your test results.
  4. To run the test, click Execute SQL.

  5. Examine the test results using the instructions in Examine test results.

  6. If you need to run another test, clear the cache, then run the test again.

  7. Click Close to exit the window.

  8. Optionally, add an optimization to make the data service run more efficiently.

  9. When you are ready, follow the instructions in Publish a Pentaho Data Service.

Run a streaming test

To test streaming data, the records of the stream must be partitioned into windows (batching) for processing. How the records are batched depends on the window mode you choose. A window can be time-based or row-based. A time-based window collects rows within a specified interval of time, while a row-based window collects a specified number of rows for processing.
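
For example (values hypothetical), a row-based window with a size of 100 rows that advances every 50 rows always contains the 100 most recently collected rows, so each new window overlaps the previous one by 50 rows; a time-based window with a size of 10000 milliseconds that advances every 5000 milliseconds behaves analogously over time intervals.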

To run a test on a streaming data service, perform the following steps:

Procedure

  1. Verify Data Service Type is set to Streaming.

  2. Perform one of the following actions to open the Test Data Service window:

    • In the Data Service window, click Test Data Service.
    • In the View tab of the PDI client Explore pane, click Data Services. Right-click the name of the data service you want to test, then select Test.
    • Right-click the step in the transformation (identified by the data service badge) that the data service is attached to, then select Data Services > Test.
  3. Select the window mode (Time Based or Row Based) used for processing during the test, then specify times (in milliseconds) or numbers of rows for the following settings, depending on the window mode:

    • Window Size: Defines the number of rows that a window will have (row-based), or the time frame, in milliseconds, for capturing new rows to a window (time-based).
    • Every: Sets the number of rows (row-based), or milliseconds (time-based) that should elapse before creating a new window.
    • Limit: Sets the maximum number of milliseconds (row-based) or rows (time-based) to wait for a new window to be generated.
  4. If needed, adjust the following optional settings:

    • Log Level: Sets the amount of detail shown in the logs the test generates. Log results appear in the Query Results, Optimized Queries, Service Transformation Logging, Generated Transformation Logging, Service Metrics, and SQL Trans Metrics tabs. These tabs are detailed in Examine test results.
    • Max Rows: Sets the maximum number of rows you want to see in your test results.
  5. To run the test, click Execute SQL.

  6. Examine the test results using the instructions in Examine test results.

  7. Click Stop to stop execution of the test.

  8. Click Close to exit the window.

  9. Optionally, add an optimization to make the data service run more efficiently.

  10. When you are ready, follow the instructions in Publish a Pentaho Data Service.

Run an optimization test

If you have added an optimization to your data service, you need to test it. First, follow the instructions in the Run a basic test section, and then modify your query in the SQL area to pass a parameter for the optimization, as shown in the following example:

SELECT * FROM Mars WHERE rover='Curiosity'
  • To preview the optimization, click Preview Optimization in the Test Data Service window.
  • Use the instructions in Examine test results to help you interpret your test results.

Examine test results

Test results appear in the tabs in the bottom half of the Test Data Service window.

When you test or run a data service, there are two transformations that run: a service transformation and a generated transformation. The service transformation is the one that you created. When the transformation runs, PDI generates a second transformation (the generated transformation) based on the executed SQL.

The following list describes what appears in each tab and offers tips on how to proceed:

  • Query Results: Shows events that occurred during the test run, as well as the query results. Test tips: Review this tab to ensure that you are getting the correct results. When streaming data, you can observe updates to the data when a new data window is retrieved.
  • Optimized Queries: Displays processing information and the results of optimizations. Optimizations are techniques that can speed processing and help the Pentaho Data Service run more efficiently. This tab shows the starting query as well as the input step queries after push-down optimizations have been applied. Several input queries may be displayed in the tab, one for each input step that supports push down. Test tips: Review this tab for the results of applied optimizations. The results can help you verify that an optimization was applied correctly. Consider using these results along with the Service Metrics and Generated Transformation Logging results to see where optimizations are needed, and to assess whether optimizations are helping the transformation run more efficiently.
  • Service Transformation Logging: Shows logs from the service transformation as it runs. Test tips: Review this tab for problems in the way the transformation is designed or built. If you see errors, review your transformation and any optimizations you might have applied. It is helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together.
  • Generated Transformation Logging: Shows logs from the generated transformation as it runs. Test tips: Review this tab for problems in the way the PDI transformation was generated. If you find issues, look at Pentaho Data Service SQL support reference and other development considerations to ensure the SQL in the transformation is supported. It is also helpful to review the Service Transformation Logging and Generated Transformation Logging tabs together.
  • Service Metrics: Shows a Gantt chart of the amount of time, in milliseconds, it took to run the different parts of the service transformation. Test tips: Review the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them.
  • SQL Trans Metrics: Shows a Gantt chart of the amount of time, in milliseconds, it took to run the generated transformation. Test tips: Review the contents of this tab to find bottlenecks and to identify candidates for step optimization. Compare the Service Metrics and SQL Trans Metrics tabs to see if the same bottlenecks exist. If different bottlenecks occur, consider refining your transformation to eliminate or reduce them.

Publish a Pentaho Data Service

To publish your data service, save the transformation containing the data service to the Pentaho Repository.

Perform the following steps to validate that your data service has been published:

Procedure

  1. Since the Pentaho Repository resides on the Pentaho Server, ensure that any external assets needed by the transformation can be accessed remotely, even if the transformation is run by a different user.

  2. Open a browser, go to the Pentaho Server, and log in. If you have installed the Pentaho Server locally, the URL is usually localhost:8080.

  3. Validate that your data service was published by listing the data services on the server, as shown in the following example:

    http://<Pentaho Server Host:Port>/pentaho/kettle/listServices
    

Results

You are now ready to share the data service with others.

Share a Pentaho Data Service with others

With your data service now created and tested, you can share it with others so they can identify it, connect to it, and query it.

Perform the following actions to share your data service:

Procedure

  1. Ensure that the person or group with whom you want to share your data service has the following permissions:

    • Connect, read, and execute a query on the Pentaho Server to retrieve data service data.
    • Read and execute the transformation that contains the data service.
  2. Provide the person or group with the unique data service name as it is saved in the Pentaho Repository.

  3. If you have optimized the data service, provide the person or group with names and definitions for any parameters that you have created so they can use the parameters when they query the data service.

  4. Provide the person or group with the instructions in Connect to the Pentaho Data Service from a Pentaho tool or Connect to the Pentaho Data Service from a Non-Pentaho tool depending on how the service is used.

Connect to the Pentaho Data Service from a Pentaho tool

A Pentaho Data Service is a virtual table that contains the output of a step in a PDI transformation. You can connect to and query a regular Pentaho Data Service from any Pentaho tool, such as Report Designer, the PDI client (Spoon), and Analyzer. You can connect to and query a streaming Pentaho Data Service from a dashboard created with CTools.

Note: To connect to and query the Pentaho Data Service, you need to know the data service name and have permission to run the transformation and to access the Pentaho Server where it is published.

Connecting to the data service from another Pentaho tool is the same as connecting to a database. For information on connecting to a database, refer to Define data connections for the Pentaho Server. The following lists describe the parameters needed to make a Pentaho Data Service connection:

Required parameters:

  • Connection Name: The unique name of the data service you want to access.
  • Connection Type: Pentaho Data Services.
  • Access: Native (JDBC).
  • Hostname: Hostname or IP address of the Pentaho Server. By default, this is localhost if the Pentaho Server is running locally.
  • Port Number: Port number of the Pentaho Server the data service will run on. The default is 8080.
  • Web App Name: Name of the web application. The default value is pentaho, which is typically used by the other Pentaho tools.
  • Username: Name of a user who has permission to run the data service.
  • Password: Password for a user who has permission to run the data service.

You can also set the following optional parameters.

  • proxyhostname: Proxy server for HTTP connections.
  • proxyport: Proxy server port.
  • nonproxyhosts: Hosts that do not use the proxy server. If there is more than one host name, separate them with commas.
  • debugtrans: Optional name of the file where the generated transformation is stored so that it can be debugged, for example /tmp/debug.ktr. Specify the name of the transformation or a path plus the name of the transformation.
  • PARAMETER_[optionname]=value: Sets the value for a parameter in the transformation, where [optionname] is the name of the parameter and value is the value assigned to it; PARAMETER_ is placed before the option name. For example, if the name of the parameter is model, set the parameter as PARAMETER_model=E6530.
  • secure: Set this parameter to TRUE to use the secure HTTPS protocol to connect to the data service. If you omit this parameter or set it to FALSE, the standard unsecure HTTP protocol is used.
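
When you connect through a raw JDBC URL instead of a Pentaho connection dialog (see Connect to the Pentaho Data Service from a Non-Pentaho tool), these options are appended to the connection string as query options. For example, the following hypothetical string sets the web application name, a transformation parameter, and HTTPS:

    jdbc:pdi://myserver:8443/kettle?webappname=pentaho&PARAMETER_model=E6530&secure=TRUE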

Connect to the Pentaho Data Service from a Non-Pentaho tool

A Pentaho Data Service is a virtual table that contains the output of a step in a PDI transformation. You can identify, connect to, and query the Pentaho Data Service from a non-Pentaho tool, like RStudio or SQuirreL. For a streaming data service, you must use a JDBC driver from your non-Pentaho tool to access streaming operations.

Note: To connect to and query the Pentaho Data Service, you need to know the data service name and have permission to run the transformation and to access the Pentaho Server where it is published.

Before you can connect to and run a Pentaho Data Service from a non-Pentaho tool like SQuirreL or Beaker, you need to download and install the Pentaho Data Service JDBC driver, then create a connection to the data service from your tool. The driver is bundled with PDI. If you want someone who does not have PDI to connect to your data service, you will need to download the driver and give it to them so that they can install it.

Step 1: Download the Pentaho Data Service JDBC driver

You can download the driver using the PDI client or manually.

Download using the PDI client
To download the Pentaho Data Service JDBC driver using the PDI client, perform the following steps:

Procedure

  1. Open the transformation and identify the data service step indicated by the data service badge.

  2. Perform one of the following actions:

    • Right-click the step, then select Data Services > Driver Details.
    • In the transformation's View tab, open the Data Services folder, then right-click the data service and select Driver Details.
    • In the Data Services window, click Driver Details.
  3. In the Driver Details window, click Get Driver.

  4. In the Save As window, indicate where you want to save the Pentaho-Data-Service-Driver.zip file, then click Save.

  5. Click Close to exit the window.

  6. Follow the instructions in Install the Pentaho Data Service JDBC driver.

Download manually
To manually download the Pentaho Data Service JDBC driver, perform the following steps:

Procedure

  1. Go to the <Pentaho installation directory>/design-tools/data-integration/Data Service JDBC Driver directory and retrieve the zipped driver file from it.

  2. Follow the instructions in Install the Pentaho Data Service JDBC driver.

Step 2: Install the Pentaho Data Service JDBC driver

To install the Pentaho Data Service JDBC driver, complete these steps.

Procedure

  1. Extract the driver files that you downloaded in the Download the Pentaho Data Service JDBC driver instructions and copy them into the JDBC directory of your application. For example, if you want to access the data service from RStudio, copy the driver files into the same directory as the other JDBC drivers. If you are not sure where this is, check the documentation for that application.

  2. Stop and restart the application.

  3. In the application, make a connection to the data service, using the instructions in Connect to the Pentaho Data Service from a Non-Pentaho tool.

Step 3: Create a connection from a non-Pentaho tool

Once the driver is installed, you will need to create the connection to the Pentaho Data Service. For many tools, you can connect by specifying a connection object. Review the connection details and options in Connect to the Pentaho Data Service from a Pentaho tool.

You will probably also need the following JDBC driver class parameter and its value:

  • JDBC Driver Class

    org.pentaho.di.trans.dataservice.jdbc.ThinDriver

The JDBC connection string uses the following format:

jdbc:pdi://<Pentaho Server Hostname:Port>/kettle?option=value&option=value

The following example shows how you might format a connection string:

jdbc:pdi://localhost:8080/kettle?webappname=pentaho

The webappname is required if the data service is running on the Pentaho Server.
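
As an illustration, the following minimal Java sketch connects through the thin driver and runs a query; the host, credentials, and employeeList service name are hypothetical, and most reporting tools handle these details for you through their connection dialogs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DataServiceQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the Pentaho thin driver (included in the driver zip).
            Class.forName("org.pentaho.di.trans.dataservice.jdbc.ThinDriver");

            // webappname is required when the data service runs on the Pentaho Server.
            String url = "jdbc:pdi://localhost:8080/kettle?webappname=pentaho";

            // Hypothetical credentials for a user permitted to run the data service.
            try (Connection conn = DriverManager.getConnection(url, "admin", "password");
                 Statement stmt = conn.createStatement();
                 // employeeList is the virtual table created by the data service.
                 ResultSet rs = stmt.executeQuery("SELECT * FROM employeeList")) {
                while (rs.next()) {
                    // Print the first column of each returned row.
                    System.out.println(rs.getString(1));
                }
            }
        }
    }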

Query a Pentaho Data Service

You can use SQL to query a data service. You can also add a conditional statement to the query if the data service transformation uses a parameter. The parameter must be added to the transformation by an ETL developer before the data service is created. See Creating a regular or streaming Pentaho Data Service for more details. You can then assign a value to the parameter in your query.

To find the name of the table to query, connect to the data service, then use your tool's explorer to find the name of the table. The name of the table is usually the same as the name of the data service.


Example

The following example of a SQL query shows how to include a parameter. The example queries the employeeList data service for records, and when the data service transformation is executed, any use of the parameter employeeRegion in it will be substituted with the value USA EAST. This parameter can be in any transformation step that allows variable or parameter substitution.

You can also assign a value to the parameter in the Connection Properties window.

Use the following sample syntax:

SELECT * FROM '<data service name>' WHERE PARAMETER('<parameter_name>') = '<parameter_value>'

The following query would be used for our example:

   SELECT * FROM 'employeeList' WHERE PARAMETER('employeeRegion')='USA EAST'
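
Alternatively, you can supply the parameter value on the connection itself with the PARAMETER_ option described in Connect to the Pentaho Data Service from a Pentaho tool. In a raw JDBC URL this might look like the following hypothetical string (note that a value containing a space would need to be URL-encoded):

    jdbc:pdi://localhost:8080/kettle?webappname=pentaho&PARAMETER_employeeRegion=USA%20EAST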

Monitor a Pentaho Data Service

To monitor a data service, enter one of the following URLs in your browser:

  • List Names of Data Services on the Pentaho Server

    http://<Pentaho Server Host:Port>/pentaho/kettle/listServices
  • List Status of Data Services on the Pentaho Server

    http://<Pentaho Server Host:Port>/pentaho/kettle/status

Replace <Pentaho Server Host:Port> with the host name or IP address and the port of the Pentaho Server running the data service. You need access to the Pentaho Server and must be authenticated to run these commands.
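
For example, on a local default installation, you can list the status of data services at http://localhost:8080/pentaho/kettle/status.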