Skip to main content

Pentaho+ documentation is moving!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com

 

Hitachi Vantara Lumada and Pentaho Documentation

Define data connections

Parent article

PUC (Pentaho User Console) and PDI (Pentaho Data Integration) can access data from various sources. If your data source is a database, you must connect to the database before accessing its records. You define database connections in PUC and PDI through the Database Connection dialog box.

Before you can create a connection, the appropriate driver must be installed for your particular data connection. Your IT administrator should be able to install the appropriate driver for you.

Open the connection dialog box

You can access the Database Connection dialog box from either PUC or PDI.

Open the Database Connection dialog box from PUC

Perform the following steps to open a new database connection in PUC:

Procedure

  1. Log on to PUC, then click on File Manage Data Sources.

    The Manage Data Sources dialog box appears.
  2. Select New Connection in the More actions and options menu.

    The Database Connection dialog box appears, as shown below:Database Connection dialog box
  3. Enter your data connection information and test.

    See Enter database connection information for further details.

Open the Database Connection dialog box from PDI

Perform the following steps to open a new database connection in PDI:

Procedure

  1. Start the PDI client (Spoon) and create a new transformation or job.

  2. In the View tab of the Explorer pane, double-click on the Database connections folder.

    The Database Connection dialog box appears, as shown below:Database Connection dialog box
  3. Enter your data connection information and test.

    See Enter database connection information for further details.

Next steps

In PDI, you can define connections to multiple databases provided by multiple database vendors such as MySQL and Oracle. PDI ships with the most suitable JDBC drivers for PostgreSQL, our default database.

Pentaho recommends avoiding ODBC connections. The ODBC to JDBC bridge driver does not always provide an exact match and adds another level of complexity, which affects performance. The only time you may have to use ODBC is if no JDBC driver is available. For details, see the Pentaho Community article on why you should avoid ODBC.

When you define a database connection in PDI, the connection information (such as the user name, password, and port number) is stored in the Pentaho Repository and is available to other users when they connect to the repository. If you are not using the Pentaho Repository, the database connection information is stored in the XML file associated with your transformation or job.

You must have information about your database (such as your database type, port number, user name and password) before you define a JDBC connection. In PDI, you can also set connection properties as variables. Through such variables, your transformations and jobs can access data from multiple database types.

Make sure to use clean ANSI SQL that works on all the database types used.

Enter database connection information

The type of database connection information entered depends on your access protocol. The following sections show examples for Native (JDBC) and OCI protocols:

Native (JDBC) protocol information

Perform the following steps to create a Native (JDBC) connection in the Database Connection dialog box:

Procedure

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).
  2. In the Database Type list, select the database you want to use (for example, MySQL or Oracle).

  3. In the Access list, select Native (JDBC). The access protocol which appears depends on the database type you select.

  4. In the Settings section, enter the following information:

    FieldDescription
    Host NameThe name of the server that hosts the database to which you are connecting. Alternatively, you can specify the host by IP address.
    Database NameThe name of the database to which you are connecting. If you are using a ODBC connection, enter the Data Source Name (DSN) in this field.
    Port NumberThe TCP/IP port number (if it is different from the default)
    User NameOptional user name used to connect to the database
    PasswordOptional password used to connect to the database
  5. Click Test.

    A success message appears if the connection is established.
  6. Click OK to close the connection test dialog box.

  7. To save the connection, click OK to close the Database Connection dialog box.

    • In PUC, your connection name appears in the list of available data sources in the Manage Data Sources dialog box.
    • In PDI, your connection name appears under the Database connections folder in the View tab.

OCI protocol information (PDI only)

Perform the following steps to create an OCI connection in the PDI Database Connection dialog box:

Procedure

  1. In the Connection Name field, enter a name that uniquely describes this connection.

    The name can have spaces, but it cannot have special characters (such as #, $, and %).
  2. In the Database Type list, select Oracle.

  3. In the Access list, select OCI. The access protocol which appears depends on the database type you select.

  4. In the Settings section, enter the following information as directed by the Oracle OCI documentation.

    FieldDescription
    SIDThe Oracle system ID that uniquely identifies the database on the system
    Tablespace for DataThe name of the tablespace where the data is stored
    Tablespace for IndicesThe name of the tablespace where the indices is stored
    User NameThe user name used to connect to the database
    PasswordThe password used to connect to the database
  5. Click Test.

    A success message appears if the connection is established.
  6. Click OK to close the connection test dialog box.

  7. To save the connection, click OK to close the Database Connection dialog box.

Next steps

If you want to use Advanced, Options, or Pooling for your OCI connection, refer to the Oracle OCI documentation to understand how to specify these settings.

Edit existing connections

Once a connection has been established, you can open the Database Connection dialog box to refine and change aspects of the connection.

Edit database connections in PUC

Perform the following steps to edit an existing database connection in PUC:

Procedure

  1. Click File Manage Data Sources.

    The Manage Data Sources dialog box appears.
  2. Highlight the name of the data source you want to edit and select Edit in the More actions and options menu.

    The Database Connection dialog box appears with General highlighted in the left navigation pane.
  3. Select items in the left navigation pane to further refine your connection or to enable/disable connection options.

Edit database connections in PDI

Perform the following steps to edit an existing database connection in PDI:

Procedure

  1. With a transformation or job open in the PDI client canvas, expand the Database connections folder in the View tab of the Explorer pane

  2. Right-click on a connection name and select Edit. The Database Connection dialog box appears with General highlighted in the left navigation pane.

  3. Select items in the left navigation pane to further refine your connection or to enable/disable connection options.

Specify advanced configuration of database connections

Use the Advanced option in the Database Connection dialog box to configure properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.

Procedure

  1. Open the Database Connection dialog box in PUC or PDI.

  2. Click Advanced on the left pane.

    The available options depend on whether you are using PUC or PDI as shown below:Advanced tab in the  (left) and  (right) Database        Connection dialog boxes
  3. Check the appropriate boxes and enter the SQL statements as described in the following table:

    IdentifierDescription
    Supports the Boolean data type*Instructs PDI to use native Boolean data types supported by the database.
    Supports the timestamp data type*Instructs PDI to use the timestamp data type supported by the database.
    Quote all in databaseEnables case-sensitive table names. For example, MySQL is case-sensitive on Linux, but not case-sensitive on Microsoft Windows. If you quote the identifiers, the databases uses a case-sensitive table name.
    Force all to lower-caseEnables the system to change the case of all database to lower-case.
    Force all to upper-caseEnables the system to change the case of all identifiers to upper-case.
    Preserve case of reserved words*Instructs PDI to use a list of reserved words supported by the database.
    Preferred schema name*For PDI, enter the preferred schema name (for example, MYSCHEMA).
    SQL statementsEnter the SQL statement used to initialize this connection.
    NoteWhich preferences appear depends on if you are accessing the dialog box from PUC or PDI. The additional fields available in PDI are indicated with an asterisk (*).
  4. Click Test. A success message appears if the connection is established. Click OK to close the connection test dialog box.

  5. To save the connection, click OK to close the Database Connection dialog box.

Quoting

Pentaho uses a database-specific quoting system. With this system, you can use any name or character that complies with the supported databases' naming conventions.

Both PUC and PDI contain a list of reserved words for most of the supported databases. Pentaho maintains a strict separation between the schema (the user or owner of a table) and the table name itself to correctly quote table or field names that contain one or more periods in them. Placing periods in table and field names is common practice in some ERP systems (for example, fields such as V.A.T.)

To avoid quoting-related errors, a rule stops the Pentaho software from performing quoting activity when there is a start or end quotation mark in the table or schema name. This allows you to specify the quoting mechanism yourself.

Set database-specific options

Use the Advanced option in the Database Connection dialog box to configure properties associated with how SQL is generated. With these properties, you can set a standard across all your SQL tools, ETL tools, and design tools.

Procedure

  1. Open the Database Connection dialog box in PUC or PDI.

  2. Click Options in the left pane.

    The Parameters table appears as shown as shown below:Options tab in the PUC (left) and PDI (right)         Database Connection dialog boxes
  3. In the next available row of the Parameters table, enter a valid parameter name and its corresponding value. For JDBC database-specific configuration help, click Help.

    A new browser window opens and displays additional information about configuring the JDBC connection for the database type that is currently selected in the General pane.
  4. Click Test.

    A success message appears if the connection is established.
  5. Click OK to close the connection test dialog box.

  6. To save the connection, click OK to close the Database Connection dialog box.

Define connection pooling

You can use the Pooling option in the Database Connection dialog box to set up a connection pool and define options like the initial pool size, maximum pool size, and connection pool parameters. By default, a connection remains open for each individual report or set of reports in PUC and for each individual step in a transformation in PDI. For example, you might start by specifying a pool of ten or fifteen connections, and as you run reports in PUC or transformations in PDI, the unused connections drop off. Pooling helps control database access, especially if you have dashboards that contain many reports and require a large number of connections. Pooling can also be implemented when your database licensing restricts the number of active concurrent connections.

Perform the following steps to specify pooling options:

Procedure

  1. Open the Database Connection dialog box in PUC or PDI.

  2. Click Pooling in the left pane.

    Options appear for your JDBC driver as shown in the example below:Pooling options in the PUC and         PDI        Database Connection dialog boxesThe following table shows an example of Pooling options that might be available in a typical JDBC driver. Check your driver documentation for driver-specific pooling details.
    OptionDescription
    Enable Connection PoolingEnables connection pooling.
    Pool Size
    • Initial

      Set the initial size of the connection pool.

    • Maximum

      Set the maximum number of connections in the connection pool.

    Parameters

    You can define additional custom pool parameters. Click on any parameter to view a short description of that parameter. Click Restore Defaults when to restore the default values for selected parameters.

    The most commonly-used parameter is validationQuery. The parameter differs slightly depending on your RDBMS connection. The basic set of Pentaho databases use the following values for validationQuery:

    • For Oracle and PostgreSQL, use Select 1 from dual.
    • For MS SQL Server and MySQL, use Select 1.
    DescriptionEnter a description for your parameters.
  3. Click Test.

    A success message appears if the connection is established.
  4. Click OK to close the connection test dialog box.

  5. To save the connection, click OK to close the Database Connection dialog box.

Connect to clusters (PDI only)

Use the Clustering options in the Database Connection dialog box to cluster the database connection and create connections to data partitions in PDI. To create a new connection to a data partition, enter a Partition ID, the Host Name, the Port, the Database Name, User Name, and Password for the connection.

If you have the Pentaho Server configured in a cluster of servers, and use the Data Source Wizard(DSW) in PUC to add a new data source, the new data source will only be seen on the cluster node where the user has a session. For the new data source to be seen by all the cluster nodes, you must disable DSW data source caching. This may cause the loading of the data source list to be slower since the list is not cached.

To disable the cache, navigate to the server/pentaho-server/pentaho-solutions/system folder and set the enableDomainIdCache value in the system.properties file to false.

Modify connections

You can perform other various database-related connection tasks from either PUC or PDI.

Modify connections from PUC

Access other database-related connection tasks in PUC through the More actions and options menu in the Manage Data Sources dialog box, as shown below:

More actions and options menu in the PUC             Manage Data Sources dialog box

The following table describes these tasks:

TaskDescription
ExportExport database connection information to an XML file.
Import AnalysisImport data values from a Mondrian file into a data source.
Import MetadataImport data values from an XMI metadata file into a data source.

Modify connections from PDI

Access other database-related connection tasks in PDI by right-clicking on the connection name in the View tab of the Explorer pane, as shown below:

Other database-related tasks

The following table describes these tasks:

TaskDescription
DuplicateDuplicate the database connection. The duplicate will not be created unless you specify a different Connection Name in the Database Connection dialog box when it appears.
Copy to clipboardCopy the XML defining the step to the clipboard.
SQL EditorExecute SQL commands against an existing connection within the SQL Editor.
Clear DB CacheClear out the database cache used by PDI to speed up connections. This command is commonly used when databases tables have been changed, created, or deleted (when the information in the cache no longer represents the layout of the database).
ShareShare the connection information among transformations and jobs.
ExploreUse the Database Explorer to explore the schemas and tables of your connected database.
Show dependencesShow all of the transformations and jobs that use this database connection.

Delete connections

If you no longer need a connection, you can delete it.

Delete connections from PUC

Perform the following steps to delete a connection in PUC:

CautionDeleting a connection affects all reports, charts, dashboards, and other content that are associated with the connection.

Procedure

  1. Click File Manage Data Sources.

    The Manage Data Sources dialog box appears.
  2. Highlight the name of the data source you want to delete and select Delete in the More actions and options menu.

    The data source no longer appears in the list of data sources.

Delete connections from PDI

Perform the following steps to delete a connection in PDI:

Procedure

  1. Expand the Database connections folder in the View tab of the Explorer pane.

  2. Right-click on a connection name and select Delete.

    The data source no longer appears under the Database connections folder.