Pentaho Documentation

Set up the Pentaho Server to connect to a Hadoop cluster

This article is for IT administrators who need to configure Pentaho to connect to a Hadoop cluster for teams working with Big Data.

Pentaho can connect to Cloudera Distribution for Hadoop (CDH), Google Dataproc, Hortonworks Data Platform (HDP), and Amazon Elastic MapReduce (EMR). Pentaho also supports related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark. You can connect to clusters and services from these Pentaho components:

  • PDI client (Spoon)
  • Pentaho Server
  • Analyzer
  • Pentaho Interactive Reports
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)

You can configure the Pentaho Server to connect to a Hadoop cluster through a compatibility layer called a driver. Pentaho regularly develops and releases new drivers to support newer Hadoop distributions and versions. To see which drivers are supported for this version of Pentaho, see the Components Reference.

When drivers for new Hadoop versions are released, you can download them from the Pentaho Customer Support Portal and then add them to Pentaho to connect to the new Hadoop distributions. For more information about downloading and adding a new driver, see Adding a new driver.

Pentaho ships with drivers for Hortonworks, Amazon EMR, Google Dataproc, and Cloudera that you can install for the Pentaho Server. Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to.

Note: If you are using the Pentaho Metadata Editor or Pentaho Report Designer, the drivers are already installed.

To learn about additional configurations for a specific distribution, click one of the following links:

Install a driver for the Pentaho Server

Before you can add a named connection to a cluster, you must install a driver for the vendor and version of the Hadoop cluster that you are connecting to. This task assumes that you have downloaded your driver from the Pentaho Customer Support Portal or that you are using a driver for Hortonworks, Amazon EMR, Google Dataproc, or Cloudera that is shipped with Pentaho.

Perform the following steps to install a driver for the Pentaho Server.

Procedure

  1. Verify that you are connected to a repository.

  2. In the PDI client, select the View tab of your transformation or job.

  3. Right-click the Hadoop clusters folder and click Add driver.

    The Add driver dialog box appears.
  4. Click Browse.

    The Choose File to Upload dialog box appears.
  5. Navigate to the <pentaho home>/server/pentaho-server/pentaho-solutions/ADDITIONAL-FILES/drivers directory, where <pentaho home> is the directory where Pentaho is installed.

  6. Select the driver (.kar file) you want to add, click Open, and then click Next.

    The selected file name appears in the Browse text field. The vendor distribution files contain their abbreviations in the .kar file names as shown below:
    • Cloudera (cdh)
    • Hortonworks (hdp)
    • Amazon EMR (emr)
    • Google Dataproc (dataproc)
  7. Click Next.

    The Congratulations dialog box appears, notifying you that you must restart the Pentaho Server and the PDI client. The installed driver is now available for selection in the Driver field in the New cluster and Import cluster dialog boxes.
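
If you are unsure which vendor a given .kar file belongs to, the abbreviations listed in step 6 can be checked programmatically. The following is a minimal sketch, not part of Pentaho: the function names `vendor_for_kar` and `list_drivers` are hypothetical, and the sketch assumes the abbreviation appears as a plain substring of the file name.

```python
from pathlib import Path

# Abbreviation-to-vendor mapping taken from the list in step 6.
VENDOR_ABBREVIATIONS = {
    "cdh": "Cloudera",
    "hdp": "Hortonworks",
    "emr": "Amazon EMR",
    "dataproc": "Google Dataproc",
}


def vendor_for_kar(file_name):
    """Return the vendor whose abbreviation appears in a .kar file name, or None.

    Assumption: the abbreviation is a plain substring of the file name.
    """
    name = Path(file_name).name.lower()
    if not name.endswith(".kar"):
        return None
    for abbreviation, vendor in VENDOR_ABBREVIATIONS.items():
        if abbreviation in name:
            return vendor
    return None


def list_drivers(drivers_dir):
    """Map each .kar file in drivers_dir to its vendor (None if unrecognized)."""
    kar_files = sorted(Path(drivers_dir).glob("*.kar"))
    return {path.name: vendor_for_kar(path.name) for path in kar_files}
```

For example, calling `list_drivers` on the ADDITIONAL-FILES/drivers directory from step 5 would return a dictionary of the shipped driver files and their vendors, which can help you pick the right file in the Choose File to Upload dialog box.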

Manually install a driver for the Pentaho Server

You can manually install a driver for the Pentaho Server, even when you are not connected to the Pentaho Server with the PDI client. This task assumes that you have downloaded your driver from the Pentaho Customer Support Portal or that you are using a driver for Hortonworks, Amazon EMR, Google Dataproc, or Cloudera that is shipped with Pentaho.

Perform the following steps to manually install a driver for the Pentaho Server.

Procedure

  1. Navigate to the <pentaho home>/server/pentaho-server/pentaho-solutions/ADDITIONAL-FILES/drivers directory, where <pentaho home> is the directory where Pentaho is installed.

  2. Select the driver (.kar file) you want to add and copy it to the <pentaho home>/server/pentaho-server/pentaho-solutions/drivers directory on the machine with the Pentaho Server.

    The vendor distribution files contain their abbreviations in the .kar file names as shown below:
    • Cloudera (cdh)
    • Hortonworks (hdp)
    • Amazon EMR (emr)
    • Google Dataproc (dataproc)
  3. Restart the Pentaho Server.
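
The copy in steps 1 and 2 can also be scripted, which may be useful when you install the same driver on several servers. The following is a minimal sketch under stated assumptions: `install_driver` is a hypothetical helper, not a Pentaho utility, and it creates the target drivers directory if it does not already exist. It does not restart the Pentaho Server, so step 3 still applies.

```python
import shutil
from pathlib import Path


def install_driver(pentaho_home, kar_name):
    """Copy a driver .kar file into the Pentaho Server drivers directory.

    pentaho_home is the directory where Pentaho is installed; kar_name is
    the file name of the driver in ADDITIONAL-FILES/drivers.
    """
    solutions = Path(pentaho_home) / "server" / "pentaho-server" / "pentaho-solutions"
    source = solutions / "ADDITIONAL-FILES" / "drivers" / kar_name
    target_dir = solutions / "drivers"

    if source.suffix != ".kar" or not source.is_file():
        raise FileNotFoundError(f"No driver .kar file at {source}")

    # Assumption: create the target directory if it is missing.
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 preserves file metadata and returns the destination path.
    return Path(shutil.copy2(source, target_dir / kar_name))
```

Call it with your installation directory and the file name of the driver you selected, for example `install_driver("/opt/pentaho", "<driver file>.kar")`, where `/opt/pentaho` is only a placeholder for <pentaho home>.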