Pentaho Documentation

Set Up Pentaho to Connect to a Cloudera Cluster

Overview

These instructions explain how to configure Pentaho's Cloudera shim so Pentaho can connect to a working Cloudera Distribution for Hadoop (CDH) cluster.

Before You Begin

Before you begin, you'll need to do a few things.

  1. Verify Support
    Check the Components Reference to verify that your Pentaho version supports your version of the CDH cluster.
     
  2. Set Up a CDH cluster
    1. Configure a CDH cluster.  See Cloudera's documentation if you need help.

    2. Install any required services and service client tools.

    3. Test the cluster.
       

  3. Get Connection Information
    Get the connection information for the cluster and services that you will use from your Hadoop Administrator, Cloudera Manager, or other cluster management tool.  You'll also need to supply some of this information to users once you are finished. 
     
  4. Add a YARN User to the Superuser Group
    Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find the dfs.permissions.superusergroup property in the hdfs-site.xml file on your cluster or in Cloudera Manager.
     
  5. Review the Version-Specific Notes Section
    Read the Version-Specific Notes section to review special configuration instructions for your version of CDH.

If you are connecting to a secured CDH cluster, there are a few additional things you need to do.

  1. Secure the Cloudera Cluster with Kerberos
    Pentaho supports Kerberos authentication.  You will need to:
    1. Configure Kerberos security on the cluster, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.

    2. Configure the name, data, secondary name, job tracker, and task tracker nodes to accept remote connection requests.

    3. Set up Kerberos for name, data, secondary name, job tracker, and task tracker nodes if you have deployed CDH using an enterprise-level program.

    4. Add user account credentials to the Kerberos database for each Pentaho user that needs access to the Hadoop cluster.  Also, make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database. Add operating system user accounts if necessary. Note that the user account UIDs should be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
       

  2. Set up Kerberos on your Pentaho computers
    Instructions for how to do this appear in the article Set up Kerberos on Your Pentaho Computer.

Edit Configuration Files on Clusters

This section describes the Pentaho-specific edits to make to configuration files on the cluster.

Oozie

By default, Oozie jobs are run by the Oozie user.  However, if you use PDI to start an Oozie job, you must add the PDI user to the oozie-site.xml file on the cluster so that the PDI user can execute the program in proxy. If you plan to use the Oozie service, complete these instructions:

  1. Open the oozie-site.xml file on the cluster.
  2. Add the following lines to the oozie-site.xml file on the cluster, replacing <your_pdi_user_name> with the PDI user's username, such as jdoe.
<property>
<name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
<value>*</value>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
<value>*</value>
</property>
  3. Save and close the file.
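
The two proxy-user entries differ only in their final suffix, so they can be generated from the username. The following Python sketch is illustrative only: the function name oozie_proxyuser_xml is hypothetical and is not part of Pentaho or Oozie. It renders the same two properties listed in the steps above for the example user jdoe.

```python
from xml.sax.saxutils import escape


def oozie_proxyuser_xml(pdi_user, value="*"):
    """Return the two oozie-site.xml <property> entries for one PDI user."""
    blocks = []
    for suffix in ("groups", "hosts"):
        name = "oozie.service.ProxyUserService.proxyuser.%s.%s" % (pdi_user, suffix)
        blocks.append(
            "<property>\n"
            "<name>%s</name>\n"
            "<value>%s</value>\n"
            "</property>" % (escape(name), escape(value))
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # jdoe is the example username used in the instructions.
    print(oozie_proxyuser_xml("jdoe"))
```

Paste the printed text into oozie-site.xml by hand. The sketch does not edit the file for you.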

Configure Pentaho Component Shims

You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster:

  • Spoon (PDI Client)
  • Pentaho Server, including Analyzer and Pentaho Interactive Reporting.
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)

As a best practice, configure the shim in Spoon first.  Spoon has features that will help you test your configuration.  Then copy the tested Spoon configuration files to other components, making changes if necessary. 

You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from Spoon.  If you do not plan to connect to the cluster from Spoon, you can configure the shim in another component first instead.

Step 1: Locate the Pentaho Big Data Plugin and Shim Directories

Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory.  The path to this directory differs by component. You need to know the location of this directory in each component to complete shim configuration and testing tasks.

<pentaho home> is the directory where Pentaho is installed.

Components Location of Pentaho Big Data Plugin Directory
Spoon <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin
Pentaho Server <pentaho home>/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
Pentaho Report Designer <pentaho home>/design-tools/report-designer/plugins/pentaho-big-data-plugin
Pentaho Metadata Editor <pentaho home>/design-tools/metadata-editor/plugins/pentaho-big-data-plugin

Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory.  Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the Hadoop distribution's version number.  The version number does not contain a decimal point.  For example, the shim directory named cdh54 is the shim for CDH (Cloudera Distribution for Hadoop) version 5.4.  Here is a list of the shim directory abbreviations.

Abbreviation Shim
cdh Cloudera's Distribution of Apache Hadoop
emr Amazon Elastic Map Reduce
hdi Microsoft Azure HDInsight
hdp Hortonworks Data Platform
mapr MapR
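
If you need to recognize shim directories in a script, the naming rule above can be sketched in Python as follows. This is a minimal illustration, not a Pentaho utility: parse_shim_dir is a hypothetical name, and the version digits are returned exactly as written because the position of the removed decimal point cannot be recovered from the directory name alone.

```python
import re

# Abbreviations taken from the table above.
SHIM_DISTRIBUTIONS = {
    "cdh": "Cloudera's Distribution of Apache Hadoop",
    "emr": "Amazon Elastic Map Reduce",
    "hdi": "Microsoft Azure HDInsight",
    "hdp": "Hortonworks Data Platform",
    "mapr": "MapR",
}

# Three or four letters followed by the version digits, for example cdh54.
_SHIM_NAME = re.compile(r"^([a-z]{3,4})(\d+)$")


def parse_shim_dir(name):
    """Split a shim directory name into (abbreviation, version digits, distribution)."""
    match = _SHIM_NAME.match(name)
    if not match or match.group(1) not in SHIM_DISTRIBUTIONS:
        raise ValueError("not a recognized shim directory name: %r" % name)
    abbreviation, digits = match.groups()
    return abbreviation, digits, SHIM_DISTRIBUTIONS[abbreviation]
```

For example, parse_shim_dir("cdh54") returns the abbreviation cdh, the digits 54, and the Cloudera distribution name.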

Step 2: Select the Correct Shim

Although Pentaho often supports more than one version of a Hadoop distribution, the Pentaho suite download contains only the latest supported, Pentaho-certified version of the shim.  You can download the other supported shim versions from the Pentaho Customer Support Portal.

Before you begin, verify that the shim you want is supported by your version of Pentaho shown in the Components Reference.

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories. If the shim you want to use is already there, you can go to Step 3: Copy the Configuration Files from Cluster to Shim.  Otherwise, continue with the next step to download it.
  2. On the Customer Portal home page, sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet. 
  3. In the search box, enter the name of the shim you want. Select the shim from the search results. Optionally, you can browse the shims by version on the Downloads page. 
  4. Read all prerequisites, warnings, and instructions. On the bottom of the page in the Box widget, click the shim zip file to download it. 
  5. Unzip the downloaded shim package to the pentaho-big-data-plugin/hadoop-configurations directory.

Step 3: Copy the Configuration Files from Cluster to Shim

Copying configuration files from the cluster to the shim helps keep key configuration settings in sync with the cluster and reduces configuration errors.

  1. Back up the CDH shim files in the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory.
  2. Copy the following configuration files from the cluster to the Pentaho shim directory.  You should overwrite the existing Pentaho shim files.
  • core-site.xml
  • hbase-site.xml
  • hdfs-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml
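
The backup and copy steps above can be scripted. The Python sketch below is one possible approach under stated assumptions: it assumes you have already transferred the cluster's configuration files to a local directory, and the function name refresh_shim and all directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# The six files listed in step 2 above.
SITE_FILES = (
    "core-site.xml",
    "hbase-site.xml",
    "hdfs-site.xml",
    "hive-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
)


def refresh_shim(cluster_conf_dir, shim_dir, backup_dir):
    """Back up the shim directory, then overwrite its *-site.xml files.

    Returns the names of files that were not found in cluster_conf_dir.
    """
    cluster_conf_dir = Path(cluster_conf_dir)
    shim_dir = Path(shim_dir)
    # Step 1: keep a full copy of the shim before changing it.
    # copytree raises an error if backup_dir already exists, so an
    # earlier backup is never silently overwritten.
    shutil.copytree(shim_dir, Path(backup_dir))
    missing = []
    # Step 2: copy each cluster file over the shim's copy.
    for name in SITE_FILES:
        source = cluster_conf_dir / name
        if source.is_file():
            shutil.copy2(source, shim_dir / name)
        else:
            missing.append(name)
    return missing
```

A non-empty return value tells you which files still need to be collected from the cluster, for example because a service such as HBase or Hive is not installed there.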

Step 4: Edit the Shim Configuration Files

You need to verify or change settings in authentication, Oozie, Hive, MapReduce, and YARN in these shim configuration files:

  • core-site.xml
  • config.properties
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

Edit config.properties (Unsecured Cluster)

If you are connecting to an unsecured cluster, verify that these values are properly set.  Set the Oozie proxy user if needed.

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the config.properties file.
  2. Add the following values:
Parameter Values
authentication.superuser.provider NO_AUTH
pentaho.oozie.proxy.user Add the proxy user's name if you plan to access the Oozie service through a proxy.  Otherwise, leave it set to oozie.
  3. Save and close the file.

Edit config.properties (Secured Clusters)

If you are connecting to a secure cluster, add Kerberos information to the config.properties file. If you plan to use secure impersonation to access your cluster, see Use Secure Impersonation to Access a Cloudera Cluster before editing the config.properties file.

Perform the following steps to add Kerberos information to the config.properties file: 

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the config.properties file.
  2. Add these values:
Parameter Values
authentication.superuser.provider cdh-kerberos (This should be the same as the authentication.kerberos.id.)
authentication.kerberos.principal Set the Kerberos principal.
authentication.kerberos.password Set the Kerberos password.  You only need to set the password or the keytab, not both.
authentication.kerberos.keytabLocation Set the location of the Kerberos keytab. You only need to set the password or the keytab, not both.
pentaho.oozie.proxy.user Add the proxy user's name if you plan to access the Oozie service through a proxy.  Otherwise, leave it set to oozie.
  3. Save and close the file.
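
Before you test the connection, you may want to check the Kerberos entries for obvious omissions. The Python sketch below is a simplified illustration, not a Pentaho tool: parse_properties handles only plain key=value lines (it ignores other features of the Java properties format, such as line continuations), and both function names are hypothetical. The checks mirror the table above.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def kerberos_notes(props):
    """Return human-readable notes about the Kerberos settings in config.properties."""
    notes = []
    # The provider should be the same as authentication.kerberos.id.
    if props.get("authentication.superuser.provider") != props.get("authentication.kerberos.id"):
        notes.append("authentication.superuser.provider should match authentication.kerberos.id")
    if not props.get("authentication.kerberos.principal"):
        notes.append("authentication.kerberos.principal is not set")
    has_password = bool(props.get("authentication.kerberos.password"))
    has_keytab = bool(props.get("authentication.kerberos.keytabLocation"))
    if not has_password and not has_keytab:
        notes.append("set either the password or the keytab location")
    elif has_password and has_keytab:
        notes.append("both password and keytab location are set; only one is needed")
    return notes
```

An empty list means none of these checks found a problem. It does not prove that the principal, password, or keytab is valid on the cluster.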

Edit hive-site.xml

Follow these instructions to set the location of the Hive metastore in the hive-site.xml file:

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the hive-site.xml file.
  2. Add these values:
Parameter Value
hive.metastore.uris Set this to the location of your Hive metastore if it differs from what is on the cluster.
  3. Save and close the file.
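
The hive.metastore.uris value is a comma-separated list of thrift:// URIs, so a quick format check can catch typing mistakes before you test the connection. The following Python sketch is illustrative: metastore_endpoints is a hypothetical name, and any hostnames or ports you pass to it are your own values.

```python
from urllib.parse import urlsplit


def metastore_endpoints(uris_value):
    """Split a hive.metastore.uris value into (host, port) pairs."""
    endpoints = []
    for uri in uris_value.split(","):
        uri = uri.strip()
        if not uri:
            continue
        parts = urlsplit(uri)
        # Each entry is expected to look like thrift://host:port.
        if parts.scheme != "thrift" or not parts.hostname or parts.port is None:
            raise ValueError("expected thrift://host:port, got %r" % uri)
        endpoints.append((parts.hostname, parts.port))
    return endpoints
```

The function only validates the shape of the value. It does not contact the metastore.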

Edit mapred-site.xml

Edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms. 

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the  mapred-site.xml file.
  2. Verify the mapreduce.jobhistory.address and mapreduce.app-submission.cross-platform properties are in the mapred-site.xml file. If they are not in the file, add them as follows.
Parameter Value
mapreduce.jobhistory.address Set this to the place where job history logs are stored.
mapreduce.app-submission.cross-platform Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms.

<property>
   <name>mapreduce.app-submission.cross-platform</name>
   <value>true</value>
</property>
  3. Save and close the file.
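
To confirm that both properties are present after editing, you can read the file with a short script. The Python sketch below is a minimal example that uses only the standard library; site_properties and missing_mapred_properties are hypothetical names, and the sketch assumes the usual layout of property elements with name and value children.

```python
import xml.etree.ElementTree as ET

REQUIRED = (
    "mapreduce.jobhistory.address",
    "mapreduce.app-submission.cross-platform",
)


def site_properties(xml_text):
    """Return the name -> value pairs from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_mapred_properties(xml_text):
    """List the required mapred-site.xml properties that are absent."""
    props = site_properties(xml_text)
    return [name for name in REQUIRED if name not in props]
```

Pass the text of the shim's mapred-site.xml file to missing_mapred_properties. An unclosed tag makes ET.fromstring raise a ParseError, which is also a useful signal that the file is malformed.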

Edit yarn-site.xml

Make changes to these YARN parameters:

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the yarn-site.xml file.
  2. Add the following values:
Parameter Values
yarn.application.classpath Add the classpaths you need to run YARN applications.  Use commas to separate multiple paths.  
yarn.resourcemanager.hostname Change to the hostname of the resource manager in your environment.
yarn.resourcemanager.address Change to the hostname and port for your environment.
yarn.resourcemanager.admin.address Change to the hostname and port for your environment.
  3. Save and close the file.

Create a Connection to the CDH Cluster

Creating a connection to the cluster involves setting an active shim, then configuring and testing the connection to the cluster.  Making a shim active means it is used by default when you access a cluster.  When you initially install Pentaho, no shim is active by default.  You must choose a shim to make active before you can connect to a cluster.  Only one shim can be active at a time.  The way you make a shim active, as well as the way you configure and test the cluster connection, differs by Pentaho component.

Create and Test a Connection to the Cluster in Spoon

Creating and testing a connection to the CDH cluster from Spoon involves two tasks:

  • Setting the active shim in Spoon
  • Configuring and testing the cluster connection

Set the Active Shim in Spoon

You must set an active shim when you want to connect to a Hadoop cluster for the first time, or when you want to switch clusters.  To set a shim as active, complete the following steps:

  1. Start Spoon.
  2. Select Hadoop Distribution... from the Tools menu.

  3. In the Hadoop Distribution window, select the Hadoop distribution you want.
  4. Click OK.
  5. Stop, then restart Spoon.

Configure and Test the Cluster Connection

You must provide connection details for the cluster and services you will use, such as the hostname for HDFS or the URL for Oozie.  Then, you can use a built-in tool to test your configuration to find and troubleshoot common configuration issues, such as wrong hostnames and user permission errors.

Connection settings are set in the Hadoop cluster window.  You can get to the settings from several places, but in these instructions, you will open the Hadoop cluster window from the View tab in a transformation or job. Complete the following steps to configure and test a connection:

  1. In the PDI client, create a new job or transformation or open an existing one.
  2. Click the View tab.

  3. Right-click the Hadoop clusters folder, then click New.  The Hadoop cluster window appears.
  4. Enter the information from the following table in the Hadoop cluster window.  You can get this information from your Hadoop Administrator.

As a best practice, use Kettle variables for each connection parameter value to mitigate risks associated with running jobs and transformations in environments that are disconnected from the repository. 

Option Definition
Cluster Name Name that you assign the cluster connection.
Storage Specifies the type of storage you want to use for this connection. Use the drop-down box to select one of the following:
  • HDFS: Hadoop Distributed File System, which is typically used for connecting to a Hadoop cluster. This is the default storage selection.
  • MapR: MapR Converged Data Platform. When selected, the fields in the storage and JobTracker sections are disabled because these parameters are not needed to configure MapR.
  • WASB: Windows Azure Storage Blob, which is only available for connecting to Azure HDInsight.
Hostname (in selected storage section) Hostname for the HDFS or WASB node in your Hadoop cluster.
Port (in selected storage section) Port for the HDFS or WASB node in your Hadoop cluster.  If your cluster has been enabled for high availability (HA), then you do not need a port number. Clear the port number.
Username (in selected storage section) Username for the HDFS or WASB node.
Password (in selected storage section) Password for the HDFS or WASB node.
Hostname (in JobTracker section) Hostname for the JobTracker node in your Hadoop cluster.  If you have a separate job tracker node, type in the hostname here.
Port (in JobTracker section) Port for the JobTracker in your Hadoop cluster.
Hostname (in ZooKeeper section) Hostname for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
Port (in ZooKeeper section) Port for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
URL (in Oozie section) Oozie client address.  Supply this only if you want to connect to the Oozie service.
  5. Click the Test button.  Test results appear in the Hadoop Cluster Test window.  If there are no errors, the connection is properly configured. If you have errors, see the Troubleshoot Cluster and Service Configuration Issues section below to resolve the issues, then test again.

  6. Click Close on the Hadoop Cluster Test window, then click OK to close the Hadoop cluster window.

Copy Spoon Shim Files to Other Pentaho Components

Once your connection has been properly configured in Spoon, you can copy the configuration files to the shim directories in the other Pentaho components. Copy the following configuration files from the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory in Spoon to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory on the Pentaho Server, PRD, or PME:

  • hbase-site.xml
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

Connect Other Pentaho Components to the Cloudera Cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in these components involves two tasks:

  • Set the active shim on PRD, PME, and the Pentaho Server
  • Create and test the cluster connections

Set the Active Shim on PRD, PME, and the Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server, PRD, and PME.

  1. Stop the component.
  2. Locate the pentaho-big-data-plugin directory for your component. 
  3. Navigate to the hadoop-configurations directory and note the directory name of the shim you want to make active.
  4. Return to the pentaho-big-data-plugin directory and open the plugin.properties file.
  5. Set the active.hadoop.configuration property to the directory name of the shim you want to make active.  Here is an example:
active.hadoop.configuration=cdh54
  6. Save and close the plugin.properties file.
  7. Restart the component.
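
If you maintain several components, you might prefer to script the property change. The following Python sketch shows one way to rewrite the text of plugin.properties; set_active_shim is a hypothetical helper, not a Pentaho utility, and it assumes the property appears at most once on its own line.

```python
import re

# Matches an existing active.hadoop.configuration line, whatever its value.
_ACTIVE = re.compile(r"^[ \t]*active\.hadoop\.configuration[ \t]*=.*$", re.MULTILINE)


def set_active_shim(properties_text, shim_name):
    """Return plugin.properties text with active.hadoop.configuration set to shim_name."""
    line = "active.hadoop.configuration=%s" % shim_name
    if _ACTIVE.search(properties_text):
        # Replace the existing line in place and leave everything else untouched.
        return _ACTIVE.sub(lambda _match: line, properties_text, count=1)
    # Otherwise append the property on a new line at the end of the file.
    needs_newline = bool(properties_text) and not properties_text.endswith("\n")
    return properties_text + ("\n" if needs_newline else "") + line + "\n"
```

Read the file, pass its contents and the shim directory name (for example, cdh54) to the function, and write the result back while the component is stopped.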

Create and Test Connections

Connection tests appear in the following table.

Component Test
Pentaho Server for DI Create a transformation in Spoon and run it remotely.
Pentaho Server for BA Create a connection to the cluster in the Data Source Wizard.
PME Create a connection to the cluster in PME.
PRD Create a connection to the cluster in PRD.

Once you've connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services.  Those users can only obtain access from computers that have been properly configured to connect to the cluster.

Here is what they need to connect:

  • Hadoop distribution and version of the cluster
  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
  • Oozie URL (if used)
  • Users also require the appropriate permissions to access the directories they need on HDFS.  This typically includes their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use.  Here's a more detailed list of information that your users might need from you.

General Notes

Set Hive Database Connection Parameters (Secured Clusters Only)

To access Hive, you need to set several database connection parameters from within Spoon.

  1. Verify that valid Kerberos principal values have been set for hive.metastore.kerberos.principal and hive.server2.authentication.kerberos.principal in hive-site.xml, and note these values.

  2. Start Spoon.

  3. In Spoon, open the Database Connection window.

  4. Click Options.

  5. Add the principal parameter and set it to the value that you noted in the hive-site.xml file. The principal typically looks like hive/HiveServer2.pentaho.com@mydomain.

  6. Click OK to close the window.

Sqoop "Unsupported major.minor version" Error

If you are using Pentaho 6.0 and the Java version on your cluster is older than the Java version that Pentaho uses, you must change Pentaho's JDK so it is the same major version as the JDK on the cluster. The JDK that you install for Pentaho must meet the requirements in the Supported Components matrix. To learn how to download and install the JDK, read this article.

Version-Specific Notes

The following are special topics for CDH.

CDH 5.4 Notes

The following notes address issues with CDH 5.4.

Simba Driver Support Note

If you are using Pentaho 6.0 or later, the CDH 5.4 shim supports the Cloudera JDBC Simba driver: Impala JDBC Connector 2.5.28 for Cloudera Enterprise. This replaces the Apache Hive JDBC driver that was supported in previous versions of the CDH 5.4 shim.

In the Database connection window, you will need to select the Cloudera Impala option. If Impala is secured on your cluster, you also need to supply KrbHostFQDN, KrbServiceName, and KrbRealm in the Options tab. For more information on how to set up a database connection see the database connection articles at help.pentaho.com. 

You will need to install the driver in the shim directory for each Pentaho component (e.g., Spoon, Pentaho Server, PRD) you want to use.  

  1. Download the Impala JDBC Connector 2.5.28 for Cloudera Enterprise driver.
  2. Copy the ImpalaJDBC41.jar to the pentaho-big-data-plugin/hadoop-configurations/cdhxx/lib directory.
  3. Stop and restart the component.

CDH 5.3 Notes

The following notes address issues with CDH 5.3.

Configuring High Availability for CDH 5.3

If you are configuring CDH 5.3 to be used in High Availability mode, we recommend that you use the Cloudera Manager "Download Client Configuration" feature. The Download Client Configuration feature provides a convenient way to get configuration files from the cluster for a service (such as HBase, HDFS, or YARN). Use this feature to download the configuration zip files, then unzip them to the pentaho-big-data-plugin/hadoop-configurations/cdh53 directory.

For more information on how to do this, see Cloudera documentation: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_client_config.html

Troubleshoot Cluster and Service Configuration Issues

This section explains how to resolve common configuration problems.

Shim and Configuration Issues

Symptom: No shim
Common Causes:
  • Active shim was not selected.
  • Shim was installed in the wrong place.
  • Shim name was not entered correctly in the plugin.properties file.
Common Resolutions:
  • Verify that the plugin name that is in the plugin.properties file matches the directory name in the pentaho-big-data-plugin/hadoop-configurations directory.
  • Make sure the shim is installed in the correct place.
  • Check the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article for more details on how to verify the plugin name and shim installation directory.

Symptom: Shim doesn't load
Common Causes:
  • Required licenses are not installed.
  • You tried to load a shim that is not supported by your version of Pentaho.
  • If you are using MapR, the client might not have been installed correctly.
  • Configuration file changes were made incorrectly.
Common Resolutions:
  • Verify the required licenses are installed and have not expired.
  • Verify that the shim is supported by your version of Pentaho. Find your version of Pentaho, then look for the corresponding Components Reference for more details.
  • Verify that configuration file changes were made correctly.  Contact your Hadoop Administrator or see the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article.
  • If you are connecting to MapR, verify that the client was properly installed.  See MapR documentation for details.
  • Restart Spoon, then test again.
  • If this error continues to occur, files might be corrupted.  Download a new copy of the shim from the Pentaho Customer Support Portal.

Symptom: The file system's URL does not match the URL in the configuration file.
Common Causes:
  • Configuration files (*-site.xml files) were not configured properly.
Common Resolutions:
  • Verify that the configuration files were configured correctly.
  • Verify that the core-site.xml file is configured correctly.  See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article for details.

Connection Problems

Symptom: Hostname incorrect or not resolving properly.
Common Causes:
  • No hostname has been specified.
  • Hostname/IP Address is incorrect.
  • Hostname is not resolving properly in the DNS.
Common Resolutions:
  • Verify that the Hostname/IP address is correct.
  • Check the DNS to make sure the Hostname is resolving properly.

Symptom: Port name is incorrect.
Common Causes:
  • Port number is incorrect.
  • Port number is not numeric.
  • The port number is not necessary for HA clusters.
  • No port number has been specified.
Common Resolutions:
  • Verify that the port number is correct.
  • Determine whether your cluster has been enabled for high availability (HA). If it has, then you do not need a port number. Clear the port number and retest the connection.

Symptom: Can't connect.
Common Causes:
  • Firewall is a barrier to connecting.
  • Other networking issues are occurring.
Common Resolutions:
  • Verify that a firewall is not impeding the connection and that there aren't other network issues.
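
When you work through the hostname, port, and firewall checks above, a small script run from the Pentaho computer can separate name-resolution failures from blocked or closed ports. The Python sketch below uses only the standard socket module. check_endpoint is a hypothetical helper, and the host and port you pass in are whatever your Hadoop Administrator gave you.

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Return None if a TCP connection succeeds, else a short description of the failure."""
    try:
        # First confirm that the name resolves at all.
        socket.getaddrinfo(host, port)
    except socket.gaierror as error:
        return "hostname does not resolve: %s" % error
    try:
        # Then try to open (and immediately close) a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as error:
        return "cannot connect: %s" % error
```

A successful TCP connection only shows that something is listening on that port. It does not confirm that the service behind it is the one you expect or that your user has permission to use it.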

Directory Access or Permissions Issues

Symptom: Can't access directory.
Common Causes:
  • Authorization and/or authentication issues.
  • Directory is not on the cluster.
Common Resolutions:
  • Make sure the user has been granted read, write, and execute access to the directory.
  • Ensure security settings for the cluster and shim allow access.
  • Verify the hostname and port number are correct for the Hadoop File System's namenode.

Symptom: Can't create, read, update, or delete files or directories
Common Causes:
  • Authorization and/or authentication issues.
Common Resolutions:
  • Make sure the user has been granted execute access to the directory.
  • Ensure security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop File System's namenode.

Symptom: Test file cannot be overwritten.
Common Causes:
  • Pentaho test file is already in the directory.
Common Resolutions:
  • A file with the same name as the Pentaho test file is already in the directory.  The test file is used to make sure that the user can create, write, and delete in the user's home directory.
  • The test was run, but the file was not deleted.  You will need to manually delete the test file.  Check the log for the test file name.

Oozie Issues

Symptom: Can't connect to Oozie.
Common Causes:
  • Firewall issue.
  • Other networking issues.
  • Oozie URL is incorrect.
Common Resolutions:
  • Verify that the Oozie URL was correctly entered.
  • Verify that a firewall is not impeding the connection.

ZooKeeper Problems

Symptom: Can't connect to ZooKeeper.
Common Causes:
  • Firewall is hindering connection with the ZooKeeper service.
  • Other networking issues.
Common Resolutions:
  • Verify that a firewall is not impeding the connection.

Symptom: ZooKeeper hostname or port not found or doesn't resolve properly.
Common Causes:
  • Hostname/IP address or port number is missing or incorrect.
Common Resolutions:
  • Try to connect to the ZooKeeper nodes using ping or another method.
  • Verify that the Hostname/IP address and port numbers are correct.

Create a Connection to the CDH Cluster

Creating a connection to the cluster involves setting an active shim, then configuring and testing the connection to the cluster.  Making a shim active means it is used by default when you access a cluster.  When you initially install Pentaho, no shim is active by default.  You must choose a shim to make active before you can connect to a cluster.   Only one shim can be active at a time.  The way you make a shim active, as well as the way you configure and test the cluster connection differs by Pentaho component.

Create and Test a Connection to the Cluster in Spoon

Creating and testing a connection to the CDH cluster from Spoon involves two tasks:

  • Setting the active shim in Spoon
  • Configuring and testing the cluster connection

Set the Active Shim in Spoon

You must set an active shim when you want to connect to a Hadoop cluster the first time, or when you want to switch clusters.  To set a shim as active, complete the following steps:

  1. Start Spoon.
  2. Select Hadoop Distribution... from the Tools menu.

HadoopDistribution.png

  1. In the Hadoop Distribution window, select the Hadoop distribution you want.
  2. Click OK.
  3. Stop, then restart Spoon.

Configure and Test the Cluster Connection

You must provide connection details for the cluster and services you will use, such as the hostname for HDFS or the URL for Oozie.  Then, you can use a built-in tool to test your configuration to find and troubleshoot common configuration issues, such as wrong hostnames and user permission errors.

Connection settings are set in the Hadoop cluster window.  You can get to the settings from several places, but in these instructions, you will get the Hadoop cluster window from the View tab in a transformation or job. Complete the following steps to configure and test a connection:

  1. In Spoon, create a new job or transformation or open an existing one.
  2. Click the View tab.

clusterss.png

  1. Right-click the Hadoop cluster directory, then click New.  The Hadoop cluster window appears.  
  2. Enter the information from the following table in the Hadoop cluster window.  You can get this information from your Hadoop Administrator.

As a best practice, use Kettle variables for each connection parameter value to mitigate risks associated with running jobs and transformations in environments that are disconnected from the repository. 

HadoopClusterWindow.png

Option Definition
Cluster Name Name that you assign the cluster connection.
Use MapR Client Indicates that this connection is for a MapR cluster.  If this box is checked, the fields in the HDFS and JobTracker sections are disabled because those parameters are not needed to configure MapR.
Hostname (in HDFS section) Hostname for the HDFS node in your Hadoop cluster.
Port (in HDFS section) Port for the HDFS node in your Hadoop cluster.  
Username (in HDFS section) Username for the HDFS node.
Password (in HDFS section) Password for the HDFS node.
Hostname (in JobTracker section) Hostname for the JobTracker node in your Hadoop cluster.  If you have a separate job tracker node, type in the hostname here. Otherwise use the HDFS hostname.
Port (in JobTracker section) Port for the JobTracker in your Hadoop cluster.  Job tracker port number--this cannot be the same as the HDFS port number.
Hostname (in ZooKeeper section) Hostname for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
Port (in Zookeeper section) Port for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
URL (in Oozie section) Oozie client address.  Supply this only if you want to connect to the Oozie service.
  1. Click the Test button.  Test results appear in the Hadoop Cluster Test window.  If there are no errors, the connection is properly configured. If you have errors, see the Troubleshoot Cluster and Service Configuration Issues section below to resolve the issues, then test again.

HadoopClusterTest.png

  1. Click Close on the Hadoop Cluster Test window, then click OK to close the Hadoop cluster window.

Copy Spoon Shim Files to Other Pentaho Components

Once your connection has been properly configured on Spoon, you can copy the configuration files to the shim directories in the other Pentaho components. Copy the following configuration files from the pentaho-big-data-plugin/hadoop-configurations/hadoop-configurations/cdhxx directory in Spoon to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory on the Pentaho Server, PRD, or PME: 

  • hbase-site.xml
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
  • mapred-site.xml
  • yarn-site.xml

Connect Other Pentaho Components to the Cloudera Cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in Spoon involves two tasks:

  • Set the active shim on PRD, PME, and the Pentaho Server
  • Create and test the cluster connections

Set the Active Shim on PRD, PME, and the Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server, PRD, and PME.

  1. Stop the component.
  2. Locate the pentaho-big-data-plugin directory for your component.
  3. Navigate to the hadoop-configurations directory and note the directory name of the shim you want to make active.
  4. Return to the pentaho-big-data-plugin directory and open the plugin.properties file.
  5. Set the active.hadoop.configuration property to the directory name of the shim you want to make active.  Here is an example:
active.hadoop.configuration=cdh54
  6. Save and close the plugin.properties file.
  7. Restart the component.
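
If you maintain several components, the property edit can be scripted. This is a rough sketch, assuming plugin.properties uses simple key=value lines. The function name set_active_shim is hypothetical. The script refuses to write a shim name that has no matching directory under hadoop-configurations.

```python
from pathlib import Path

PROPERTY = "active.hadoop.configuration"


def set_active_shim(plugin_dir, shim_name):
    """Point active.hadoop.configuration at shim_name in plugin.properties."""
    plugin_dir = Path(plugin_dir)
    shim_dir = plugin_dir / "hadoop-configurations" / shim_name
    if not shim_dir.is_dir():
        # The property must name an existing shim directory.
        raise FileNotFoundError(f"No shim directory named {shim_name!r}")

    props_file = plugin_dir / "plugin.properties"
    # latin-1 is the traditional encoding of Java .properties files.
    lines = props_file.read_text(encoding="latin-1").splitlines()
    updated, found = [], False
    for line in lines:
        key = line.split("=", 1)[0].strip()
        if key == PROPERTY:
            updated.append(f"{PROPERTY}={shim_name}")
            found = True
        else:
            updated.append(line)  # comments and other properties are kept as-is
    if not found:
        updated.append(f"{PROPERTY}={shim_name}")
    props_file.write_text("\n".join(updated) + "\n", encoding="latin-1")
```

Stop the component before running the script and restart it afterward, as in the manual steps.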

Create and Test Connections

Connection tests appear in the following table.

Component | Test
Pentaho Server for DI | Create a transformation in Spoon and run it remotely.
Pentaho Server for BA | Create a connection to the cluster in the Data Source Wizard.
PME | Create a connection to the cluster in PME.
PRD | Create a connection to the cluster in PRD.

Once you've connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services.  Those users can only obtain access from computers that have been properly configured to connect to the cluster.

Here is what they need to connect:

  • Hadoop distribution and version of the cluster
  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
  • Oozie URL (if used)
  • Appropriate permissions to access the directories they need on HDFS.  This typically includes their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use.  Here's a more detailed list of information that your users might need from you.

General Notes

Set Hive Database Connection Parameters (Secured Clusters Only)

To access Hive, you need to set several database connection parameters from within Spoon.

  1. Verify that valid Kerberos principal values have been set for hive.metastore.kerberos.principal and hive.server2.authentication.kerberos.principal in hive-site.xml, and note those values.

  2. Start Spoon.

  3. In Spoon, open the Database Connection window.

  4. Click Options.

  5. Add the principal parameter and set it to the value that you noted in the hive-site.xml file. The principal typically looks like hive/HiveServer2.pentaho.com@mydomain.

  6. Click OK to close the window.
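
Spoon assembles the connection from the dialog fields, so you do not normally write a URL by hand. For reference, the sketch below shows where the principal ends up in a standard HiveServer2 JDBC URL. The helper name, the port 10000, and the database name are assumptions used for illustration.

```python
import re

# Expected shape: service/host@REALM, e.g. hive/HiveServer2.pentaho.com@mydomain
PRINCIPAL_PATTERN = re.compile(r"^[^/@\s]+/[^/@\s]+@[^/@\s]+$")


def hive2_kerberos_url(host, port, principal, database="default"):
    """Build a HiveServer2 JDBC URL that carries the Kerberos principal."""
    if not PRINCIPAL_PATTERN.match(principal):
        raise ValueError(f"Expected service/host@REALM, got {principal!r}")
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal}"
```

A principal that fails the shape check (for example, one missing the @REALM part) is a common reason the connection is rejected, so the helper raises an error instead of building the URL.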

Sqoop "Unsupported major.minor version" Error

If you are using Pentaho 6.0 and the Java version on your cluster is older than the Java version that Pentaho uses, you must change Pentaho's JDK so it is the same major version as the JDK on the cluster. The JDK that you install for Pentaho must meet the requirements in the Supported Components matrix. To learn how to download and install the JDK, read this article.
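
A JVM raises this error when it is asked to load a class that was compiled for a newer Java release. To see which release a class targets, you can read the version fields in the class file header. The sketch below assumes you have already extracted a .class file from the relevant jar. The function name is illustrative.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every Java class file


def class_file_java_version(path):
    """Return the Java release a .class file targets (8 for major version 52)."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8:
        raise ValueError("File is too short to be a Java class file")
    # Header layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
    magic, _minor, major = struct.unpack(">IHH", header)
    if magic != CLASS_MAGIC:
        raise ValueError("Not a Java class file (bad magic number)")
    # Major 49 is Java 5, 50 is Java 6, 51 is Java 7, 52 is Java 8.
    return major - 44
```

If the value returned for a generated class is higher than the Java version running on the cluster, the cluster JVM cannot load that class.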

Version-Specific Notes

The following are special topics for CDH.

CDH 5.4 Notes

The following notes address issues with CDH 5.4.

Simba Driver Support Note

If you are using Pentaho 6.0 or later, the CDH 5.4 shim supports the Cloudera JDBC Simba driver: Impala JDBC Connector 2.5.28 for Cloudera Enterprise. This replaces the Apache Hive JDBC driver that was supported in previous versions of the CDH 5.4 shim.

In the Database Connection window, select the Cloudera Impala option. If Impala is secured on your cluster, you also need to supply KrbHostFQDN, KrbServiceName, and KrbRealm in the Options tab. For more information on how to set up a database connection, see the database connection articles at help.pentaho.com.
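
As an illustration of the three Kerberos options, the sketch below collects them as name/value pairs and appends them to a JDBC URL in the semicolon-separated style the Impala driver uses. The helper names, host, port 21050, service name, and realm are assumptions. The driver may need further settings that are not shown here, so treat this as a picture of the parameters rather than a complete connection string.

```python
def impala_kerberos_options(krb_host_fqdn, krb_service_name, krb_realm):
    """Return the three Options-tab parameters as name/value pairs."""
    return {
        "KrbHostFQDN": krb_host_fqdn,        # fully qualified name of the Impala host
        "KrbServiceName": krb_service_name,  # service part of the Impala principal
        "KrbRealm": krb_realm,               # Kerberos realm of the cluster
    }


def impala_jdbc_url(host, port, options):
    """Append the options to a jdbc:impala URL as ;name=value pairs."""
    suffix = "".join(f";{name}={value}" for name, value in options.items())
    return f"jdbc:impala://{host}:{port}{suffix}"
```

Together, the three values describe a principal of the form KrbServiceName/KrbHostFQDN@KrbRealm, which is a quick way to check them against what your Hadoop Administrator gave you.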

You will need to install the driver in the shim directory for each Pentaho component (e.g., Spoon, Pentaho Server, PRD) you want to use.  

  1. Download the Impala JDBC Connector 2.5.28 for Cloudera Enterprise driver.
  2. Copy the ImpalaJDBC41.jar to the pentaho-big-data-plugin/hadoop-configurations/cdhxx/lib directory.
  3. Stop and restart the component.

CDH 5.3 Notes

The following notes address issues with CDH 5.3.

Configuring High Availability for CDH 5.3

If you are configuring CDH 5.3 to be used in High Availability mode, we recommend that you use the Cloudera Manager "Download Client Configuration" feature. The Download Client Configuration feature provides a convenient way to get configuration files from the cluster for a service (such as HBase, HDFS, or YARN). Use this feature to download and unzip the configuration zip files to the pentaho-big-data-plugin/hadoop-configurations/cdh53 directory.

For more information on how to do this, see Cloudera documentation: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_client_config.html
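
The unzip step can be scripted once the zip files are downloaded. The sketch below makes two assumptions that you should check against your own archives: that the files of interest are the *-site.xml files, and that they may sit inside a folder in the zip and should land directly in the shim directory. The function name is hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_site_files(zip_path, shim_dir):
    """Write every *-site.xml in the archive directly into shim_dir; return the names."""
    shim_dir = Path(shim_dir)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            name = PurePosixPath(member).name  # drop any folder prefix inside the zip
            if name.endswith("-site.xml"):
                (shim_dir / name).write_bytes(archive.read(member))
                extracted.append(name)
    return sorted(extracted)
```

Run it once per downloaded service archive, with the cdh53 shim directory as the second argument. Files with the same name are overwritten, so back up the shim directory first if you have edited its configuration by hand.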

Troubleshoot Cluster and Service Configuration Issues

This section explains how to resolve common configuration problems.

Shim and Configuration Issues

Symptom: No shim

Common causes:
  • Active shim was not selected.
  • Shim was installed in the wrong place.
  • Shim name was not entered correctly in the plugin.properties file.

Common resolutions:
  • Verify that the shim name in the plugin.properties file matches the directory name in the pentaho-big-data-plugin/hadoop-configurations directory.
  • Make sure the shim is installed in the correct place.
  • Check the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article for more details on how to verify the shim name and shim installation directory.

Symptom: Shim doesn't load

Common causes:
  • Required licenses are not installed.
  • You tried to load a shim that is not supported by your version of Pentaho.
  • If you are using MapR, the client might not have been installed correctly.
  • Configuration file changes were made incorrectly.

Common resolutions:
  • Verify the required licenses are installed and have not expired.
  • Verify that the shim is supported by your version of Pentaho. Find your version of Pentaho, then look for the corresponding Components Reference for more details.
  • Verify that configuration file changes were made correctly.  Contact your Hadoop Administrator or see the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article.
  • If you are connecting to MapR, verify that the client was properly installed.  See MapR documentation for details.
  • Restart Spoon, then test again.
  • If this error continues to occur, files might be corrupted.  Download a new copy of the shim from the Pentaho Customer Support Portal.

Symptom: The file system's URL does not match the URL in the configuration file.

Common causes:
  • Configuration files (*-site.xml files) were not configured properly.

Common resolutions:
  • Verify that the configuration files, especially the core-site.xml file, are configured correctly.  See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster section of the Configuration article for details.
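
The first check for the "No shim" symptom, comparing the name in plugin.properties with the directories under hadoop-configurations, can be automated. This is a minimal sketch that assumes simple key=value lines in plugin.properties. The function names and the wording of the returned messages are illustrative.

```python
from pathlib import Path


def read_active_shim(plugin_dir):
    """Return the active.hadoop.configuration value from plugin.properties, or None."""
    props_file = Path(plugin_dir) / "plugin.properties"
    for line in props_file.read_text(encoding="latin-1").splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # skip comments and lines that are not key=value
        key, value = stripped.split("=", 1)
        if key.strip() == "active.hadoop.configuration":
            return value.strip()
    return None


def diagnose_shim(plugin_dir):
    """Return a short diagnosis string for the 'No shim' symptom."""
    active = read_active_shim(plugin_dir)
    if not active:
        return "no active shim set"
    config_root = Path(plugin_dir) / "hadoop-configurations"
    if not config_root.is_dir():
        return "hadoop-configurations directory not found"
    if (config_root / active).is_dir():
        return f"ok: {active}"
    available = sorted(p.name for p in config_root.iterdir() if p.is_dir())
    return f"no directory named {active!r}; found {available}"
```

Pass the pentaho-big-data-plugin directory of the affected component. A misspelled property name shows up as "no active shim set", and a mistyped shim name is reported next to the list of directories that do exist.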

 

Connection Problems

Symptom: Hostname incorrect or not resolving properly.

Common causes:
  • No hostname has been specified.
  • Hostname/IP address is incorrect.
  • Hostname is not resolving properly in the DNS.

Common resolutions:
  • Verify that the hostname/IP address is correct.
  • Check the DNS to make sure the hostname is resolving properly.

Symptom: Port number is incorrect.

Common causes:
  • No port number has been specified.
  • Port number is incorrect.
  • Port number is not numeric.

Common resolutions:
  • Verify that the port number is correct.
  • If you don't have a port number, determine whether your cluster has been enabled for high availability. If it has, then you do not need a port number.

Symptom: Can't connect.

Common causes:
  • Firewall is a barrier to connecting.
  • Other networking issues are occurring.

Common resolutions:
  • Verify that a firewall is not impeding the connection and that there aren't other network issues.
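
The hostname, port, and connectivity checks above can be run in order from the machine where the Pentaho component is installed. The following is a small sketch using only the Python standard library. The function name and the returned messages are illustrative, and a successful TCP connection shows only that the port is reachable, not that the service behind it is healthy.

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Return 'ok' or a message naming the first check that failed."""
    if not host:
        return "no hostname specified"
    try:
        port_number = int(port)
    except (TypeError, ValueError):
        return "port is not numeric"
    if not 0 < port_number < 65536:
        return "port out of range"
    try:
        socket.getaddrinfo(host, port_number)  # DNS or hosts-file lookup
    except socket.gaierror:
        return "hostname does not resolve"
    try:
        with socket.create_connection((host, port_number), timeout=timeout):
            return "ok"
    except OSError:
        return "cannot connect (firewall, network, or service down)"
```

Call it with the values you entered in the Hadoop cluster window, for example the HDFS hostname and port, then repeat for the JobTracker and ZooKeeper entries.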

Directory Access or Permissions Issues

Symptom: Can't access directory.

Common causes:
  • Authorization and/or authentication issues.
  • Directory is not on the cluster.

Common resolutions:
  • Make sure the user has been granted read, write, and execute access to the directory.
  • Ensure security settings for the cluster and shim allow access.
  • Verify the hostname and port number are correct for the Hadoop File System's namenode.

Symptom: Can't create, read, update, or delete files or directories

Common causes:
  • Authorization and/or authentication issues.

Common resolutions:
  • Make sure the user has been granted execute access to the directory.
  • Ensure security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop File System's namenode.

Symptom: Test file cannot be overwritten.

Common causes:
  • Pentaho test file is already in the directory.

Common resolutions:
  • A file with the same name as the Pentaho test file is already in the directory.  The test file is used to make sure that the user can create, write, and delete in the user's home directory.
  • The test was run, but the file was not deleted.  You will need to manually delete the test file.  Check the log for the test file name.
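
To see which of the read, write, and execute permissions a user actually has on a directory, you can interpret one line of `hdfs dfs -ls` output. The sketch below follows the basic HDFS permission model, where the owner, group, and other classes are checked in that order. It ignores ACLs and the HDFS superuser, who bypasses permission checks. The function name and the sample users are hypothetical.

```python
def effective_permissions(ls_line, user, user_groups=()):
    """Return the subset of 'rwx' that applies to user for one `hdfs dfs -ls` line."""
    # Fields: permissions, replication, owner, group, size, date, time, path
    fields = ls_line.split()
    mode, owner, group = fields[0], fields[2], fields[3]
    if user == owner:
        triplet = mode[1:4]   # owner bits
    elif group in user_groups:
        triplet = mode[4:7]   # group bits
    else:
        triplet = mode[7:10]  # everyone else
    triplet = triplet.replace("t", "x")  # sticky bit with execute still grants execute
    return "".join(flag for flag in triplet if flag in "rwx")
```

For a line such as "drwxr-x---   - alice hadoop          0 2015-06-01 12:00 /user/alice", the owner alice gets "rwx", a member of the hadoop group gets "rx", and any other user gets an empty string, which matches the "Can't access directory" symptom.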

Oozie Issues

Symptom: Can't connect to Oozie.

Common causes:
  • Firewall issue.
  • Other networking issues.
  • Oozie URL is incorrect.

Common resolutions:
  • Verify that the Oozie URL was correctly entered.
  • Verify that a firewall is not impeding the connection.
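
One way to confirm that the Oozie URL is correct is to request the server status through the Oozie web services API. The sketch below assumes an unsecured Oozie server. A Kerberos-secured server also requires SPNEGO authentication, which is not shown. The function names and the example host are illustrative.

```python
import json
from urllib.request import urlopen


def oozie_status_url(oozie_url):
    """Return the admin status endpoint for an Oozie base URL."""
    return oozie_url.rstrip("/") + "/v1/admin/status"


def oozie_system_mode(oozie_url, timeout=10):
    """Fetch the status document and return its systemMode value."""
    with urlopen(oozie_status_url(oozie_url), timeout=timeout) as response:
        return json.load(response).get("systemMode")
```

For a base URL such as http://oozie.example.com:11000/oozie, a healthy server normally reports the system mode NORMAL. A connection error or an HTTP 404 response suggests that the host, port, or path in the URL is wrong, or that a firewall is blocking the request.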

ZooKeeper Problems

Symptom: Can't connect to ZooKeeper.

Common causes:
  • Firewall is hindering connection with the ZooKeeper service.
  • Other networking issues.

Common resolutions:
  • Verify that a firewall is not impeding the connection.

Symptom: ZooKeeper hostname or port not found or doesn't resolve properly.

Common causes:
  • Hostname/IP address or port number is missing or incorrect.

Common resolutions:
  • Try to connect to the ZooKeeper nodes using ping or another method.
  • Verify that the hostname/IP address and port numbers are correct.
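
Besides ping, one other method is ZooKeeper's "ruok" four-letter command, which a running server answers with "imok". The following is a minimal sketch, assuming the default client port 2181 and a server that accepts four-letter commands (newer ZooKeeper releases can restrict them). The function name is illustrative.

```python
import socket


def zookeeper_is_ok(host, port=2181, timeout=5.0):
    """Send 'ruok' to a ZooKeeper node; True only if it answers 'imok'."""
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"ruok")
            while len(reply) < 4:
                chunk = conn.recv(16)
                if not chunk:  # server closed the connection
                    break
                reply += chunk
    except OSError:
        return False  # unresolved host, refused connection, or timeout
    return reply == b"imok"
```

A False result for a host that answers ping points to a wrong port, a firewall rule, or a ZooKeeper service that is not running.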
