
Connect to a Hadoop Cluster in Spoon

Overview

This article explains how to connect Pentaho to a Hadoop cluster from Spoon.

To connect Pentaho to a Hadoop cluster you will need to do two things:

  1. Set the active shim
  2. Create and test the connection

A shim is a bit like an adapter: it enables Pentaho to connect to a Hadoop distribution, such as Cloudera Distribution for Hadoop (CDH).  The active shim is used by default when you run big data transformations, jobs, and reports. When you first install Pentaho, no shim is active, so setting one is the first thing you need to do before you try to connect to a Hadoop cluster.

After the active shim is set, you must configure, then test, the connection.  Spoon has built-in tools to help you do this.

Before You Begin

Before you begin, make sure that your Hadoop Administrator has granted you permission to access the HDFS directories you need.  This typically includes your home directory as well as any other directories you need to do your work.  Your Hadoop Administrator should also have already configured the Pentaho installation on your computer to connect to the Hadoop cluster.  For more details on how to do this, see the Set Up Pentaho to Connect to an Apache Hadoop Cluster article.  You also need to know these things:

  • Distribution and version of the cluster (for example, Cloudera Distribution 5.4)
  • IP addresses and port numbers for HDFS, JobTracker, and ZooKeeper (if used)
  • Oozie URL (if used)
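
For illustration, a completed checklist might look like the following.  The hostnames are placeholders, and the ports shown are common defaults that vary by distribution and version; confirm the actual values with your Hadoop Administrator:

    Distribution:  Cloudera Distribution 5.4
    HDFS:          namenode.example.com, port 8020
    JobTracker:    jobtracker.example.com, port 8021
    ZooKeeper:     zookeeper.example.com, port 2181
    Oozie URL:     http://oozie.example.com:11000/oozie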

Set the Active Shim in Spoon

Set the active shim when you want to connect to a Hadoop cluster the first time, or when you want to switch clusters.  Only one shim can be active at a time.

  1. Start Spoon.
  2. Select Hadoop Distribution... from the Tools menu.

[Screenshot: Hadoop Distribution window]

  3. In the Hadoop Distribution window, select the Hadoop distribution you want.
  4. Click OK.
  5. Stop, then restart Spoon.

Configure and Test the Cluster Connection

Configured connection information is available for reuse in other steps and entries.  Whether you are connected to the Pentaho Repository when you create the connection determines who can reuse it.

  • If you are connected to the Pentaho Repository when you create the connection, you and other users can reuse the connection.    
  • If you are not connected to the Pentaho Repository when you create the connection, only you can reuse the connection. 

Open the Hadoop Cluster Window

Connection settings are specified in the Hadoop cluster window.  You can open this window from these places:

  • Steps and Entries
  • View tab in a transformation or job
  • Repository Explorer window

Steps and Entries

  1. Create a new job or transformation or open an existing one.
  2. Add a step or entry that can connect to a Hadoop cluster to the Spoon canvas.  
  3. Click the New button next to the Hadoop Cluster field. The Hadoop cluster window appears. 
  4. Configure and Test the Hadoop Cluster connection.

View Tab

  1. In Spoon, create a new job or transformation or open an existing one.
  2. Click the View tab.

[Screenshot: View tab showing the Hadoop clusters folder]

  3. Right-click the Hadoop cluster folder, then click New.  The Hadoop cluster window appears.
  4. Configure and Test the Hadoop Cluster connection.

Repository Explorer

  1. In Spoon, connect to the repository where you want to store the transformation or job.
  2. Select Repository from the Tools menu.
  3. Select Explore to open the Repository Explorer window.
  4. Click the Hadoop clusters tab.
  5. Click the New button. The Hadoop Cluster window appears.
  6. Configure and Test the Hadoop Cluster connection.

Configure and Test Connection

Once you have opened the Hadoop cluster window from a step or entry, the View tab, or the Repository Explorer window, configure the connection.

  1. Enter information in the Hadoop cluster window.  You can get most of the information you need from your Hadoop Administrator.

As a best practice, use Kettle variables for each connection parameter value to mitigate risks associated with running jobs and transformations in environments that are disconnected from the repository. 
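
For example, you could define a variable for each connection parameter in your kettle.properties file (in the .kettle directory under your home directory).  The variable names below are arbitrary examples and the values are placeholders; substitute your own:

    # kettle.properties -- example variable definitions (placeholder values)
    HDFS_HOSTNAME=namenode.example.com
    HDFS_PORT=8020
    JOBTRACKER_HOSTNAME=jobtracker.example.com
    JOBTRACKER_PORT=8021
    ZOOKEEPER_HOSTNAME=zookeeper.example.com
    ZOOKEEPER_PORT=2181
    OOZIE_URL=http://oozie.example.com:11000/oozie

Then, in the Hadoop cluster window, reference the variables instead of literal values: enter ${HDFS_HOSTNAME} in the HDFS Hostname field, ${HDFS_PORT} in the HDFS Port field, and so on.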

[Screenshot: Hadoop cluster window]

Option definitions:

  • Cluster Name: Name that you assign to the cluster connection.
  • Use MapR Client: Indicates that this connection is for a MapR cluster.  If this box is checked, the fields in the HDFS and JobTracker sections are disabled because those parameters are not needed to configure MapR.
  • Hostname (in HDFS section): Hostname for the HDFS node in your Hadoop cluster.
  • Port (in HDFS section): Port for the HDFS node in your Hadoop cluster.
  • Username (in HDFS section): Username for the HDFS node.
  • Password (in HDFS section): Password for the HDFS node.
  • Hostname (in JobTracker section): Hostname for the JobTracker node in your Hadoop cluster.  If you have a separate JobTracker node, enter its hostname here; otherwise use the HDFS hostname.
  • Port (in JobTracker section): Port for the JobTracker node in your Hadoop cluster.  This cannot be the same as the HDFS port number.
  • Hostname (in ZooKeeper section): Hostname for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
  • Port (in ZooKeeper section): Port for the ZooKeeper node in your Hadoop cluster.  Supply this only if you want to connect to a ZooKeeper service.
  • URL (in Oozie section): Oozie client address.  Supply this only if you want to connect to the Oozie service.
  2. Click the Test button.  Test results appear in the Hadoop Cluster Test window.  If you have problems, see Troubleshoot Connection Issues to resolve the issues, then test again.

[Screenshot: Hadoop Cluster Test window]

  3. If there are no more errors, congratulations!  The connection is properly configured.  Click the Close button to close the Hadoop Cluster Test window.
  4. When complete, click the OK button to close the Hadoop cluster window.

Troubleshoot Connection Issues

General Configuration Problems

This section explains how to resolve common configuration problems.

Shim and Configuration Issues

Symptom: No shim

Common causes:
  • Active shim was not selected.
  • Shim was installed in the wrong place.
  • Shim name was not entered correctly in the plugin.properties file.

Common resolutions:
  • Verify that the plugin name in the plugin.properties file matches the directory name in the pentaho-big-data-plugin/hadoop-configurations directory.
  • Make sure the shim is installed in the correct place.
  • Check the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to an Apache Hadoop Cluster article for more details on how to verify the plugin name and shim installation directory.
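
For reference, the active shim is recorded in the plugin.properties file.  A sketch, assuming the shim directory is named cdh54 (your directory name depends on your distribution and version):

    # pentaho-big-data-plugin/plugin.properties (excerpt)
    # The value must match a directory name under pentaho-big-data-plugin/hadoop-configurations
    active.hadoop.configuration=cdh54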

Symptom: Shim doesn't load

Common causes:
  • Required licenses are not installed.
  • You tried to load a shim that is not supported by your version of Pentaho.
  • If you are using MapR, the client might not have been installed correctly.
  • Configuration file changes were made incorrectly.

Common resolutions:
  • Verify that the required licenses are installed and have not expired.
  • Verify that the shim is supported by your version of Pentaho.  Find your version of Pentaho, then look for the corresponding Components Reference for more details.
  • Verify that configuration file changes were made correctly.  Contact your Hadoop Administrator or see the Set Up Pentaho to Connect to an Apache Hadoop Cluster article.
  • If you are connecting to MapR, verify that the client was properly installed.  See the MapR documentation for details.
  • Restart Spoon, then test again.
  • If this error continues to occur, files might be corrupted.  Download a new copy of the shim from the Pentaho Customer Support Portal.

Symptom: The file system's URL does not match the URL in the configuration file.

Common causes:
  • Configuration files (*-site.xml files) were not configured properly.

Common resolutions:
  • Verify that the configuration files, especially core-site.xml, are configured correctly.  See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to an Apache Hadoop Cluster article for details.
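
As an illustration, the default file system URL in core-site.xml must agree with the hostname and port entered in the Hadoop cluster window.  A minimal sketch, with placeholder values:

    <!-- core-site.xml (excerpt) -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:8020</value>
      </property>
    </configuration>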

Connection Problems

Symptom: Hostname is incorrect or not resolving properly.

Common causes:
  • No hostname has been specified.
  • Hostname/IP address is incorrect.
  • Hostname is not resolving properly in the DNS.

Common resolutions:
  • Verify that the hostname/IP address is correct.
  • Check the DNS to make sure the hostname is resolving properly.

Symptom: Port number is incorrect.

Common causes:
  • No port number has been specified.
  • Port number is incorrect.
  • Port number is not numeric.

Common resolutions:
  • Verify that the port number is correct.
  • If you don't have a port number, determine whether your cluster has been enabled for high availability.  If it has, you do not need a port number.

Symptom: Can't connect.

Common causes:
  • A firewall is blocking the connection.
  • Other networking issues are occurring.

Common resolutions:
  • Verify that a firewall is not impeding the connection and that there are no other network issues.
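
To help rule out DNS and firewall problems, you can probe the cluster from the machine running Spoon.  A sketch with a placeholder hostname and port:

    # Check that the hostname resolves and the host is reachable
    ping namenode.example.com

    # Check that the HDFS port accepts connections (requires netcat)
    nc -vz namenode.example.com 8020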

Directory Access or Permissions Issues

Symptom: Can't access directory.

Common causes:
  • Authorization and/or authentication issues.
  • Directory is not on the cluster.

Common resolutions:
  • Make sure the user has been granted read, write, and execute access to the directory.
  • Ensure that the security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's namenode.

Symptom: Can't create, read, update, or delete files or directories.

Common causes:
  • Authorization and/or authentication issues.

Common resolutions:
  • Make sure the user has been granted execute access to the directory.
  • Ensure that the security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's namenode.

Symptom: Test file cannot be overwritten.

Common causes:
  • A file with the same name as the Pentaho test file is already in the directory.  The test file is used to make sure that the user can create, write, and delete in the user's home directory.

Common resolutions:
  • The test was run, but the test file was not deleted.  Manually delete the test file; check the log for the test file name.
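
If you have shell access to a machine with the Hadoop client installed, you can verify directory permissions directly.  A sketch with a placeholder user and path:

    # List your home directory to confirm read access
    hdfs dfs -ls /user/yourname

    # Create, then delete, an empty test file to confirm write and delete access
    hdfs dfs -touchz /user/yourname/pentaho-connect-test
    hdfs dfs -rm /user/yourname/pentaho-connect-test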

Oozie Issues

Symptom: Can't connect to Oozie.

Common causes:
  • Firewall issue.
  • Other networking issues.
  • Oozie URL is incorrect.

Common resolutions:
  • Verify that the Oozie URL was correctly entered.
  • Verify that a firewall is not impeding the connection.
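
One way to confirm that the Oozie URL is correct and reachable is to query the Oozie server status.  A sketch, assuming the Oozie client is installed and using a placeholder URL:

    # A correct, reachable URL reports the system mode (for example, NORMAL)
    oozie admin -oozie http://oozie.example.com:11000/oozie -status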

ZooKeeper Problems

Symptom: Can't connect to ZooKeeper.

Common causes:
  • A firewall is hindering the connection to the ZooKeeper service.
  • Other networking issues.

Common resolutions:
  • Verify that a firewall is not impeding the connection.

Symptom: ZooKeeper hostname or port is not found or does not resolve properly.

Common causes:
  • Hostname/IP address or port number is missing or incorrect.

Common resolutions:
  • Try to connect to the ZooKeeper nodes by using ping or another method.
  • Verify that the hostname/IP address and port numbers are correct.
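
You can also check a ZooKeeper node directly with the four-letter ruok command; the hostname and port below are placeholders:

    # Requires netcat; a healthy ZooKeeper server answers "imok"
    echo ruok | nc zookeeper.example.com 2181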

Manage Existing Hadoop Cluster Connections

Once cluster connections have been created, you can manage them.

  • Edit Hadoop Cluster Connections
  • Duplicate Hadoop Cluster Connections
  • Delete Hadoop Cluster Connections

Edit Hadoop Cluster Connections

How updates occur depends on whether you are connected to the repository.

  • If you are connected to a repository: Hadoop Cluster connection changes are picked up by all transformations and jobs in the repository.  The Hadoop Cluster connection information is loaded during execution unless it cannot be found.  If the connection information cannot be found, the connection values that were stored when the transformation or job was saved are used instead.
  • If you are not connected to a repository: Hadoop Cluster connection changes are only picked up by your local (file system) transformations and jobs.  If you run these transformations and jobs outside of Kettle, they will not have access to the Hadoop Cluster connection, so a copy of the connection is saved as a fallback.  Note that changes to a Hadoop Cluster connection are not written to these fallback copies unless the transformations and jobs are saved again.

You can edit Hadoop cluster connections in three places:

  • Steps and entries
  • View tab
  • Repository Explorer window

Steps and Entries

To edit a Hadoop cluster connection in a step or entry, complete these steps.

  1. Open the Hadoop cluster window in a step or entry.
  2. Make changes, then click Test.
  3. Click the OK button.

View Tab

To edit a Hadoop cluster connection from the transformation or job View tab, complete these steps.

  1. Click the Hadoop clusters folder in the View tab.
  2. Right-click a connection, then select Edit.  The Hadoop cluster window appears.
  3. Make changes, then click Test.
  4. Click the OK button.

Repository Explorer

To edit a Hadoop cluster connection from the Repository Explorer window, do the following.

  1. Click the Hadoop Clusters tab in the Repository Explorer window.
  2. Select a connection, then click Edit.  The Hadoop cluster window appears.
  3. Make changes, then click Test.
  4. Click the OK button.

Duplicate a Hadoop Cluster Connection

You can only duplicate or clone a Hadoop Cluster connection in the Spoon View tab. 

  1. Click the Hadoop clusters folder in the View tab.
  2. Right-click a connection and select Duplicate.  The Hadoop cluster window appears.
  3. Enter a different name in the Cluster Name field.
  4. Make changes, then click Test.
  5. Click the OK button.

Delete a Hadoop Cluster Connection

Deleted connections cannot be restored.  However, you can still run transformations and jobs that reference them, because the deleted connection's details are stored in the transformation and job metadata files.

You can delete Hadoop cluster connections in two places:

  • View tab
  • Repository Explorer window

View Tab

To delete a Hadoop cluster connection from a transformation or job, complete these steps.

  1. Click the Hadoop clusters folder in the View tab.
  2. Right-click a Hadoop cluster connection and select Delete.
  3. A message appears asking whether you really want to delete the connection.  Click Yes.

Repository Explorer

To delete a Hadoop cluster connection from the Repository Explorer window, do the following.

  1. Connect to the Repository Explorer.
  2. Click the Hadoop Clusters tab.
  3. Select a Hadoop cluster connection, then click Delete.
  4. A message appears asking if you really want to delete the Hadoop cluster connection.  Click Yes.