Big Data Issues

Follow the suggestions in these topics to help resolve common issues when working with Big Data:

  • General Configuration Problems
  • Cannot Access Cluster with Kerberos Enabled
  • Cannot Access a Hive Cluster
  • Cannot use Keytab File to Authenticate Access to PMR Cluster
  • HBase Get Master Failed Error
  • Sqoop Import into Hive Fails
  • Cannot Start Any Pentaho Components after Setting MapR as Active Hadoop Configuration
  • The 'Group by' Step is not Supported in a Single Threaded Transformation Engine
  • Kettle Cluster on YARN Will Not Start

See Pentaho Troubleshooting articles for additional topics.

General Configuration Problems

The topics in this section explain how to resolve common configuration problems.

Shim and Configuration Issues

No shim

Common causes:
  • Active shim was not selected.
  • Shim was installed in the wrong place.
  • Shim name was not entered correctly in the plugin.properties file.

Common resolutions:
  • Verify that the plugin name in the plugin.properties file matches the directory name in the pentaho-big-data-plugin/hadoop-configurations directory (see the example after this list).
  • Verify that the shim is installed in the correct place.
  • Check the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster article for more details on how to verify the plugin name and shim installation directory.
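
For example, the active shim is selected with the active.hadoop.configuration property in plugin.properties. The following is a minimal sketch; the value cdh61 is only a placeholder and must be replaced with the name of the shim directory actually installed under hadoop-configurations:

    # pentaho-big-data-plugin/plugin.properties
    # The value must match a directory name under hadoop-configurations.
    # "cdh61" is a placeholder; substitute your installed shim directory.
    active.hadoop.configuration=cdh61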
Shim does not load

Common causes:
  • Required licenses are not installed.
  • You tried to load a shim that is not supported by your version of Pentaho.
  • If you are using MapR, the client might not have been installed correctly.
  • Configuration file changes were made incorrectly.

The file system's URL does not match the URL in the configuration file

Common causes:
  • Configuration files (*-site.xml files) were not configured properly.

Common resolutions:
  • Verify that the configuration files were configured correctly.
  • Verify that the core-site.xml file is configured correctly (see the sketch after this list). See the instructions for your Hadoop distribution in the Set Up Pentaho to Connect to a Hadoop Cluster article for details.
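
As an illustration, the default file system URL that the shim uses comes from the fs.defaultFS property in core-site.xml. The snippet below is only a sketch; the hostname and port are placeholders and must match your cluster's actual NameNode:

    <!-- inside the <configuration> element of core-site.xml in the shim directory -->
    <!-- namenode.example.com:8020 is a placeholder; use your cluster's NameNode URL -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>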

Connection Problems

Hostname does not resolve

Common causes:
  • No hostname has been specified.
  • Hostname/IP address is incorrect.
  • Hostname is not resolving properly in the DNS.

Common resolutions:
  • Verify that the hostname/IP address is correct.
  • Check the DNS to make sure the hostname is resolving properly.

Port number does not resolve

Common causes:
  • No port number has been specified.
  • Port number is incorrect.
  • Port number is not numeric.

Common resolutions:
  • Verify that the port number is correct.
  • If you do not have a port number, determine whether your cluster has been enabled for high availability. If it has, then you do not need a port number (see the example after this table).

Cannot connect to the cluster

Common causes:
  • Firewall is a barrier to connecting.
  • Other networking issues are occurring.

Common resolutions:
  • Verify that a firewall is not impeding the connection and that there are no other network issues.
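
For reference, on a cluster enabled for high availability, fs.defaultFS points to a nameservice rather than a single host and port, which is why no port number is needed. The nameservice name below is only an example:

    <!-- core-site.xml on an HA-enabled cluster: the URL has no port number -->
    <property>
      <name>fs.defaultFS</name>
      <!-- "mycluster" is an example nameservice defined by dfs.nameservices in hdfs-site.xml -->
      <value>hdfs://mycluster</value>
    </property>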

Directory Access or Permissions Issues

Cannot access directory

Common causes:
  • Authorization and/or authentication issues.
  • Directory is not on the cluster.

Common resolutions:
  • Verify that the user has been granted read, write, and execute access to the directory.
  • Verify that the security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's NameNode.

Cannot create, read, update, or delete files or directories

Common causes:
  • Authorization and/or authentication issues.

Common resolutions:
  • Verify that the user has been authorized execute access to the directory.
  • Verify that the security settings for the cluster and shim allow access.
  • Verify that the hostname and port number are correct for the Hadoop file system's NameNode.

Test file cannot be overwritten

Common causes:
  • Pentaho test file is already in the directory.
  • A file with the same name as the Pentaho test file is already in the directory. The test file is used to make sure that the user can create, write, and delete in the user's home directory.

Common resolutions:
  • The test was run, but the file was not deleted. You will need to manually delete the test file. Check the log for the test file name.

Oozie Issues

Cannot connect to Oozie

Common causes:
  • Firewall issue.
  • Other networking issues.
  • Oozie URL is incorrect.

Common resolutions:
  • Verify that the Oozie URL was correctly entered (an example URL format follows this list).
  • Verify that a firewall is not impeding the connection.
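
As a point of reference, the Oozie URL normally points to the Oozie server host and the oozie web application. The hostname below is a placeholder, and 11000 is only the customary default port; your cluster may use a different one:

    http://oozie-host.example.com:11000/oozie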

Zookeeper Problems

Cannot connect to Zookeeper

Common causes:
  • Firewall is impeding connection with the Zookeeper service.
  • Other networking issues.

Common resolutions:
  • Verify that a firewall is not impeding the connection.

Zookeeper hostname or port not found or does not resolve properly

Common causes:
  • Hostname/IP address and port number are missing or incorrect.

Common resolutions:
  • Try to connect to the Zookeeper nodes using ping or another method.
  • Verify that the hostname/IP address and port numbers are correct (an example connection string follows this list).
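
For example, a Zookeeper connection is usually given as a comma-separated list of host:port pairs. The hostnames below are placeholders, and 2181 is only the customary default client port:

    zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181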

Cannot Access Cluster with Kerberos Enabled

If a step or entry cannot access a Kerberos authenticated cluster, review the steps in Use Impersonation to Access a MapR Cluster.

If this issue persists, verify that the username, password, UID, and GID of each impersonated or spoofed user are the same on every node. When a user is deleted and then recreated, the user may be assigned different UID and GID values, which causes this issue.

Cannot Access a Hive Cluster

If you cannot use Kerberos impersonation to authenticate and access a Hive cluster, review the steps in Use Impersonation to Access a MapR Cluster.

If this issue persists, copy the hive-site.xml file from the Hive server to the MapR distribution in these directories:

  • Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]

  • PDI Client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]

If this still does not work, disable pooled connections for Hive.

Cannot use Keytab File to Authenticate Access to PMR Cluster

If you cannot authenticate and gain access to the PMR cluster, copy the keytab file to each task tracker node on the PMR cluster.

HBase Get Master Failed Error

If the "HBase cannot negotiate the authenticated portion of the connection" error occurs, copy the hbase-site.xml file from the HBase server to the MapR distribution in these directories:

  • Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]

  • PDI Client: data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/[mapr distribution]
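
For illustration, the hbase-site.xml file copied from the HBase server typically carries the Zookeeper quorum and client port that HBase clients use to reach the cluster. The hostnames and port in this sketch are placeholders only:

    <!-- fragment of hbase-site.xml copied from the HBase server -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>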

Sqoop Import into Hive Fails

If a Sqoop import into Hive fails to execute on a remote installation, the local Hive installation's configuration does not match the Hadoop cluster connection information used to perform the Sqoop job.

Verify the Hadoop connection information used by the local Hive installation is configured the same as the Sqoop job entry.

Cannot Start Any Pentaho Components after Setting MapR as Active Hadoop Configuration

If you set MapR as your active Hadoop configuration but cannot start any Pentaho component (Pentaho Server, Spoon, Report Designer, or the Metadata Editor), verify that MapR has been configured properly.

As you review the instructions for configuring MapR, make sure that you have copied the required JAR files to the pentaho-big-data-plugin/hadoop-configurations/mapr3x folder for each component listed. For information on how to configure MapR, see the configuration instructions for your MapR distribution in the Set Up Pentaho to Connect to a Hadoop Cluster article.

The 'Group by' Step is not Supported in a Single Threaded Transformation Engine 

If you have a job that contains both a Pentaho MapReduce entry and a Reducer transformation with a Group by step, you may receive a Step 'Group by' of type 'GroupBy' is not Supported in a Single Threaded Transformation Engine error message. This error can occur if:

  • An entire set of rows sharing the same grouping key is filtered from the transformation before the Group by step.
  • The Reduce single threaded option in the Pentaho MapReduce entry's Reducer tab is selected.

To fix this issue, open the Pentaho MapReduce entry and deselect the Reduce single threaded option in the Reducer tab.

Kettle Cluster on YARN Will Not Start

When you are using the Start a PDI Cluster on YARN job entry, the Kettle cluster may not start.

Verify that the Default FS setting matches the configured hostname for the HDFS NameNode, then try starting the Kettle cluster again.
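
For illustration, the Default FS value in the job entry should be identical to the fs.defaultFS value in the cluster's core-site.xml. The hostname and port below are placeholders only:

    Default FS (Start a PDI Cluster on YARN entry):  hdfs://namenode.example.com:8020
    fs.defaultFS (cluster core-site.xml):            hdfs://namenode.example.com:8020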