Set up Pentaho to connect to a Hortonworks cluster

Before you begin

Complete the following tasks before you configure Pentaho to connect to the cluster.

Procedure

Check the Components Reference to verify that your Pentaho version supports your version of the HDP cluster.
Set up an HDP cluster.
Complete the following tasks so that Pentaho can connect to the HDP cluster:
1. Configure an HDP cluster.
  See the Hortonworks documentation if you need help.
2. Install any required services and service client tools.
3. Test the cluster.
Get the connection information for the cluster and services that you will use from your Hadoop administrator, or from Ambari or other cluster management tools.
Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find this property in the hdfs-site.xml file on your cluster or in the cluster management application.
Read the Notes section to review special configuration instructions for your version of HDP.

Set up a secured cluster

If you are connecting to an HDP cluster that is secured with Kerberos, you must also perform the following actions:

Procedure

Configure Kerberos security on the cluster, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.
Configure the name, data, secondary name, job tracker, and task tracker nodes to accept remote connection requests.
If you have deployed Hadoop using an enterprise-level program, set up Kerberos for the name, data, secondary name, job tracker, and task tracker nodes.
Add the user account credential for each PDI client user that should have access to the Hadoop cluster to the Kerberos database.
Make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database.
Add operating system user accounts if necessary.
Note: The user account UIDs must be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
Set up Kerberos on your Pentaho computers. For instructions, see Set Up Kerberos for Pentaho.

Edit configuration files on clusters

Pentaho-specific edits to configuration files on the cluster are described in this section.

Oozie

The Oozie user runs Oozie jobs by default. If you use PDI to start an Oozie job, you must add the PDI user to the oozie-site.xml file on the cluster so that the PDI user can run the job as a proxy user. If you plan to use the Oozie service, complete these instructions:

Procedure

Open the oozie-site.xml file on the cluster.

Add the following lines to the oozie-site.xml file on the cluster, replacing <your_pdi_user_name> with the PDI user name, such as jdoe.

<property>
<name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
<value>*</value>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
<value>*</value>
</property>

Save and close the file.
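
The two properties above differ only in their suffix (groups and hosts), so they can be generated from the PDI user name. The following Python sketch builds the XML text with the standard library. The function name oozie_proxyuser_properties is hypothetical and is not part of Pentaho or Oozie; the sketch only produces text and does not edit any file on the cluster.

```python
import xml.etree.ElementTree as ET


def oozie_proxyuser_properties(pdi_user: str) -> str:
    """Return the two proxy-user <property> blocks for oozie-site.xml.

    Hypothetical helper for illustration only: it builds text and does not
    touch any file on the cluster.
    """
    blocks = []
    for suffix in ("groups", "hosts"):
        prop = ET.Element("property")
        name = ET.SubElement(prop, "name")
        name.text = f"oozie.service.ProxyUserService.proxyuser.{pdi_user}.{suffix}"
        value = ET.SubElement(prop, "value")
        value.text = "*"  # same wildcard value as in the snippet above
        blocks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(blocks)


if __name__ == "__main__":
    # jdoe is the sample user name used in this section.
    print(oozie_proxyuser_properties("jdoe"))
```

Paste the generated blocks inside the existing configuration element of oozie-site.xml rather than replacing the file.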

Configure Pentaho component shims

You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster:

PDI client (Spoon)
Pentaho Server, including Analyzer and Pentaho Interactive Reports
Pentaho Report Designer (PRD)
Pentaho Metadata Editor (PME)

As a best practice, configure the shim in the PDI client first. The PDI client has features that will help you test your configuration. Then copy the tested PDI client configuration files to other components, making changes if necessary.

You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from the PDI client. If you do not plan to connect to the cluster from the PDI client, you can configure the shim in another component first instead.

Step 1: Locate the Pentaho Big Data plugin and shim directories

Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the location of this directory for each component to complete the shim configuration and testing tasks.

Note: <pentaho home> is the directory where Pentaho is installed.

Components	Location of Pentaho Big Data Plugin Directory
PDI client	`<pentaho home>`/design-tools/data-integration/plugins/pentaho-big-data-plugin
Pentaho Server	`<pentaho home>`/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
Pentaho Report Designer	`<pentaho home>`/design-tools/report-designer/plugins/pentaho-big-data-plugin
Pentaho Metadata Editor	`<pentaho home>`/design-tools/metadata-editor/plugins/pentaho-big-data-plugin

Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the distribution's version number. The version number does not contain a decimal point. For example, the shim directory named cdh512 is the shim for CDH (Cloudera's Distribution of Apache Hadoop), version 5.12. The following table lists the shim directory abbreviations.

Abbreviation	Shim
cdh	Cloudera's Distribution of Apache Hadoop
emr	Amazon Elastic MapReduce
hdi	Microsoft Azure HDInsight
hdp	Hortonworks Data Platform
mapr	MapR

Step 2: Select the correct shim

Pentaho often supports more than one version of a Hadoop distribution, but the Pentaho Suite download contains only the latest supported, Pentaho-certified version of the shim. You can download the other supported shim versions from the Pentaho Customer Support Portal.

Before you begin, verify that the shim you want is supported by your version of Pentaho shown in the Components Reference.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories.
If the shim you want to use is already there, you can go to Step 3: Copy the configuration files from cluster to shim.
On the Customer Portal home page, sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet.
In the search box, enter the name of the shim you want, then select the shim from the search results.
(Optional) You can browse the shims by version on the Downloads page.
Read all prerequisites, warnings, and instructions.
At the bottom of the page, in the Box widget, click the shim ZIP file to download it.
Unzip the downloaded shim package into the pentaho-big-data-plugin/hadoop-configurations directory.

Step 3: Copy the configuration files from cluster to shim

Copying configuration files from the cluster to the shim helps keep key configuration settings in sync with the cluster and reduces configuration errors.

Procedure

Back up the existing HDP shim files in the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory, where xx is the version number of your HDP shim.
Copy the following configuration files from the HDP cluster to pentaho-big-data-plugin/hadoop-configurations/hdpxx (overwriting the existing files):
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
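
The backup and copy steps can be scripted. The following Python sketch assumes that the cluster files have already been downloaded to a local directory; the function name refresh_shim_configs and all three directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# File names taken from the list above.
CLUSTER_CONFIG_FILES = [
    "core-site.xml",
    "hbase-site.xml",
    "hdfs-site.xml",
    "hive-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
]


def refresh_shim_configs(cluster_conf_dir, shim_dir, backup_dir):
    """Back up the shim's current files, then overwrite them with cluster copies.

    All three arguments are hypothetical local paths. The shim directory must
    already exist. Returns the names of the files that were copied.
    """
    cluster_conf_dir = Path(cluster_conf_dir)
    shim_dir = Path(shim_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in CLUSTER_CONFIG_FILES:
        source = cluster_conf_dir / name
        target = shim_dir / name
        if not source.is_file():
            continue  # the cluster may not run every service
        if target.is_file():
            shutil.copy2(target, backup_dir / name)  # keep the old version
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```

Files that are missing from the download directory are skipped rather than treated as errors, so compare the returned list against the services you actually use.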

Step 4: Edit the shim configuration files

You need to verify or change authentication, Oozie, Hive, MapReduce, and YARN settings in the following files:

core-site.xml
config.properties
hbase-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml

Edit configuration properties (unsecured cluster)

If you are connecting to an unsecured cluster, perform the following steps:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the config.properties file.
(Optional) To access the Oozie service through a proxy, add the proxy user name to the pentaho.oozie.proxy.user parameter.
If you are not using a proxy, leave the parameter set to oozie.
Verify the pentaho.authentication.default.mapping.impersonation.type parameter is set to disabled.
If not, change it to disabled.
Add the java.system.hdp.version parameter and set it to the version of your HDP cluster.
For HDP 2.2, the version is 2.2.0.0-2041.
Save and close the file.
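
To double-check the edited file, a short script can read config.properties and report settings that deviate from the steps above. The following Python sketch uses a simplified key=value reader, not the full Java properties format; the function names parse_properties and unsecured_problems are hypothetical.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # or ! comments.

    A simplified reader for illustration; it ignores the escape and
    line-continuation rules of the full Java properties format.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def unsecured_problems(props: dict) -> list:
    """List deviations from the unsecured-cluster settings described above."""
    problems = []
    impersonation = props.get("pentaho.authentication.default.mapping.impersonation.type")
    if impersonation != "disabled":
        problems.append("impersonation type should be 'disabled'")
    if not props.get("pentaho.oozie.proxy.user"):
        problems.append("pentaho.oozie.proxy.user is empty (use 'oozie' without a proxy)")
    if not props.get("java.system.hdp.version"):
        problems.append("java.system.hdp.version is missing")
    return problems
```

An empty list means that the three settings from this procedure are present; it does not verify that the HDP version string matches your cluster.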

Edit configuration properties (secured cluster)

If you are connecting to a secured cluster, add Kerberos information to the config.properties file. If you plan to use secure impersonation to access your cluster, see Use secure impersonation with Hortonworks before editing the config.properties file.

Perform the following steps to add Kerberos information to the config.properties file:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the config.properties file with any text editor.
If you plan to access the Oozie service through a proxy, add the proxy user's name to the pentaho.oozie.proxy.user parameter. Otherwise, leave it set to oozie.

Add the following parameters and values to the config.properties file:

Parameter	Value
authentication.superuser.provider	`hdp-kerberos`. This should match the `authentication.kerberos.id` value.
authentication.kerberos.id	`hdp-kerberos`
authentication.kerberos.principal	Set to the Kerberos principal. This should be a service principal.
authentication.kerberos.password	Set to the Kerberos password. Set either the password or the keytab, not both.
authentication.kerberos.keytabLocation	Set to the Kerberos keytab location. Set either the password or the path to the keytab, not both.
authentication.kerberos.class	Set to `org.pentaho.di.core.auth.KerberosAuthenticationProvider`
authentication.provider.list	Set to `authentication.kerberos`
activator.classes	Set to `org.pentaho.hadoop.shim.common.authorization.EEAuthActivator`
java.system.hdp.version	HDP Version. For HDP 2.2, this is 2.2.0.0-2041

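Your configuration should look similar to the following example. The example shows both a password and a keytab location for illustration; in your file, set only one of the two: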

authentication.superuser.provider=hdp-kerberos
authentication.kerberos.id=hdp-kerberos
authentication.kerberos.principal=exampleUser@EXAMPLE.COM
authentication.kerberos.password=MyPassword
authentication.kerberos.keytabLocation=C:\kerberos\MyKeytab
authentication.kerberos.class=org.pentaho.di.core.auth.KerberosAuthenticationProvider
authentication.provider.list=authentication.kerberos
activator.classes=org.pentaho.hadoop.shim.common.authorization.EEAuthActivator

Comment out the following parameters in the SECURITY CONFIGURATIONS section:

pentaho.authentication.default.kerberos.keytabLocation
pentaho.authentication.default.kerberos.password
pentaho.authentication.default.mapping.impersonation.type
pentaho.authentication.default.mapping.server.credentials.kerberos.principal
pentaho.authentication.default.mapping.server.credentials.kerberos.keytabLocation
pentaho.authentication.default.mapping.server.credentials.kerberos.password

Save and close the file.
If you are on a Windows machine, perform the following additional steps to update the CATALINA_OPTS environment variable in the start-pentaho.bat file:
1. Navigate to the server/pentaho-server directory and open the start-pentaho.bat file with any text editor.
2. Set the CATALINA_OPTS environment variable to the location of the krb5.conf or krb5.ini file on your system, as shown in the following example:
```
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
```
3. Save and close the file.
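
The rules in the Kerberos parameter table earlier in this procedure (a principal is required, the superuser provider must match the Kerberos ID, and exactly one of password or keytab is set) can be checked mechanically. The following Python sketch works on a dictionary of already parsed properties; the function name kerberos_credential_problems is hypothetical.

```python
def kerberos_credential_problems(props: dict) -> list:
    """Check the Kerberos entries of a parsed config.properties mapping.

    props maps parameter names to values. Returns a list of human-readable
    problems; an empty list means the checks below all passed.
    """
    problems = []
    if not props.get("authentication.kerberos.principal"):
        problems.append("authentication.kerberos.principal is not set")

    has_password = bool(props.get("authentication.kerberos.password"))
    has_keytab = bool(props.get("authentication.kerberos.keytabLocation"))
    if has_password == has_keytab:  # both set, or neither set
        problems.append("set exactly one of password or keytabLocation")

    superuser = props.get("authentication.superuser.provider")
    if not superuser or superuser != props.get("authentication.kerberos.id"):
        problems.append(
            "authentication.superuser.provider should match authentication.kerberos.id"
        )
    return problems
```

The check covers only the relationships stated in the table; it cannot tell whether the principal or keytab is valid on your KDC.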

Edit HBase site XML file

Edit the location of the temporary directory in the hbase-site.xml file to create an HBase local storage directory.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the hbase-site.xml file.
Add the following value:

Parameter	Value
hbase.tmp.dir	/tmp/hadoop/hbase

Save and close the file.
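
In the file itself, each table row becomes a property element with a name and a value child. The following Python sketch adds or updates one such property in the text of a *-site.xml file, assuming the usual configuration root element. The function name set_site_property is hypothetical, and the sketch works on strings so that nothing is written to disk.

```python
import xml.etree.ElementTree as ET


def set_site_property(xml_text: str, name: str, value: str) -> str:
    """Add or update one <property> in Hadoop *-site.xml content.

    Takes and returns a string. Note that ElementTree does not preserve
    XML comments from the original text.
    """
    root = ET.fromstring(xml_text)  # expects a <configuration> root element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No existing property with this name: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_site_property("<configuration></configuration>",
                            "hbase.tmp.dir", "/tmp/hadoop/hbase"))
```

The same helper applies to the other *-site.xml files in this step, because they share the property layout.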

Edit Hive site XML file

Verify that the following parameter is set in the hive-site.xml file:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the hive-site.xml file.
Add the following value:

Parameter	Value
hive.metastore.uris	Set this to the location of your Hive metastore.

Save and close the file.

Edit Mapred site XML file

Edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the mapred-site.xml file.

Add the following values:

Parameter	Value
mapreduce.jobhistory.address	Set this to the host name and port of the MapReduce JobHistory Server, which serves the job history logs.
mapreduce.application.classpath	Add classpath information. Here is an example: <property> <name>mapreduce.application.classpath</name> <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value> </property>
mapreduce.application.framework.path	Set the framework path. Here is an example: <property> <name>mapreduce.application.framework.path</name> <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value> </property>

Verify the mapreduce.app-submission.cross-platform property is in the mapred-site.xml file. If it is not in the file, add it as follows.

Parameter	Value
mapreduce.app-submission.cross-platform	Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms. <property> <name>mapreduce.app-submission.cross-platform</name> <value>true</value> </property>

Save and close the file.
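
After editing, you can confirm that the properties from this procedure are present. The following Python sketch reads the text of mapred-site.xml and reports missing entries; the function names read_site_properties and mapred_problems are hypothetical, and the check does not validate the classpath or framework path values themselves.

```python
import xml.etree.ElementTree as ET

# Property names from the two tables above.
REQUIRED_MAPRED_PROPERTIES = [
    "mapreduce.jobhistory.address",
    "mapreduce.application.classpath",
    "mapreduce.application.framework.path",
    "mapreduce.app-submission.cross-platform",
]


def read_site_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> in *-site.xml content."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.findall("property")
    }


def mapred_problems(xml_text: str) -> list:
    """List required properties that are missing or set incorrectly."""
    props = read_site_properties(xml_text)
    problems = [
        f"{name} is missing"
        for name in REQUIRED_MAPRED_PROPERTIES
        if name not in props
    ]
    # Only flag the value when the property exists but is not "true".
    if props.get("mapreduce.app-submission.cross-platform", "true") != "true":
        problems.append("mapreduce.app-submission.cross-platform should be true")
    return problems
```

Run the check against the shim copy of the file, not the copy on the cluster, because the shim copy is the one that Pentaho reads.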

Edit YARN site XML file

Verify that the following parameters are set in the yarn-site.xml file.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdpxx directory and open the yarn-site.xml file.

Add these values:

Parameter	Value
yarn.application.classpath	Add the classpaths needed to run YARN applications, as shown in the following example: <property> <name>yarn.application.classpath</name> <value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value> </property> Use commas to separate multiple paths.
yarn.resourcemanager.hostname	Update the hostname in your environment or use the default: sandbox.hortonworks.com
yarn.resourcemanager.address	Update the hostname and port for your environment.
yarn.resourcemanager.admin.address	Update the hostname and port for your environment.

Save and close the file.

Connect to a Hadoop cluster with the PDI client

Once you have set up your shim, you must make it active, then configure and test the connection to the cluster. For details on setting up the connection, see the article Connect to a Hadoop Cluster with the PDI Client.

Connect other Pentaho components to the Hortonworks cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in these components involves two tasks:

Setting the active shim on PRD, PME, and the Pentaho Server
Configuring and testing the cluster connections

Set the active shim on PRD, PME, and Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server, PRD, and PME.

Procedure

Stop the component.
Locate the pentaho-big-data-plugin directory for your component.
Navigate to the hadoop-configurations directory and note the directory name of the shim you want to make active.
Navigate to the pentaho-big-data-plugin directory and open the plugin.properties file.
Set the active.hadoop.configuration property to the directory name of the shim you want to make active. Here is an example:
```
active.hadoop.configuration=hdp24
```
Save and close the plugin.properties file.
Restart the component.
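
If you maintain several components, the property change can be scripted. The following Python sketch rewrites the active.hadoop.configuration line in the text of a plugin.properties file and leaves every other line untouched. The function name set_active_shim is hypothetical; stop the component before you write the result back to disk.

```python
def set_active_shim(properties_text: str, shim_dir_name: str) -> str:
    """Return plugin.properties content with active.hadoop.configuration set.

    Replaces the existing line when present, otherwise appends a new one.
    All other lines, including comments, are kept unchanged.
    """
    key = "active.hadoop.configuration"
    new_line = f"{key}={shim_dir_name}"
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Match "key=" or "key =" but not longer keys that share the prefix.
        if stripped.startswith(key) and stripped[len(key):].lstrip().startswith("="):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(set_active_shim("active.hadoop.configuration=hdp23\n", "hdp24"), end="")
```

The value must be the name of a directory that exists under hadoop-configurations for that component.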

Create and test connections

Connection tests appear in the following table.

Component	Test
Pentaho Server for DI	Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA	Create a connection to the cluster in the Data Source Wizard.
PME	Create a connection to the cluster in PME.
PRD	Create a connection to the cluster in PRD.

Once you have connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from computers that have been properly configured to connect to the cluster.

Here is what they need to connect:

Hadoop Distribution and version of the cluster
HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
Oozie URL (if used)
Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use. Here's a more detailed list of information that your users might need from you.

Notes

The following notes are special topics for HDP.

HDP 3.1 notes

The following note addresses issues related to HDP 3.1.

Using the 3.0 shim for HDP 3.1 clusters

You can use the HDP 3.0 shim to connect to your HDP 3.1 cluster by updating the PDI config.properties file.

Perform the following steps to update your java.system.hdp.version shim configuration parameter to HDP 3.1:

Procedure

On your HDP cluster, use the hdp-select command to determine the full version of your cluster, such as '3.1.0.0-78'.
In the Pentaho distribution, open the config.properties file located in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp30 directory.
Change the java.system.hdp.version parameter from the existing version to the full version of your cluster, which you obtained by running the hdp-select command in Step 1. For example, the existing version of '3.0.0.0-1634' might be changed to '3.1.0.0-78'.
Save and close the config.properties file.

Results

Your HDP 3.0 shim will now work with your 3.1 HDP cluster once you restart your PDI client.

HDP 2.5 notes

The following notes address issues with HDP 2.5.

Sqoop support

If you receive the error message Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter while trying to use the split-by option in the Sqoop Import job entry with the HDP 2.5 shim, perform the following steps to set the org.apache.sqoop.splitter.allow_text_splitter property to true:

Procedure

Open your KJB file that contains a Sqoop Import entry in the PDI client.
Double-click the Sqoop Import entry to access the Sqoop Import property dialog box.
Click the Advanced Options link in the lower left corner of the dialog box.
In the Custom tab, add the Dorg.apache.sqoop.splitter.allow_text_splitter argument and set the value to true.
Click OK and save your KJB file.

Results

You should now be able to use the split-by option with the Sqoop Import entry.

Java system HDP version

The config.properties file in the HDP 2.5 shim contains a property and value which is currently set to:

java.system.hdp.version=2.5.0.0-1245

If this property is not set to the exact version number of HDP 2.5 that is running on your cluster, your Pentaho MapReduce jobs will fail.

Pentaho uses this property to locate a folder in the /hdp/apps folder on HDFS that contains the dependencies needed to run the MapReduce job. You can determine the correct value for this property by logging in to the cluster and issuing the following command:

hadoop fs -ls /hdp/apps

The resulting output should be similar to:

Found 1 items
drwxr-xr-x   - hdfs hdfs          0 2017-02-16 11:03 /hdp/apps/2.5.3.0-37

In the example above, the correct setting for the property and version number line is:

java.system.hdp.version=2.5.3.0-37

You must edit your config.properties file to update the java.system.hdp.version property with the exact version number of HDP 2.5 that is running on your cluster.
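
The version lookup above can be automated by parsing the command output. The following Python sketch extracts the version directory from text in the format shown in the sample listing and builds the matching property line. The function names and the regular expression are assumptions made for this example, and the sketch deliberately refuses to choose when the listing contains more than one version directory.

```python
import re

# Matches a trailing path such as /hdp/apps/2.5.3.0-37 at the end of a line.
_APPS_ENTRY = re.compile(r"/hdp/apps/(\d+(?:\.\d+)*-\d+)\s*$")


def hdp_versions_from_listing(listing: str) -> list:
    """Extract version directory names from `hadoop fs -ls /hdp/apps` output."""
    versions = []
    for line in listing.splitlines():
        match = _APPS_ENTRY.search(line)
        if match:
            versions.append(match.group(1))
    return versions


def hdp_version_property(listing: str) -> str:
    """Build the config.properties line; refuses to guess between versions."""
    versions = hdp_versions_from_listing(listing)
    if len(versions) != 1:
        raise ValueError(f"expected exactly one version directory, found {versions}")
    return f"java.system.hdp.version={versions[0]}"


if __name__ == "__main__":
    sample = (
        "Found 1 items\n"
        "drwxr-xr-x   - hdfs hdfs          0 2017-02-16 11:03 /hdp/apps/2.5.3.0-37\n"
    )
    print(hdp_version_property(sample))  # java.system.hdp.version=2.5.3.0-37
```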

HDP 2.4 notes

The following note addresses issues with HDP 2.4.

Simba Spark SQL driver support

If you are using Pentaho 7.0 or later, the HDP 2.4 shim supports the Simba Spark SQL driver. You will need to download, install, and configure the driver to use Simba Spark SQL with the HDP 2.4 shim.

Procedure

Download the Simba Spark SQL driver.
Extract the ZIP file, and then copy the following 3 files into the lib/ directory of the HDP shim:
- SparkJDBC41.jar
- TCLIServiceClient.jar
- QI.jar
In the Database Connection window, select the SparkSQL option.
The default port for the Spark thrift server is 10015.
For secure connections, set the following additional parameters on the JDBC URL through the Options tab:
- KrbServiceName
- KrbHostFQDN
- KrbRealm
For unsecure connections, if your Spark SQL configuration specifies hive.server2.authentication=NONE, then make sure to include an appropriate User Name in the Database Connection window.
Otherwise, the connection is assumed to be NOSASL authentication, which will cause a connection failure after timeout.
Stop and restart the component.

HDP 2.3 notes

The following note addresses issues with HDP 2.3.

Pentaho can connect to an HDP 2.3 cluster using the HDP 2.2 or HDP 2.3 shims

You can use either the HDP 2.2 or the HDP 2.3 shim to connect to an HDP 2.3 cluster:

If you use the HDP 2.2 shim to connect to an HDP 2.3 cluster, only HDP 2.2 functionality is supported.
If you want to support HDP 2.3 functionality, use the HDP 2.3 shim to connect to the HDP 2.3 cluster instead.

Shims can be downloaded from the Pentaho Customer Support Portal.

For troubleshooting cluster and service configuration issues, refer to Big Data issues.
