Set up Pentaho to connect to a Cloudera cluster
Before you begin
Procedure
Check the Components Reference to verify that your Pentaho version supports your version of the CDH cluster.
Set up a Cloudera cluster.
Pentaho can connect to a CDH cluster:
Configure a CDH cluster. See Cloudera's documentation if you need help.
Install any required services and service client tools.
Test the cluster.
Get the connection information for the cluster and services that you will use from your Hadoop administrator, Cloudera Manager, or from other cluster management tools. You will also need to supply some of this information to users once you are finished.
Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find the dfs.permissions.superusergroup property in the hdfs-site.xml file on your cluster or in Cloudera Manager.
Read the Notes section to review special configuration instructions for your version of CDH.
Set up a secured cluster
Procedure
Configure Kerberos security on the cluster, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.
Configure the name, data, secondary name, job tracker, and task tracker nodes to accept remote connection requests.
If you have deployed CDH using an enterprise-level program, set up Kerberos for name, data, secondary name, job tracker, and task tracker nodes.
Add user account credentials to the Kerberos database for each Pentaho user that needs access to the Hadoop cluster.
Make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database.
Add operating system user accounts if necessary.
Note: The user account UIDs should be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
Set up Kerberos on your Pentaho computers. Instructions for how to do this appear in the article Set Up Kerberos for Pentaho.
Edit configuration files on clusters
Pentaho-specific edits to configuration files on the cluster are described in this section.
Oozie
Procedure
Open the oozie-site.xml file on the cluster.
Add the following lines to the oozie-site.xml file on the cluster, replacing <your_pdi_user_name> with the PDI user name, such as jdoe.
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
  <value>*</value>
</property>
Save and close the file.
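If you need these two properties for several PDI users, a small script can generate them. The following is a minimal sketch, not part of Pentaho or Oozie: the function name oozie_proxyuser_properties is hypothetical, and it only builds the XML text for you to paste into oozie-site.xml.

```python
import xml.etree.ElementTree as ET


def oozie_proxyuser_properties(pdi_user):
    """Return the groups and hosts <property> elements for one PDI user.

    Hypothetical helper: it only builds XML text and does not edit
    oozie-site.xml itself.
    """
    snippets = []
    for suffix in ("groups", "hosts"):
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = (
            f"oozie.service.ProxyUserService.proxyuser.{pdi_user}.{suffix}"
        )
        # "*" is the value used in this article's example.
        ET.SubElement(prop, "value").text = "*"
        snippets.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(snippets)


if __name__ == "__main__":
    print(oozie_proxyuser_properties("jdoe"))
```

The output is unindented XML, one property element per line, which Oozie reads the same way as the indented form.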
Configure Pentaho component shims
You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster:
- PDI client (Spoon)
- Pentaho Server, including Analyzer and Pentaho Interactive Reports.
- Pentaho Report Designer (PRD)
- Pentaho Metadata Editor (PME)
As a best practice, configure the shim in the PDI client first. The PDI client has features that will help you test your configuration. Then copy the tested PDI client configuration files to other components, making changes if necessary.
You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from the PDI client. If you do not plan to connect to the cluster from the PDI client, you can configure the shim in another component first instead.
Step 1: Locate the Pentaho Big Data plugin and shim directories
Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the location of this directory for each component to complete shim configuration and testing tasks.
Components | Location of Pentaho Big Data Plugin Directory |
PDI client | <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin |
Pentaho Server | <pentaho home>/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin |
Pentaho Report Designer | <pentaho home>/design-tools/report-designer/plugins/pentaho-big-data-plugin |
Pentaho Metadata Editor | <pentaho home>/design-tools/metadata-editor/plugins/pentaho-big-data-plugin |
Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the distribution's version number. The version number does not contain a decimal point. For example, the shim directory named cdh512 is the shim for CDH (Cloudera Distribution for Hadoop), version 5.12. Here is a list of the shim directory abbreviations.
Abbreviation | Shim |
cdh | Cloudera's Distribution of Apache Hadoop |
emr | Amazon Elastic Map Reduce |
hdi | Microsoft Azure HDInsight |
hdp | Hortonworks Data Platform |
mapr | MapR |
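The naming rule can be expressed as a short check. This is an illustrative sketch only: describe_shim is a hypothetical helper, and the abbreviation map simply mirrors the table above. The version digits are returned as written, because a name such as cdh512 does not record where the decimal point belongs.

```python
import re

# Abbreviations copied from the table in this article.
SHIM_ABBREVIATIONS = {
    "cdh": "Cloudera's Distribution of Apache Hadoop",
    "emr": "Amazon Elastic Map Reduce",
    "hdi": "Microsoft Azure HDInsight",
    "hdp": "Hortonworks Data Platform",
    "mapr": "MapR",
}


def describe_shim(dirname):
    """Split a shim directory name into (distribution name, version digits)."""
    match = re.fullmatch(r"([a-z]{3,4})(\d+)", dirname)
    if match is None:
        raise ValueError(f"not a shim directory name: {dirname!r}")
    abbreviation, digits = match.groups()
    if abbreviation not in SHIM_ABBREVIATIONS:
        raise ValueError(f"unknown distribution abbreviation: {abbreviation!r}")
    return SHIM_ABBREVIATIONS[abbreviation], digits


if __name__ == "__main__":
    print(describe_shim("cdh512"))
```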
Step 2: Select the correct shim
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories.
If the shim you want to use is already there, you can go to Step 3: Copy the configuration files from cluster to shim.
On the Customer Portal home page, sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet.
In the search box, enter the name of the shim you want, then select the shim from the search results.
(Optional) You can browse the shims by version on the Downloads page.
Read all prerequisites, warnings, and instructions.
At the bottom of the page in the Box widget, click the shim ZIP file to download it.
Unzip the downloaded shim package to the pentaho-big-data-plugin/hadoop-configurations directory.
Step 3: Copy the configuration files from cluster to shim
Copying configuration files from the cluster to the shim helps keep key configuration settings in sync with the cluster and reduces configuration errors. Perform the following steps to copy these configuration files from the cluster to the shim:
Procedure
Back up the CDH shim files in the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory.
Copy the following configuration files from the cluster to the Pentaho shim directory. You should overwrite the existing Pentaho shim files.
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
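The backup and copy steps can be scripted. The sketch below assumes you have already downloaded the cluster's configuration files to a local directory; the function name sync_shim_configs and all paths are hypothetical, and the script is not a Pentaho tool.

```python
import shutil
from pathlib import Path

# The six files listed in this step.
CONFIG_FILES = [
    "core-site.xml",
    "hbase-site.xml",
    "hdfs-site.xml",
    "hive-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
]


def sync_shim_configs(cluster_conf_dir, shim_dir, backup_dir):
    """Back up the shim directory, then overwrite its site files.

    Returns the names of the files that were copied. Files missing from
    cluster_conf_dir are skipped rather than treated as errors.
    """
    cluster_conf_dir = Path(cluster_conf_dir)
    shim_dir = Path(shim_dir)
    backup_dir = Path(backup_dir)

    # copytree raises FileExistsError if backup_dir already exists,
    # which protects an earlier backup from being overwritten.
    shutil.copytree(shim_dir, backup_dir)

    copied = []
    for name in CONFIG_FILES:
        source = cluster_conf_dir / name
        if source.is_file():
            shutil.copy2(source, shim_dir / name)
            copied.append(name)
    return copied
```

A call might look like sync_shim_configs("cluster-conf", "hadoop-configurations/cdhxx", "cdhxx-backup"), where all three paths are placeholders for your own locations.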
Step 4: Edit the shim configuration files
You need to verify or change authentication, Oozie, Hive, MapReduce, and YARN settings in these shim configuration files:
- core-site.xml
- config.properties
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
Edit configuration properties (unsecured cluster)
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the config.properties file.
(Optional) To access the Oozie service through a proxy, add the proxy user name to the pentaho.oozie.proxy.user parameter.
If you are not using a proxy, leave the parameter set to oozie.
Verify that the pentaho.authentication.default.mapping.impersonation.type parameter is set to disabled. If not, change it to disabled.
Save and close the file.
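The two settings checked in this procedure can be verified with a short script. This is a sketch under stated assumptions: the parser is deliberately simplified (it handles only key=value lines and # or ! comments, not line continuations or other Java properties features), and the function names are hypothetical.

```python
def load_properties(text):
    """Parse simple key=value lines; skip blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_unsecured_settings(props):
    """Return a list of problems for an unsecured-cluster config.properties."""
    problems = []
    key = "pentaho.authentication.default.mapping.impersonation.type"
    if props.get(key) != "disabled":
        problems.append(f"{key} should be set to disabled")
    if not props.get("pentaho.oozie.proxy.user"):
        problems.append("pentaho.oozie.proxy.user is empty or missing")
    return problems


if __name__ == "__main__":
    sample = (
        "pentaho.oozie.proxy.user=oozie\n"
        "pentaho.authentication.default.mapping.impersonation.type=disabled\n"
    )
    print(check_unsecured_settings(load_properties(sample)))
```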
Edit configuration properties (secured cluster)
Perform the following steps to add Kerberos information to the config.properties file:
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the config.properties file with any text editor.
If you plan to access the Oozie service through a proxy, add the proxy user's name to the pentaho.oozie.proxy.user parameter. Otherwise, leave it set to oozie.
Add the following parameters and values to the config.properties file:
Parameter | Value |
authentication.superuser.provider | cdh-kerberos. This should match the authentication.kerberos.id value. |
authentication.kerberos.id | cdh-kerberos |
authentication.kerberos.principal | Set to the Kerberos principal. This should be a service principal. |
authentication.kerberos.password | Set to the Kerberos password. Set either the password or the keytab, not both. |
authentication.kerberos.keytabLocation | Set to the Kerberos keytab location. Set either the password or the path to the keytab, not both. |
authentication.kerberos.class | Set to org.pentaho.di.core.auth.KerberosAuthenticationProvider |
authentication.provider.list | Set to authentication.kerberos |
activator.classes | Set to org.pentaho.hadoop.shim.common.authorization.EEAuthActivator |
Your code should look similar to the following example:
authentication.superuser.provider=cdh-kerberos
authentication.kerberos.id=cdh-kerberos
authentication.kerberos.principal=exampleUser@EXAMPLE.COM
authentication.kerberos.password=MyPassword
authentication.kerberos.keytabLocation=C:\kerberos\MyKeytab
authentication.kerberos.class=org.pentaho.di.core.auth.KerberosAuthenticationProvider
authentication.provider.list=authentication.kerberos
activator.classes=org.pentaho.hadoop.shim.common.authorization.EEAuthActivator
Comment out the following parameters in the SECURITY CONFIGURATIONS section:
#pentaho.authentication.default.kerberos.keytabLocation
#pentaho.authentication.default.kerberos.password
#pentaho.authentication.default.mapping.impersonation.type
#pentaho.authentication.default.mapping.server.credentials.kerberos.principal
#pentaho.authentication.default.mapping.server.credentials.kerberos.keytabLocation
#pentaho.authentication.default.mapping.server.credentials.kerberos.password
Save and close the file.
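Two of the rules for the Kerberos parameters are easy to get wrong: the superuser provider must match the Kerberos ID, and only one of password or keytab location should be set. The following sketch checks those rules on a dictionary of already-parsed values; check_kerberos_settings is a hypothetical helper, not part of the shim.

```python
def check_kerberos_settings(props):
    """Return a list of problems found in parsed config.properties values."""
    problems = []

    kerberos_id = props.get("authentication.kerberos.id")
    if not kerberos_id:
        problems.append("authentication.kerberos.id is missing")
    elif props.get("authentication.superuser.provider") != kerberos_id:
        problems.append(
            "authentication.superuser.provider must match authentication.kerberos.id"
        )

    if not props.get("authentication.kerberos.principal"):
        problems.append("authentication.kerberos.principal is missing")

    # Exactly one of password or keytab location should be set.
    has_password = bool(props.get("authentication.kerberos.password"))
    has_keytab = bool(props.get("authentication.kerberos.keytabLocation"))
    if has_password and has_keytab:
        problems.append("set either the password or the keytab location, not both")
    elif not has_password and not has_keytab:
        problems.append("set either the password or the keytab location")

    return problems
```

A file that sets both authentication.kerberos.password and authentication.kerberos.keytabLocation would be flagged by this check, so remove one of the two before saving.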
If you are on a Windows machine, perform the following additional steps to also update the CATALINA_OPTS environment variable in the start-pentaho.bat file:
Navigate to the server/pentaho-server directory and open the start-pentaho.bat file with any text editor.
Set the CATALINA_OPTS environment variable to the location of the krb5.conf or krb5.ini file on your system, as shown in the following example:
set "CATALINA_OPTS=%CATALINA_OPTS% -Djava.security.krb5.conf=C:\kerberos\krb5.conf"
Save and close the file.
Edit Hive site XML file
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the hive-site.xml file.
Add the following value:
Parameter | Value |
hive.metastore.uris | Set this to the location of your hive metastore if it differs from what is on the cluster. |
Save and close the file.
Edit Mapred site XML file
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the mapred-site.xml file.
Verify the mapreduce.jobhistory.address and mapreduce.app-submission.cross-platform properties are in the mapred-site.xml file. If they are not in the file, add them as follows.
Parameter | Value |
mapreduce.jobhistory.address | Set this to the place where job history logs are stored. |
mapreduce.app-submission.cross-platform | Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms. |
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
Save and close the file.
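The verify-then-add step for mapred-site.xml can be sketched with Python's standard XML library. The helper names below are hypothetical, and the sample document stands in for your real file. The sketch adds only the cross-platform property, because the correct value for mapreduce.jobhistory.address depends on your environment.

```python
import xml.etree.ElementTree as ET


def find_property(root, name):
    """Return the <property> element whose <name> matches, or None."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop
    return None


def ensure_property(root, name, value):
    """Add the property if it is absent. Return True if it was added."""
    if find_property(root, name) is not None:
        return False
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return True


if __name__ == "__main__":
    # Stand-in for the contents of mapred-site.xml.
    root = ET.fromstring("<configuration></configuration>")
    ensure_property(root, "mapreduce.app-submission.cross-platform", "true")
    if find_property(root, "mapreduce.jobhistory.address") is None:
        print("mapreduce.jobhistory.address is missing; add it for your cluster")
    print(ET.tostring(root, encoding="unicode"))
```

To apply this to a real file, ET.parse(path) returns a tree whose getroot() can be passed to these helpers, and tree.write(path) saves the result. Note that ElementTree drops XML comments by default when parsing, so keep the backup from Step 3.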
Edit YARN site XML file
Procedure
Navigate to the pentaho-big-data-plugin/hadoop-configurations/cdhxx directory and open the yarn-site.xml file.
Add the following values:
Parameter | Value |
yarn.application.classpath | Add the classpaths you need to run YARN applications. Use commas to separate multiple paths. |
yarn.resourcemanager.hostname | Change to the hostname of the resource manager in your environment. |
yarn.resourcemanager.address | Change to the hostname and port for your environment. |
yarn.resourcemanager.admin.address | Change to the hostname and port for your environment. |
Save and close the file.
Connect to a Hadoop cluster with the PDI client
Once you have set up your shim, you must make it active, then configure and test the connection to the cluster. For details on setting up the connection, see the article Connect to a Hadoop cluster with the PDI client.
Connect other Pentaho components to the Cloudera cluster
These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in these components involves two tasks:
- Set the active shim on PRD, PME, and the Pentaho Server
- Create and test the cluster connections
Set the active shim on PRD, PME, and Pentaho Server
Procedure
Stop the component.
Locate the pentaho-big-data-plugin directory for your component.
Navigate to the hadoop-configurations directory.
Navigate to the pentaho-big-data-plugin directory and open the plugin.properties file.
Set the active.hadoop.configuration property to the directory name of the shim you want to make active. Here is an example:
active.hadoop.configuration=cdh512
Save and close the plugin.properties file.
Restart the component.
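If you maintain several components, the plugin.properties edit can be scripted. The sketch below works on the file's text and is an assumption-laden illustration: set_active_shim is a hypothetical helper, it treats the file as simple key=value lines, and it leaves stopping and restarting the component to you.

```python
def set_active_shim(properties_text, shim_name):
    """Return plugin.properties text with the active shim set to shim_name.

    Replaces an existing uncommented active.hadoop.configuration line,
    or appends one if none is present.
    """
    key = "active.hadoop.configuration"
    lines = properties_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave commented-out settings untouched
        if stripped.partition("=")[0].strip() == key:
            lines[index] = f"{key}={shim_name}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={shim_name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(set_active_shim("active.hadoop.configuration=hdp26\n", "cdh512"))
```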
Create and test connections
Connection tests appear in the following table.
Component | Test |
Pentaho Server for DI | Create a transformation in the PDI client and run it remotely. |
Pentaho Server for BA | Create a connection to the cluster in the Data Source Wizard. |
PME | Create a connection to the cluster in PME. |
PRD | Create a connection to the cluster in PRD. |
Once you have connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from computers that have been properly configured to connect to the cluster.
Here is what they need to connect:
- Hadoop distribution and version of the cluster
- HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
- Oozie URL (if used)
- Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.
They might also need more information depending on the job entries, transformation steps, and services they use. Here's a more detailed list of information that your users might need from you.
Notes
The following are special topics for CDH.
CDH 5.4 notes
The following notes address issues with CDH 5.4.
Simba driver support note
In the Database connection window, you will need to select the Cloudera Impala option. If Impala is secured on your cluster, you also need to supply KrbHostFQDN, KrbServiceName, and KrbRealm in the Options tab.
You will need to install the driver in the shim directory for each Pentaho component (e.g., the PDI client, Pentaho Server, PRD) you want to use.
Procedure
Download the Impala JDBC Connector 2.5.28 for Cloudera Enterprise driver.
Copy the ImpalaJDBC41.jar to the pentaho-big-data-plugin/hadoop-configurations/cdhxx/lib directory.
Stop and restart the component.
CDH 5.3 notes
The following notes address issues with CDH 5.3.
Configuring high availability for CDH 5.3
If you are configuring CDH 5.3 to be used in High Availability mode, we recommend that you use the Cloudera Manager Download Client Configuration feature. The Download Client Configuration feature provides a convenient way to get configuration files from the cluster for a service (such as HBase, HDFS, or YARN). Use this feature to download and unzip the configuration ZIP files to the pentaho-big-data-plugin/hadoop-configurations/cdh53 directory.
For more information on how to do this, see Cloudera documentation: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_client_config.html
For troubleshooting cluster and service configuration issues, refer to Big Data issues.