
Set up Pentaho to connect to a MapR cluster

Before you begin

Before you connect Pentaho to a MapR cluster, you need to do a few things.

Procedure

  1. Check the Components Reference to verify that your Pentaho version supports your version of the MapR cluster.

  2. Set up a MapR cluster.

    Pentaho can connect to secured and unsecured MapR clusters:
    1. Configure a MapR cluster.

      See MapR's documentation if you need help.
    2. Install any required services and service client tools.

    3. Test the cluster.

  3. Set up the MapR client.

    1. Install the MapR client, then test to make sure it is properly installed on your computer and is able to connect to and browse your MapR cluster. For more information on how to do this, visit the MapR site. A quick command-line check is sketched after this procedure.

    2. Set the MAPR_HOME environment variable to the installation location of the MapR client.

      Note: If you are installing MapR 4.0.1 on Windows, use version 4.0.1.31009GA or later of the MapR client. If you are using MapR 4.1.0, use version 4.1.0.31175GA of the MapR client. The software can be obtained from MapR.
  4. Read the Notes section to review special configuration instructions for your version of MapR.
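The following is a minimal command-line check for step 3, assuming the MapR client tools are installed and the hadoop command is on your PATH; the exact output depends on your MapR version and cluster layout.

  # Run on the client machine to confirm the MapR client can reach the cluster.
  echo "$MAPR_HOME"      # should print the MapR client install location, for example /opt/mapr
  hadoop fs -ls /        # list the root of the MapR file system to confirm connectivity
  hadoop fs -ls /user    # browse a directory your account should be able to read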

Set up a secured cluster

If you are connecting to a secured MapR cluster, there are a few additional things you need to do.

Procedure

  1. Secure the MapR cluster with Kerberos.

    Pentaho supports Kerberos authentication. You will need to:
    1. Configure Kerberos security on the cluster, including the Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.

    2. Configure the name, data, secondary name, job tracker, and task tracker nodes to accept remote connection requests.

    3. Set up Kerberos for name, data, secondary name, job tracker, and task tracker nodes if you have deployed Hadoop using an enterprise-level program.

    4. Add the user account credential for each Pentaho user that should have access to the Hadoop cluster to the Kerberos database. (A hedged command sketch appears after this procedure.)

      Make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database. Add operating system user accounts if necessary.
      Note: The user account UIDs must be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
  2. Set up Kerberos on your Pentaho computers.

    Instructions for how to do this appear in Set Up Kerberos for Pentaho.
  3. Set up impersonation.

    1. If you will be using impersonation, you will also need to complete the steps in the Use Kerberos with MapR article.

    2. If you plan to use spoofing or impersonation to connect to the MapR client, specify the appropriate User ID (UID), Group ID (GID), and name as indicated in the MapR documentation.

      Note: Make sure that the account you use for spoofing is created on the client and on each node. Each spoofing account should have the same UID and GID as the one on the client.
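The following is a hedged sketch of steps 1.4 and 1.5; the realm, user name, and UID are hypothetical, and the exact commands depend on your KDC and operating system.

  # Hypothetical example: add a Pentaho user to the Kerberos database (run where kadmin is available).
  sudo kadmin.local -q "addprinc pentaho-user@EXAMPLE.COM"   # create the Kerberos principal

  # On each cluster node, create a matching OS account with a UID above min.user.id (often 1000).
  sudo useradd -u 1500 pentaho-user
  id pentaho-user                                            # verify the UID and GID assignment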

Next steps

There are no edits that need to be made to the *-site.xml configuration files on the cluster.

Configure Pentaho component shims

You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the MapR cluster:

  • PDI client (Spoon)
  • Pentaho Server
  • Pentaho Report Designer (PRD)
  • Pentaho Metadata Editor (PME)

As a best practice, configure the shim in the PDI client first. The PDI client has features that will help you test your configuration. Then copy the tested PDI client configuration files to other components, making changes if necessary.

You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from the PDI client.

Note: If you do not plan to connect to the cluster from the PDI client, you can configure the shim in another component first.

Step 1: Locate the Pentaho Big Data plugin and shim directories

Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the locations of this directory, for each component, to complete shim configuration and testing tasks.

Note: In the following table, <pentaho home> in the shim locations for each component is the directory where Pentaho is installed.
  • PDI client: <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin
  • Pentaho Server: <pentaho home>/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
  • Pentaho Report Designer: <pentaho home>/design-tools/report-designer/plugins/pentaho-big-data-plugin
  • Pentaho Metadata Editor: <pentaho home>/design-tools/metadata-editor/plugins/pentaho-big-data-plugin

Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the distribution's version number. The version number does not contain a decimal point. For example, the shim directory named cdh512 is the shim for CDH (Cloudera's Distribution of Apache Hadoop), version 5.12. Here is a list of the shim directory abbreviations; a sample directory listing appears after the list.

  • cdh: Cloudera's Distribution of Apache Hadoop
  • emr: Amazon Elastic MapReduce
  • hdi: Microsoft Azure HDInsight
  • hdp: Hortonworks Data Platform
  • mapr: MapR
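For example, listing the hadoop-configurations directory of a PDI client installation might look like the following; the shim names in the sample output are illustrative, and the shims actually present depend on your Pentaho version and any shims you have downloaded.

  # List the shims currently installed for the PDI client.
  cd <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin
  ls hadoop-configurations
  # sample output (illustrative): cdh512  hdp26  mapr520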

Step 2: Select the correct shim

Although Pentaho often supports one or more versions of a Hadoop distribution, the Pentaho Suite download contains only the latest, Pentaho-certified version of each shim. Other supported versions of shims can be downloaded from the Pentaho Customer Support Portal.
Note: Before you begin, verify that the shim you want is supported by your version of Pentaho, as shown in the Components Reference.

Procedure

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories.

    If the shim you want to use is already there, you can go to Step 3: Copy the configuration files from cluster to shim.
  2. On the Customer Portal home page, sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet.

  3. In the search box, enter the name of the shim you want, then select the shim from the search results.

    (Optional) You can browse the shims by version on the Downloads page.
  4. Read all prerequisites, warnings, and instructions.

  5. On the bottom of the page in the Box widget, click the shim ZIP file to download it.

  6. Unzip the downloaded shim package into the pentaho-big-data-plugin/hadoop-configurations directory.
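The following is a hedged example of step 6 on Linux; the archive name and download location are placeholders for the package you actually downloaded.

  # Unzip the downloaded shim package into the hadoop-configurations directory.
  cd <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations
  unzip ~/Downloads/pentaho-hadoop-shims-maprxx-package.zip -d .
  ls    # a new maprxx shim directory should now be present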

Step 3: Copy the configuration files from cluster to shim

Copying configuration files from the cluster to the shim helps keep key configuration settings in sync with the cluster and reduces troubleshooting errors.

Procedure

  1. Back up the existing MapR shim files in the pentaho-big-data-plugin/hadoop-configurations/maprxx directory.

  2. Copy the following configuration files from the MapR cluster to pentaho-big-data-plugin/hadoop-configurations/maprxx.

    You should overwrite the existing files.
    • hbase-site.xml
    • hdfs-site.xml
    • hive-site.xml
  3. Copy the following configuration files from the MapR cluster to the Hadoop directory under the MapR Client installed on your computer.

    Note: The Windows path to the MapR client is usually C:\opt\mapr\hadoop\hadoop-2.x.x\etc\hadoop. On Linux, the path is usually /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop. A copy-command sketch follows this procedure.
    • core-site.xml
    • mapred-site.xml
    • yarn-site.xml
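The following is a hedged sketch of steps 2 and 3 on Linux, assuming you can reach a cluster node over SSH; the user, host, version numbers, and paths are placeholders for your environment.

  # Step 2: copy the cluster-side config files into the MapR shim directory.
  SHIM_DIR="<pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/maprxx"
  scp user@cluster-node:/opt/mapr/hbase/hbase-1.x.x/conf/hbase-site.xml        "$SHIM_DIR/"
  scp user@cluster-node:/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/hdfs-site.xml "$SHIM_DIR/"
  scp user@cluster-node:/opt/mapr/hive/hive-x.x/conf/hive-site.xml             "$SHIM_DIR/"

  # Step 3: copy the client-side config files into the local MapR client's Hadoop directory.
  CLIENT_HADOOP=/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop
  for f in core-site.xml mapred-site.xml yarn-site.xml; do
    scp "user@cluster-node:/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/$f" "$CLIENT_HADOOP/"
  done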

Step 4: Edit the shim configuration files

You need to verify or change authentication, Oozie, Hive, MapReduce, and YARN settings in these shim configuration files:

  • config.properties
  • mapred-site.xml
  • yarn-site.xml

Edit configuration properties (Windows)

If you are connecting to an unsecured cluster (default), verify that these values are properly set.

Procedure

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/maprxx directory and open the config.properties file.

  2. Add the following values:

    • windows.classpath: This value should match your local MapR client tools installation directory. Set the windows.classpath parameter to include:
      • the Hadoop classpath
      • the Pentaho installation directory path
      • the MapR shim directory path

      Note: The MapR shim might fail to load correctly if the drive letter in the Windows classpath or library path is a capital letter. This is a known issue with MapR software. If this happens, use lower case instead, like this: file:///c:/opt/mapr.
      Note: The value of the windows.classpath parameter should include lib/hadoop2-windows-patch-08072014.jar as the first entry in the string, the Hadoop classpath of the MapR client on the current machine, the full directory path where the MapR shim is located under each Pentaho component, and this entry: file:///c:/opt/mapr/lib. To determine your Hadoop classpath, run the hadoop classpath command and use the values it returns in place of the example values below (see the sketch after this procedure). Convert any directory paths to Windows URL format. The following is an example:

      windows.classpath=lib/hadoop2-windows-patch-08072014.jar,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/lib,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/lib,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/lib,file:///C:/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce,file:///C:/opt/mapr/sqoop/sqoop-1.4.5,file:///C:/opt/mapr/sqoop/sqoop-1.4.5/lib,file:///C:/contrib/capacity-scheduler,file:///C:/opt/Pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr401,file:///C:/opt/Pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr401/lib,file:///C:/opt/mapr/lib

    • windows.library.path: Set this to the MapR library directory, for example:

      windows.library.path=C:\\opt\\mapr\\lib

    • pentaho.oozie.proxy.user: You do not need to set this unless you plan to access the Oozie service through a proxy. If so, add the proxy user's name here.
  3. Save and close the file.
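The following is a hedged helper for building windows.classpath, assuming the MapR client's hadoop command is on your PATH on the Windows machine; review and order the resulting entries as described in the notes above.

  # Print the Hadoop classpath of the local MapR client. Convert each entry to Windows
  # URL format (for example, C:\opt\mapr\... becomes file:///c:/opt/mapr/...) before
  # adding it to windows.classpath in config.properties.
  hadoop classpath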

Edit configuration properties (Linux)

To configure the config.properties file, complete the following steps.

Procedure

  1. Navigate to the pentaho-big-data-plugin/hadoop-configurations/maprxx directory and open the config.properties file.

  2. Add the following values:

    • linux.classpath: Edit this value to match your local MapR client tools installation directory. Set the linux.classpath parameter to include:
      • the Hadoop classpath
      • the Pentaho installation directory path
      • the MapR shim directory path

      The linux.classpath should contain the Hadoop classpath of the MapR client on the current machine, the full directory path where the MapR shim is located under each Pentaho component, and this entry: /opt/mapr/lib. To determine your Hadoop classpath, run the hadoop classpath command and use the values it returns in place of the example values below (see the sketch after this procedure). The following is an example:

      linux.classpath=/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/etc/hadoop,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common/lib,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/common,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/hdfs/lib,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/lib,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/lib,/opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce,/opt/mapr/sqoop/sqoop-1.4.5,/opt/mapr/sqoop/sqoop-1.4.5/lib,/contrib/capacity-scheduler,/opt/Pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr401,/opt/Pentaho/design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/mapr401/lib,/opt/mapr/lib

    • linux.library.path: Set this to the MapR library directory, for example:

      linux.library.path=/opt/mapr/lib

    • pentaho.oozie.proxy.user: You do not need to set this unless you plan to access the Oozie service through a proxy. If so, add the proxy user's name here.
  3. Save and close the file.
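The following is a hedged helper for building linux.classpath, assuming the MapR client's hadoop command is on your PATH; review the output and append the Pentaho shim paths and /opt/mapr/lib as described above before pasting it into config.properties.

  # Print the Hadoop classpath of the local MapR client, then convert the ':' separators
  # into the ',' separators that linux.classpath expects.
  hadoop classpath
  hadoop classpath | tr ':' ','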

Edit Mapred site XML file

Edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Procedure

  1. Navigate to the Hadoop directory in your MapR Client and open the mapred-site.xml file.

  2. Add the following values:

    • mapreduce.jobhistory.address: Set this to the folder where you want to store the job history logs.
    • mapreduce.app-submission.cross-platform: Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms:

      <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
      </property>
  3. Save and close the file.

Edit YARN site XML file

Verify that the following parameters are set in the yarn-site.xml file:

Procedure

  1. Navigate to the Hadoop directory in your MapR Client and open the yarn-site.xml file.

  2. Add the following values:

    • yarn.application.classpath: Verify that the classpath for YARN applications is set, for example:

      <property>
        <name>yarn.application.classpath</name>
        <value>$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$PWD/*:%PWD%/*</value>
      </property>

    • yarn.resourcemanager.hostname: Change to the hostname of the resource manager in your environment.
    • yarn.resourcemanager.address: Change to the hostname and port for your environment.
    • yarn.resourcemanager.admin.address: Change to the hostname and port for your environment. (A sample snippet with placeholder values appears after this procedure.)
  3. Save and close the file.
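For example, the resource manager properties might look like the following; the hostname and ports are placeholders, so verify the actual values against your cluster before using them.

      <!-- Illustrative values only: replace the hostname and ports with the resource
           manager node and ports used in your environment. -->
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager.example.com</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>resourcemanager.example.com:8032</value>
      </property>
      <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>resourcemanager.example.com:8033</value>
      </property>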

Step 5: Set MAPR_HOME

Set the MAPR_HOME environment variable to the installation location of the MapR client, then restart your computer.
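The following is a minimal sketch for Linux, assuming the MapR client is installed in the default /opt/mapr location; adjust the path to match your installation.

  # Persist MAPR_HOME in your shell profile so it survives new sessions.
  echo 'export MAPR_HOME=/opt/mapr' >> ~/.bashrc
  source ~/.bashrc      # reload the profile in the current shell
  echo "$MAPR_HOME"     # verify the variable is set

On Windows, you can set MAPR_HOME as a system environment variable (for example, through System Properties > Environment Variables) before restarting.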

Connect to a Hadoop cluster with the PDI client

Once you have set up your shim, you must make it active, then configure and test the connection to the cluster. For details on setting up the connection, see the article Connect to a Hadoop Cluster with the PDI Client.

Connect other Pentaho components to the MapR cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in these components involves two tasks:

  • Setting the active shim on PRD, PME, and the Pentaho Server
  • Configuring and testing the cluster connections

Set the Active Shim on PRD, PME, and Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server, PRD, and PME.

Procedure

  1. Stop the component.

  2. Locate the pentaho-big-data-plugin directory for your component.

  3. Navigate to the hadoop-configurations directory and note the directory name of the shim you want to make active.

  4. Navigate to the pentaho-big-data-plugin directory and open the plugin.properties file.

  5. Set the active.hadoop.configuration property to the directory name of the shim you want to make active. (A quick verification sketch appears after this procedure.) Here is an example:

    active.hadoop.configuration=mapr410
  6. Save and close the plugin.properties file.

  7. Restart the component.
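As a quick check before restarting the component, you can confirm that the property points at a shim directory that actually exists; the PDI client path below is one example, and the shim name shown is a placeholder.

  # Confirm the active shim setting and that the matching shim directory is present.
  cd <pentaho home>/design-tools/data-integration/plugins/pentaho-big-data-plugin
  grep '^active.hadoop.configuration' plugin.properties   # for example: active.hadoop.configuration=mapr410
  ls hadoop-configurations                                 # the named shim directory should be listed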

Create and test connections

Connection tests appear in the following table:

  • Pentaho Server for DI: Create a transformation in the PDI client and run it remotely.
  • Pentaho Server for BA: Create a connection to the cluster in the Data Source Wizard.
  • PME: Create a connection to the cluster in PME.
  • PRD: Create a connection to the cluster in PRD.

Once you have successfully connected to the cluster and its services, provide connection information to users who need access. Those users can only obtain access from computers that have been properly configured to connect to the cluster.

These users need the following information to connect:

  • Hadoop distribution and version of the cluster
  • Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
  • Oozie URL (if used)
  • Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use. See the Hadoop connection and access information list for a more detailed list of information that your users might need from you.

Notes

The following are special topics for MapR.

Drive letter casing issue (Windows)

The MapR shim might fail to load correctly if the drive letter in the Windows classpath or library path has a capital letter. This is a known issue with MapR software. If this happens, use the lower case instead, like this: file:///c:/opt/mapr.

MapR 6.0 notes

The following notes address issues with MapR 6.0.

Use MapR 6.0 with HBase

Perform the following steps to use HBase with the MapR 6.0 shim:

Procedure

  1. Close all the Pentaho products and stop all HBase services.

  2. On a MapR cluster, open the core-site.xml file.

  3. Add the following property and values in the core-site.xml file:

    <property>
    	<name>hbase.table.namespace.mappings</name>
    	<value>*://hbase</value>
    </property>
    
  4. Save and close the core-site.xml file.

  5. Copy the core-site.xml file to every cluster node. (A copy-command sketch appears after this procedure.)

  6. Copy the core-site.xml file from the MapR cluster to the Hadoop directory under the MapR Client installed on your computer.

    Note: The default Windows path to the MapR client is C:\opt\mapr\hadoop\hadoop-2.x.x\etc\hadoop. The default Linux path is /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop.
  7. Restart all HBase services and Pentaho products.
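The following is a hedged sketch of steps 5 and 6 on Linux; the node names, user, and version numbers are placeholders, and cluster tools such as clush can replace the loop if your environment provides them.

  # Push the edited core-site.xml to every cluster node (step 5), then pull a copy into
  # the local MapR client's Hadoop directory (step 6). Adjust names and paths as needed.
  for node in node1 node2 node3; do
    scp core-site.xml "mapr@$node:/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/"
  done
  scp mapr@node1:/opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/core-site.xml /opt/mapr/hadoop/hadoop-2.x.x/etc/hadoop/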

Use MapR with MapR-DB

An additional setting is required to use MapR with MapR-DB. See the MapR Mapping to HBase Table Namespaces documentation for more information about this setting. Due to MapR limitations, HBase comparators are not supported.

MapR 4.1 notes

The following notes address issues with MapR 4.1.

Impala support note

Pentaho does not support connections to Impala on a secured MapR 4.1 cluster.

For troubleshooting cluster and service configuration issues, refer to Big Data issues.