Connect to Azure HDInsight


This article explains how to connect the Pentaho Server to a Microsoft Azure HDInsight cluster. Pentaho supports both the HDFS file system and the WASB (Windows Azure Storage BLOB) extension for Azure HDInsight. The WASB file system is the default file system for Azure HDInsight.

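The following tasks make up the process for connecting an Azure HDInsight Hadoop cluster to the Pentaho Server:

Edit cluster configuration files
Configure Pentaho component shims
Connect to a Hadoop cluster with the PDI client
Connect other Pentaho components to the Azure HDInsight cluster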

After creating a connection, we suggest that you test it. If you are not able to connect, refer to the Troubleshooting section. To ensure a good connection, we recommend performing a few tasks before you begin the connection process.

Before you begin

Before you begin, you will need to perform the following tasks:

Procedure

Check the Components Reference to verify that your Pentaho version supports your version of the Azure HDInsight cluster.
Set up an Azure HDInsight cluster.
Pentaho can connect to secured and unsecured Azure HDInsight clusters:
1. Configure an Azure HDInsight cluster.
  See the Microsoft Azure documentation if you need help.
2. Install any required services and service client tools.
3. Test the cluster.
Ask your Hadoop Administrator for connection information to the cluster and related services.
This information may also be available in your cluster management tools or Ambari.
Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property.
This property is located in the hdfs-site.xml file on your cluster or in the cluster management application.

Edit cluster configuration files

Although Pentaho often supports more than one version of a Hadoop distribution, the shim for Azure HDInsight is not included in the Pentaho Suite download, and must be downloaded separately from the Pentaho Customer Support Portal. You will need to sign in to view the available downloads.

The Oozie user runs Oozie jobs by default. However, if you use PDI to start an Oozie job, you must add the PDI user to the oozie-site.xml file on the cluster so that the PDI user can run the job by proxy. To use the Oozie service, complete the following steps:

Procedure

Open the oozie-site.xml file on the cluster.

Add the following lines to the oozie-site.xml file on the cluster, replacing <your_pdi_user_name> with the PDI user name, such as jdoe.

<property>
    <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
    <value>*</value>
</property>
<property>
    <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
    <value>*</value>
</property>

Save and close the file.

Create or edit an existing job.properties file to point to your hostnames and folders in the cluster.

Here is an example:

nameNode=wasb://<Your server name>@eastorageacct2.blob.core.windows.net/
jobTracker=hn1-pentah.trhf3tzg3kne3osozhcc4hsv1h.cx.internal.cloudapp.net:8050
queueName=default
examplesRoot=examples
oozie.wf.application.path=<Cluster folder name>/oozie/examples/apps/map-reduce
outputDir=<Your working directory>/oozie/output
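
As a rough illustration of the job.properties example above, the following Python sketch renders the same six keys, one per line. The function name `render_job_properties` and its parameter names are hypothetical and not part of Pentaho or Oozie; only the property keys and the `default` and `examples` values come from the example.

```python
def render_job_properties(name_node, job_tracker, app_path, output_dir,
                          queue_name="default", examples_root="examples"):
    """Render the key=value lines of an Oozie job.properties file, one per line."""
    entries = [
        ("nameNode", name_node),
        ("jobTracker", job_tracker),
        ("queueName", queue_name),
        ("examplesRoot", examples_root),
        ("oozie.wf.application.path", app_path),
        ("outputDir", output_dir),
    ]
    for key, value in entries:
        # Reject empty or multi-line values, which would corrupt the file.
        if not value or "\n" in value:
            raise ValueError(f"{key} must be a non-empty single-line value")
    return "\n".join(f"{key}={value}" for key, value in entries) + "\n"
```

Writing the returned text to a file named job.properties gives one property per line, which is easier to review than a single wrapped line.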

Configure Pentaho component shims

You must configure the shim in each of the following Pentaho components, on each computer from which Pentaho will be used to connect to the cluster:

PDI client (Spoon)
Pentaho Server, including Analyzer and Pentaho Interactive Reports.

As a best practice, configure the shim in the PDI client first. The PDI client has features that will help you test your configuration. Then copy the tested PDI client configuration files to other components, making changes if necessary.

You can also opt to go through these instructions for each Pentaho component, and not copy the shim files from the PDI client. If you do not plan to connect to the cluster from the PDI client, you can configure the shim in another component first instead.

Step 1: Locate the Pentaho Big Data plugin and shim directories

Shims and other parts of the Pentaho Adaptive Big Data Layer are in the Pentaho Big Data Plugin directory. The path to this directory differs by component. You need to know the location of this directory in each component to complete the shim configuration and testing tasks.

In the following table, <pentaho home> in the shim locations for each component is the directory where Pentaho is installed:

Components	Location of Pentaho Big Data Plugin Directory
PDI client	`<pentaho home>`/design-tools/data-integration/plugins/pentaho-big-data-plugin
Pentaho Server	`<pentaho home>`/server/pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin

Shims are located in the pentaho-big-data-plugin/hadoop-configurations directory. Shim directory names consist of a three- or four-letter Hadoop distribution abbreviation followed by the distribution's version number, written without a decimal point. For example, the shim directory named cdh59 is the shim for CDH (Cloudera Distribution for Hadoop), version 5.9. The following table lists the shim directory abbreviations for Hadoop distributions:

Abbreviation	Shim
cdh	Cloudera's Distribution of Apache Hadoop
emr	Amazon Elastic Map Reduce
hdi	Microsoft Azure HDInsight
hdp	Hortonworks Data Platform
mapr	MapR

Note: You will not see the hdi directory until you have unpacked the download as outlined in Step 2.
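
The naming rule for shim directories can be sketched in Python as follows. This is only an illustration: `parse_shim_dir` and `SHIM_DISTRIBUTIONS` are hypothetical names, and the version is returned as the raw digits because the directory name alone does not say where the decimal point belongs.

```python
import re

# Abbreviations taken from the table above.
SHIM_DISTRIBUTIONS = {
    "cdh": "Cloudera's Distribution of Apache Hadoop",
    "emr": "Amazon Elastic Map Reduce",
    "hdi": "Microsoft Azure HDInsight",
    "hdp": "Hortonworks Data Platform",
    "mapr": "MapR",
}

# Three or four lowercase letters followed by the version digits.
_SHIM_NAME = re.compile(r"([a-z]{3,4})(\d+)")


def parse_shim_dir(name):
    """Split a shim directory name such as 'cdh59' into (distribution, version digits)."""
    match = _SHIM_NAME.fullmatch(name)
    if match is None or match.group(1) not in SHIM_DISTRIBUTIONS:
        raise ValueError(f"not a recognized shim directory name: {name!r}")
    abbreviation, digits = match.groups()
    return SHIM_DISTRIBUTIONS[abbreviation], digits
```

For example, `parse_shim_dir("hdi35")` identifies the directory as a Microsoft Azure HDInsight shim with version digits `35`.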

Step 2: Select the correct shim

Although Pentaho often supports one or more versions of a Hadoop distribution, the download of the Pentaho Suite only contains the latest, supported, Pentaho-certified version of the shim. The other supported versions of shims can be downloaded from the Pentaho Customer Support Portal.

Before you begin, verify that the shim you want is supported by your version of Pentaho shown in the Components Reference.

Procedure

If you have not downloaded the Azure HDInsight shim, go to the Customer Portal. Sign in using the Pentaho support user name and password provided to you in your Pentaho Welcome Packet.
In the search box, enter the name of the shim you want. Select the shim from the search results. Optionally, you can browse the shims by version on the Downloads page.
Read all prerequisites, warnings, and instructions. On the bottom of the page in the Box widget, click the shim ZIP file to download it.
Navigate to the pentaho-big-data-plugin/hadoop-configurations directory to view the shim directories.
Unzip the downloaded shim package into the pentaho-big-data-plugin/hadoop-configurations directory.

Step 3: Copy the configuration files from cluster to shim

Copying configuration files from the cluster to the shim helps keep key configuration settings in sync with the cluster and reduces configuration errors.

Perform the following steps to copy these configuration files from the cluster to the shim:

Procedure

Back up the existing HDI shim files in the pentaho-big-data-plugin/hadoop-configurations/hdixx directory.
Copy the following configuration files from the HDI cluster to pentaho-big-data-plugin/hadoop-configurations/hdixx (overwriting the existing files):
- core-site.xml
- hbase-site.xml
- hdfs-site.xml
- hive-site.xml
- mapred-site.xml
- yarn-site.xml
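
The backup-then-overwrite procedure above could be scripted along these lines. This is a minimal sketch, assuming the cluster's configuration files have already been downloaded to a local directory; the function name `sync_shim_configs` and all three directory parameters are hypothetical.

```python
import shutil
from pathlib import Path

# File names from the list above.
CONFIG_FILES = (
    "core-site.xml",
    "hbase-site.xml",
    "hdfs-site.xml",
    "hive-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
)


def sync_shim_configs(cluster_conf_dir, shim_dir, backup_dir):
    """Back up the shim's current config files, then overwrite them with the cluster's copies.

    Returns the list of file names that were copied from cluster_conf_dir.
    """
    cluster_conf_dir, shim_dir, backup_dir = map(
        Path, (cluster_conf_dir, shim_dir, backup_dir)
    )
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for file_name in CONFIG_FILES:
        source = cluster_conf_dir / file_name
        target = shim_dir / file_name
        if not source.is_file():
            continue  # no cluster copy available; leave the shim file untouched
        if target.is_file():
            shutil.copy2(target, backup_dir / file_name)  # keep the original shim file
        shutil.copy2(source, target)
        copied.append(file_name)
    return copied
```

Files missing from the downloaded cluster directory are skipped rather than treated as errors, so the return value shows which files were actually replaced.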

Step 4: Edit the shim configuration files

You need to verify or change authentication, Oozie, Hive, MapReduce, and YARN settings in the following files:

core-site.xml
config.properties
hbase-site.xml
hive-site.xml
mapred-site.xml
yarn-site.xml

Verify or edit configuration properties

To connect to a cluster, perform the following steps to verify that the proxy user values are properly set.

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the config.properties file.

Add the following values:

Parameter	Values
authentication.superuser.provider	`NO_AUTH`
pentaho.oozie.proxy.user	Set this to a proxy user's name to access the Oozie service through a proxy; otherwise, leave it set to oozie.
java.system.hdp.version	The HDI version. For HDP 2.2, this is 2.2.0.0-2041.

Save and close the file.
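
One way to double-check the config.properties edits above is a small script like the following. It is a sketch only: the parser handles plain key=value lines and is not a full Java properties parser, and the names `parse_properties` and `missing_shim_settings` are hypothetical.

```python
# Keys from the table above.
REQUIRED_KEYS = (
    "authentication.superuser.provider",
    "pentaho.oozie.proxy.user",
    "java.system.hdp.version",
)


def parse_properties(text):
    """Parse simple key=value lines; comment lines start with '#' or '!'."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_shim_settings(text):
    """Return the required keys that are absent or empty in config.properties text."""
    props = parse_properties(text)
    return [key for key in REQUIRED_KEYS if not props.get(key)]
```

An empty list from `missing_shim_settings` means all three parameters have a value; it does not check that the values are correct for your cluster.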

Edit Core site XML file

To use WASB storage, perform the following steps for updating the core-site.xml file:

Procedure

Obtain an unencrypted key from the Azure HDInsight cluster.

Set the following properties in the core-site.xml file:

Parameter	Values
fs.AbstractFileSystem.wasb.impl	<property> <name>fs.AbstractFileSystem.wasb.impl</name> <value>org.apache.hadoop.fs.azure.Wasb</value> </property>
fs.azure.account.key.eastorageacct2.blob.core.windows.net	<property> <name>fs.azure.account.key.eastorageacct2.blob.core.windows.net</name> <value>VR9p2ca4enpOrS2/CVOuwN/5+4eFS7nLjudXwFD21T5wA9yrtAuAJrnmoSbjRYPUSwh8d8HKEGPCuKzv4so99A==</value> </property>
fs.azure.account.keyprovider.eastorageacct2.blob.core.windows.net	<property> <name>fs.azure.account.keyprovider.eastorageacct2.blob.core.windows.net</name> <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value> </property>

Save and close the file.
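
To confirm that the three WASB properties above made it into core-site.xml, a check along these lines may help. This is a sketch under stated assumptions: the helper names are hypothetical, and `storage_account_host` stands for your own storage account host name (eastorageacct2.blob.core.windows.net in the example table).

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for each <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_wasb_settings(xml_text, storage_account_host):
    """List the WASB properties from the table above that are absent or empty."""
    required = (
        "fs.AbstractFileSystem.wasb.impl",
        f"fs.azure.account.key.{storage_account_host}",
        f"fs.azure.account.keyprovider.{storage_account_host}",
    )
    props = read_site_properties(xml_text)
    return [name for name in required if not props.get(name)]
```

The same `read_site_properties` helper works for the other *-site.xml files in the shim directory, since they share the `<property>`, `<name>`, `<value>` layout.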

Edit HBase site XML file

Edit the location of the temporary directory in the hbase-site.xml file to create an HBase local storage directory as follows:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the hbase-site.xml file.
Add the following value:

Parameter	Value
hbase.tmp.dir	/tmp/hadoop/hbase

Save and close the file.

Edit Hive site XML file

Verify that the hive.metastore.uris parameter is set in the hive-site.xml file through the following steps:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the hive-site.xml file.
Add the following value:

Parameter	Value
hive.metastore.uris	Set this to the location of your Hive metastore.

Save and close the file.
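
A quick sanity check for the hive.metastore.uris value might look like the sketch below. It assumes the usual Hive convention of one or more comma-separated `thrift://host:port` entries, which this article does not spell out; the function name `invalid_metastore_uris` is hypothetical.

```python
from urllib.parse import urlsplit


def invalid_metastore_uris(value):
    """Return the comma-separated entries that are not of the form thrift://host:port."""
    bad = []
    for entry in (part.strip() for part in value.split(",")):
        parts = urlsplit(entry)
        try:
            port = parts.port
        except ValueError:  # non-numeric or out-of-range port
            port = None
        # Assumption: a usable entry has the thrift scheme, a host, and a port.
        if parts.scheme != "thrift" or not parts.hostname or port is None:
            bad.append(entry)
    return bad
```

An empty result only means the entries are well formed; it does not confirm that the metastore is reachable.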

Edit HDFS site XML file

Set the dfs.internal.nameservices parameter value in the config.properties file through the following steps:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the config.properties file.
Add the following value:

Parameter	Value
dfs.internal.nameservices	Set the value to your alias name for the HDFS name nodes.

Save and close the file.

Edit Mapred site XML file

Edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms as follows:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the mapred-site.xml file.

Add the following values:

Parameter	Value
mapreduce.jobhistory.address	Set this to the host name and port (host:port) of the MapReduce JobHistory Server, which serves the stored job history logs.
mapreduce.application.classpath	Add classpath information. Here is an example: <property> <name>mapreduce.application.classpath</name> <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value> </property>
mapreduce.application.framework.path	Set the framework path. Here is an example: <property> <name>mapreduce.application.framework.path</name> <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value> </property>
mapreduce.app-submission.cross-platform	Add this property to allow MapReduce jobs to run on either Windows client or Linux server platforms: <property> <name>mapreduce.app-submission.cross-platform</name> <value>true</value> </property>
mapreduce.jobhistory.webapp.address	<property> <name>mapreduce.jobhistory.webapp.address</name> <value>headnodehost:19888</value> </property>

Save and close the file.

Edit YARN site XML file

Verify that the following parameters are set in the yarn-site.xml file:

Procedure

Navigate to the pentaho-big-data-plugin/hadoop-configurations/hdixx directory and open the yarn-site.xml file.

Add the following values:

Parameter	Value
yarn.application.classpath	Add the classpaths needed to run YARN applications. Use commas to separate multiple paths. Here is an example: <property> <name>yarn.application.classpath</name> <value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value> </property>
yarn.resourcemanager.hostname	Update the hostname in your environment or use the default: sandbox.hortonworks.com
yarn.resourcemanager.address	Update the hostname and port for your environment.
yarn.resourcemanager.admin.address	Update the hostname and port for your environment.

Save and close the file.

Connect to a Hadoop cluster with the PDI client

Once you have set up your shim, you must make it active, then configure and test the connection to the cluster. For details on setting up the connection, see the article Connect to a Hadoop cluster with the PDI client.

Connect other Pentaho components to the Azure HDInsight cluster

These instructions explain how to create and test a connection to the cluster in the Pentaho Server. Creating and testing a connection to the other components involves two tasks:

Setting the active shim on the Pentaho Server
Configuring and testing the cluster connections

Set the active shim on the Pentaho Server

Modify the plugin.properties file to set the active shim for the Pentaho Server.

Procedure

Stop the component.
Locate the pentaho-big-data-plugin directory for your component.
Navigate to the hadoop-configurations directory.
Navigate to the pentaho-big-data-plugin directory and open the plugin.properties file.
Set the active.hadoop.configuration property to the directory name of the shim you want to make active. Here is an example:
```
active.hadoop.configuration=hdi35
```
Save and close the plugin.properties file.
Restart the component.
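
If you switch shims often, the plugin.properties edit in this procedure could be scripted roughly as follows. This is a sketch, not a Pentaho utility: `set_active_shim` is a hypothetical name, it works on the file's text rather than the file itself, and you would still stop and restart the component around the change.

```python
import re

_ACTIVE_KEY = "active.hadoop.configuration"

# Matches an uncommented "active.hadoop.configuration=..." line.
_ACTIVE_LINE = re.compile(
    rf"^[ \t]*{re.escape(_ACTIVE_KEY)}[ \t]*=[^\r\n]*", re.MULTILINE
)


def set_active_shim(properties_text, shim_name):
    """Return plugin.properties text with active.hadoop.configuration set to shim_name."""
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", shim_name):
        raise ValueError(f"unexpected shim directory name: {shim_name!r}")

    new_line = f"{_ACTIVE_KEY}={shim_name}"
    if _ACTIVE_LINE.search(properties_text):
        return _ACTIVE_LINE.sub(new_line, properties_text)
    # Key not present: append it, keeping the existing text intact.
    separator = "" if properties_text.endswith("\n") or not properties_text else "\n"
    return f"{properties_text}{separator}{new_line}\n"
```

Read the file, pass its contents through `set_active_shim(text, "hdi35")`, and write the result back; commented-out lines and all other properties are left as they were.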

Create and test connections

Connection tests appear in the following table:

Component	Test
Pentaho Server for DI	Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA	Create a connection to the cluster in the Data Source Wizard.

Once you have connected to the cluster and its services properly, provide connection information to users who need access to the cluster and its services. Those users can only obtain access from computers that have been properly configured to connect to the cluster.

These users need the following information to connect:

Hadoop distribution and version of the cluster
Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
Oozie URL (if used)
Users also require the appropriate permissions to access the directories they need on HDFS. This typically includes their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use. See Hadoop connection and access information list for a more detailed list of information that your users might need from you.

For troubleshooting cluster and service configuration issues, refer to Big Data issues.
