
Use Kerberos Authentication to Provide Spoon Users Access to Hadoop Cluster

If you use Kerberos to authenticate access to your Hadoop cluster, then with a little extra configuration you can also use Kerberos to authenticate Spoon users who attempt to access the cluster through a step in a transformation. When a user runs a transformation that contains a step that connects to a Hadoop cluster to perform a function, the user's account credential is matched against the credentials in the Kerberos administrative database on the Hadoop cluster. If the credentials match, the Kerberos Key Distribution Center (KDC) grants an authorization ticket and access is granted. If not, the user is not authenticated and the step does not run.

To set up Kerberos authentication to provide Spoon users with access to the Hadoop cluster, you will need to perform several sets of tasks.

Complete Cluster and Client-Node Prerequisites

Make sure that you have completed the following tasks before you move to the next section.

  • Install a Hadoop cluster on one or more Linux servers. The cluster should be running one of the versions of Hadoop listed in the Configuring Pentaho for your Hadoop Distro and Version section of the Pentaho Big Data wiki.
  • Configure the Hadoop cluster with a Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.
  • Make sure the Hadoop cluster, including the name node, data nodes, secondary name node, job tracker, and task tracker nodes, has been configured to accept remote connection requests.
  • Make sure the Kerberos clients have been set up for all data, task tracker, name, and job tracker nodes if you have deployed Hadoop using an enterprise-level program.
  • Install the current version of Spoon on each client machine.
  • Make sure each client machine can use a hostname to access the Hadoop cluster. You should also test to ensure that IP addresses resolve to hostnames using both forward and reverse lookups, as in the sketch after this list.
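
For example, a quick way to check forward and reverse resolution from a client machine is to run nslookup in both directions. The hostname and IP address shown here are placeholders; substitute a node from your own cluster.

# Forward lookup: hostname to IP address (namenode.example.com is a placeholder)
nslookup namenode.example.com
# Reverse lookup: IP address back to hostname (192.168.1.100 is a placeholder)
nslookup 192.168.1.100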

Add Users to Kerberos Database on Hadoop Cluster

For each Spoon user who should have access to the Hadoop cluster, add the user's account credential to the Kerberos database. You only need to do this once.

  1. Log in as root (or a privileged user), to the server that hosts the Kerberos database.
  2. Make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database. Add operating system user accounts if necessary. Note that the user account UIDs must be greater than the minimum user ID value (min.user.id). Usually, the minimum user ID value is set to 1000.
  3. Add user identification to the Kerberos database by completing these steps.
    1. Open a Terminal window, then add the account username to the Kerberos database, like this. The name should match the operating system user account that you verified (or added) in the previous step. If successful, a message appears indicating that the user has been created.
      root@kdc1:~# kadmin.local -q "addprinc <username>"
      ...
      Principal "<user name>@DEV.LOCAL" created.
    2. Repeat for each user you want to add to the database.
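
To confirm that the principals were added, you can list the contents of the Kerberos database. This check is optional, and the exact output depends on your Kerberos distribution.

root@kdc1:~# kadmin.local -q "listprincs"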

Set Up Kerberos Administrative Server and KDC to Start When Server Starts

It is a good practice to start the Kerberos Administrative Server and the KDC when the server boots. One way to do this is to set them up to run as services. This step is optional, but recommended.

  1. If you have not done so already, log into the server that contains the Kerberos Administrative Server and the KDC.
  2. Set the Kerberos Administrative Server to run as a service when the system starts. By default, the name of the Kerberos Administrative Server is kadmin. If you do not know how to do this, check the documentation for your operating system.
  3. Set the KDC to run as a service when the system starts. By default, the name of the KDC is krb5kdc.
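
On a Linux server that uses systemd, for example, the two services can be enabled and started with commands like the following. The unit names vary by distribution; krb5kdc and kadmin are typical on Red Hat-based systems, while Debian-based systems usually use krb5-kdc and krb5-admin-server.

# Enable the KDC and the Kerberos Administrative Server at boot, then start them now
systemctl enable krb5kdc kadmin
systemctl start krb5kdc kadmin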

Make CDH-Specific Cluster Side Configurations

If you are using CDH 5.4, you will need to make one additional change. By default, Oozie jobs are run by the oozie user. But if you use PDI to start an Oozie job, you will need to add the PDI user to the oozie-site.xml file so that the PDI user can execute the program as a proxy user. To do that, add the following two properties to the oozie-site.xml file on the cluster, substituting <your_pdi_user_name> with the PDI user's username, such as jdoe.

<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.<your_pdi_user_name>.hosts</name>
  <value>*</value>
</property>
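
For example, if the PDI user's username is jdoe, the added properties would look like the following (jdoe is only an illustration; use your own PDI username):

<property>
  <name>oozie.service.ProxyUserService.proxyuser.jdoe.groups</name>
  <value>*</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.jdoe.hosts</name>
  <value>*</value>
</property>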

Make HDP-Specific Cluster Side Configurations

If you are using HDP, you might require additional configuration changes depending on your environment.  Before you complete the steps in this section, review the installation and configuration documentation at the HDP documentation website: http://docs.hortonworks.com/.

The instructions in this section are for configuring a test server only.  Adjust the following instructions to meet the needs of your HDP test and production environments.

HDP-specific configuration changes are divided into two groups.

  • Edit HDP Cluster-Side Configuration Files
  • Set Access Permissions

Edit HDP Cluster-Side Configuration Files

Three files should be edited on the cluster: core-site.xml, hdfs-site.xml, and oozie-site.xml.

Edit core-site.xml

Edit the core-site.xml file to specify the appropriate hosts and groups for various proxy users. To edit the file, complete these steps.

  1. On the cluster, open the core-site.xml file in a text editor. By default, it is in the $HADOOP_CONF_DIR directory.
  2. Set the host and group for each proxy user, using the following format as a guide. 
  • hadoop.proxyuser.{username of the proxy user, such as hive or oozie}.hosts = FQHN of the Hadoop manager node (by default)
  • hadoop.proxyuser.{username of the proxy user, such as hive or oozie}.groups = users (by default)

Here is an example. Modify the samplehost, samplecompany, and samplegroup values to match your environment. Note that you might not need all of the properties. Also note that the kinit_user should already be added to any groups that you specify.

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>samplehost.samplecompany.com</value>
</property>
<property>
  <name>hadoop.proxyuser.falcon.hosts</name>
  <value>samplehost.samplecompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.hive.hosts</name>
   <value>samplehost.samplecompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.HTTP.hosts</name>
   <value>samplehost.samplecompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.oozie.hosts</name>
   <value>samplehost.samplecompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.hcat.hosts</name>
   <value>samplehost.samplecompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.root.groups</name>
   <value>samplegroup</value>
</property>
<property>
   <name>hadoop.proxyuser.oozie.groups</name>
   <value>samplegroup</value>
</property>
<property>
   <name>hadoop.proxyuser.HTTP.groups</name>
   <value>samplegroup</value>
</property>
<property>
   <name>hadoop.proxyuser.falcon.groups</name>
   <value>samplegroup</value>
</property>
<property>
   <name>hadoop.proxyuser.hcat.groups</name>
   <value>samplegroup</value>
</property>
<property>
   <name>hadoop.proxyuser.hive.groups</name>
   <value>samplegroup</value>
</property>
  3. Save and close the file.
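
Proxy user settings are read by the NameNode (and, where applicable, the ResourceManager), so the cluster needs to pick up the change. On many Hadoop versions you can refresh the settings without a full restart using commands such as the following; otherwise, restart the affected services.

# Ask the NameNode and ResourceManager to reload proxy user settings
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
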
Edit hdfs-site.xml

To edit the hdfs-site.xml file, complete these steps.

  1. On the cluster, open the hdfs-site.xml file with a text editor.  By default it is in the $HADOOP_CONF_DIR.
  2. Set the value of the dfs.nfs.exports.allowed.hosts property to allow read and write access, like this (note the space between the asterisk "*" and "rw" in the following snippet):
<property>
  <name>dfs.nfs.exports.allowed.hosts</name>
  <value>* rw</value>
</property>
  3. Save and close the file.
Edit oozie-site.xml

To edit the oozie-site.xml file, complete these steps.

  1. On the cluster, open the oozie-site.xml file in a text editor.  By default, it is located in the $OOZIE_CONF_DIR.
  2. Set the hosts and groups for the proxy users, using the following example as a guide. 
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>samplehost.samplecompany.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>samplegroup</value>
</property>
<property>
  <name>hadoop.proxyuser.{kinit_user}.hosts</name>
  <value>samplehost.samplecompany.com</value>
</property>
<property>
  <name>hadoop.proxyuser.{kinit_user}.groups</name>
  <value>samplegroup</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.{kinit_user}.hosts</name>
  <value>samplehost.samplecompany.com</value>
</property>
<property>
  <name>oozie.service.ProxyUserService.proxyuser.{kinit_user}.groups</name>
  <value>samplegroup</value>
</property>
  3. Save, then close the file.
Set Access Permissions

Access permissions must be granted to allow the kinit_user access to the appropriate file systems. The examples in this section are specific to test servers only. Increase security by modifying the instructions and parameters as required for your organization.

Grant kinit_user Access to Hadoop File System

Ensure that the kinit_user has permission to access the hadoop file system and any other directories where access is required.

In a terminal window or shell tool, enter a chmod command to grant the kinit_user access to the appropriate directories, like this:

hadoop fs -chmod -R 777 /user/{kinit_user}

To make this more secure, use a more restrictive value than 777, as in the sketch below.
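
For example, a somewhat more restrictive sketch assigns ownership of the home directory to the kinit_user and removes access for other users. The group name is a placeholder; adjust the owner, group, and mode to fit your security policy.

# Give the kinit_user ownership of its home directory, then restrict access
hadoop fs -chown -R {kinit_user}:{samplegroup} /user/{kinit_user}
hadoop fs -chmod -R 750 /user/{kinit_user}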

Edit YARN ACL, Logs, and Timeline Server Settings

To edit these settings, complete these steps.

  1. In the YARN resource manager configuration, set yarn.acl.enable to either true or false, as needed. 
  2. Set yarn.admin.acl to *.
  3. Set yarn.log-aggregation-enable to false.
  4. Set yarn.timeline-service.enabled to false.
  5. Save and close the resource manager file. A sample yarn-site.xml snippet follows this list.
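
If you prefer to edit yarn-site.xml directly rather than using a management console, the equivalent settings might look like the following snippet. yarn.acl.enable is shown as false here; set it to true or false as needed for your environment.

<property>
  <name>yarn.acl.enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.admin.acl</name>
  <value>*</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>false</value>
</property>
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
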
Grant Privileges to Hive MySQL Database

In a shell or terminal window on the Hadoop cluster, enter the following.

mysql
-- Run this for each Kettle client host, or specify '%' instead of an IP address to allow connections from all hosts.
grant all privileges on *.* to 'hive'@'ip of the kettle client host' with grant option;
exit;
hive
-- You can substitute other databases for 'default'.
GRANT ALL ON DATABASE default TO USER {kinit_user} WITH GRANT OPTION;
Edit the hive-site.xml File to Grant Owner Authorization for Tables

To edit the hive-site.xml file, do these things.

  1. Open the hive-site.xml file in a text editor.
  2. Set the hive.security.authorization.createtable.owner.grants property to ALL.
  3. Save and close the file.
Grant HBase Privileges

To grant HBase privileges, in a shell or terminal window, type the following:

hbase shell
grant '{kinit_user}','RWXCA'
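
To confirm that the grant took effect, you can list the current permissions from within the HBase shell. The exact output format varies by HBase version.

hbase shell
user_permission
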
Configure Spoon Client-Side Nodes
After you have added users to the database and configured the Kerberos admin and KDC to start when the server starts, you are ready to configure each client-side node from which a user might access the Hadoop cluster. Client-side nodes should each have a copy of Spoon already installed. Client-side configuration differs based on your operating system.
Configure Linux and Mac Client Nodes
Install JCE on Linux and Mac Clients
This step is optional. The KDC configuration includes an AES-256 encryption setting. If you want to use this encryption strength, you will need to install the Java Cryptographic Extension (JCE) files.
  1. Download the Java Cryptographic Extension (JCE) for the currently supported version of Java from the Oracle site.
  2. Read the installation instructions that are included with the download.
  3. Copy the JCE jars to the java/lib/security directory where PDI is installed on the Linux client machine.
Configure PDI for Hadoop Distribution and Version on Linux and Mac Clients

To configure PDI to connect to the Hadoop cluster, you'll need to copy Hadoop configuration files from the cluster's name node to the appropriate place in the hadoop-configurations subdirectory.

  1. Back up the core-site.xml, hdfs-site.xml, and mapred-site.xml files that are in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.
  2. Copy the core-site.xml, hdfs-site.xml, and mapred-site.xml from the cluster's name node to this directory on each client: design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.

If you made configuration changes to the core-site.xml, hdfs-site.xml, or mapred-site.xml files previously, you will need to make those changes again. Reference your backed up copies of the files if needed.
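
A minimal sketch of the backup and copy steps on a Linux or Mac client appears below. The shim directory name (cdh54) and the configuration path on the name node (/etc/hadoop/conf) are examples only; substitute the values for your environment.

cd design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh54
# Back up the existing configuration files
cp core-site.xml core-site.xml.orig
cp hdfs-site.xml hdfs-site.xml.orig
cp mapred-site.xml mapred-site.xml.orig
# Copy the files from the cluster's name node (hostname and path are examples)
scp user@namenode.example.com:/etc/hadoop/conf/core-site.xml .
scp user@namenode.example.com:/etc/hadoop/conf/hdfs-site.xml .
scp user@namenode.example.com:/etc/hadoop/conf/mapred-site.xml .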

Download and Install Kerberos Client on Linux and Mac Clients
Download and install a Kerberos client. Check your operating system's documentation for further details on how to do this.
Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Linux and Mac Clients

Modify the Kerberos configuration file to reflect your Realm, KDC, and Admin Server.

  1. Open the krb5.conf file. By default this file is located in /etc/krb5.conf, but it might appear somewhere else on your system.
  2. Add your Realm, KDC, and Admin Server information. The information between the angle brackets (< >) indicates where you should modify the code to match your specific environment.
    [libdefaults]
           default_realm = <correct default realm name>
    	clockskew = 300
    	v4_instance_resolve = false
    	v4_name_convert = {
    		host = {
    			rcmd = host
    			ftp = ftp
    		}
    		plain = {
    			something = something-else
    		}
    	}
    	
    [realms]
    	<correct default realm name>= {
    		kdc=<KDC IP Address, or resolvable Hostname>
    		admin_server=< Admin Server IP Address, or resolvable Hostname>
    	}
    	MY.REALM = {
    		kdc = MY.COMPUTER 
    	}
    	OTHER.REALM = {
    		v4_instance_convert = {
    			kerberos = kerberos
    			computer = computer.some.other.domain
    		}
    	}
    [domain_realm]
    	.my.domain = MY.REALM 
  3. Save and close the configuration file.
  4. Restart the computer.
Synchronize Clock on Linux and Mac Clients

Synchronize the clock on the Linux or Mac client with the clock on the Hadoop cluster. This is important because if the clocks are too far apart, Kerberos will not consider the tickets that are granted to be valid and the user will not be authenticated. The difference between the time on the client clock and the time on the Hadoop cluster clock must not be greater than the value you entered for the clockskew variable in the krb5.conf file when you completed the steps in the Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Linux and Mac Clients task.

Consult your operating system's documentation for information on how to properly set your clock.

Obtain Kerberos Ticket on Linux and Mac Clients

To obtain a Kerberos ticket, complete these steps.

  1. Open a Terminal window and type kinit at the prompt.
  2. When prompted for a password, enter it.
  3. The prompt appears again. To ensure that the Kerberos ticket was granted, type klist at the prompt.
  4. Authentication information appears.
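
The session below illustrates what a successful ticket request might look like. The principal, dates, and ticket cache location are examples only.

$ kinit
Password for jdoe@DEV.LOCAL:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: jdoe@DEV.LOCAL

Valid starting       Expires              Service principal
07/01/2015 09:00:00  07/01/2015 19:00:00  krbtgt/DEV.LOCAL@DEV.LOCAL
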
Configure Windows Client Nodes
Install JCE on Windows Client
This step is optional. The KDC configuration includes an AES-256 encryption setting. If you want to use this encryption strength, you will need to install the Java Cryptographic Extension (JCE) files.
  1. Download the Java Cryptographic Extension (JCE) for the currently supported version of Java from the Oracle site.
  2. Read the installation instructions that are included with the download.
  3. Copy the JCE jars to the java\lib\security directory where PDI is installed.
Configure PDI for Hadoop Distribution and Version on Windows Client

To configure PDI to connect to the Hadoop cluster, you'll need to copy Hadoop configuration files from the cluster's name node to the appropriate place in the hadoop-configurations subdirectory.

  1. Back up the core-site.xml, hdfs-site.xml, and mapred-site.xml files that are in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.
  2. Copy the core-site.xml, hdfs-site.xml, and mapred-site.xml from the cluster's name node to this directory on each client: design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.

If you made configuration changes to the core-site.xml, hdfs-site.xml, or mapred-site.xml files previously, you will need to make those changes again. Reference your backed up copies of the files if necessary.

Download and Install Kerberos Client on Windows Client

Download and install a Kerberos client. We recommend that you use the Heimdal implementation of Kerberos, which can be found here: https://www.secure-endpoints.com/heimdal/.

Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Windows Client
You will need to modify the Kerberos configuration file to reflect the appropriate realm, KDC, and Admin Server.
  1. Open the krb5.conf file. By default this file is located in c:\ProgramData\Kerberos. This location might be different on your system.
  2. Add the appropriate realm, KDC, and Admin Server information. An example of where to add the data appears below.
    [libdefaults]
           default_realm = <correct default realm name>
    	clockskew = 300
    	v4_instance_resolve = false
    	v4_name_convert = {
    		host = {
    			rcmd = host
    			ftp = ftp
    		}
    		plain = {
    			something = something-else
    		}
    	}
    	
    [realms]
    	<correct default realm name>= {
    		kdc=<KDC IP Address, or resolvable Hostname>
    		admin_server=< Admin Server IP Address, or resolvable Hostname>
    	}
    	MY.REALM = {
    		kdc = MY.COMPUTER 
    	}
    	OTHER.REALM = {
    		v4_instance_convert = {
    			kerberos = kerberos
    			computer = computer.some.other.domain
    		}
    	}
    [domain_realm]
    	.my.domain = MY.REALM 
  3. Save and close the configuration file.
  4. Make a copy of the configuration file and place it in the c:\Windows directory. Rename the file krb5.ini. (An example command appears after this list.)
  5. Restart the computer.
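
For example, step 4 can be performed from a Command Prompt run as administrator with a command like the following, assuming the default Heimdal configuration location:

copy "C:\ProgramData\Kerberos\krb5.conf" "C:\Windows\krb5.ini"
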
Synchronize Clock on Windows Client

Synchronize the clock on the Windows client with the clock on the Hadoop cluster. This is important because if the clocks are too far apart, Kerberos will not consider the tickets that are granted to be valid and the user will not be authenticated. The difference between the time on the Windows client clock and the time on the Hadoop cluster clock must not be greater than the value you entered for the clockskew variable in the krb5.conf file when you completed the steps in the Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Windows Client task.

Consult your operating system's documentation for information on how to properly set your clock.

Obtain Kerberos Ticket on Windows Client

To obtain a Kerberos ticket, complete these steps.

  1. Open a Command Prompt window and type kinit at the prompt.
  2. When prompted for a password, enter it.
  3. The prompt appears again. To ensure that the Kerberos ticket was granted, type klist at the prompt.
  4. Authentication information appears.
Make HDP-Specific Client Side Configurations

If you are using HDP 2.2, you will need to make a few additional configuration changes.

Configure PDI for Hadoop Distribution and Version on Windows, Linux, and Mac Clients

Complete these steps.

  1. Back up the hbase-site.xml file that is in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.
  2. Copy hbase-site.xml from the cluster's name node to this directory on each client: design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.

If you made configuration changes to the hbase-site.xml file previously, you will need to make those changes again. Reference your backed up copies of the files if necessary.

Edit HDP Configuration Properties File 

To edit the HDP Configuration Properties File, complete these steps.

  1. On the client, open the config.properties file for HDP in a text editor.
  2. Add the following line to the configuration file:
java.system.hdp.version=2.2.0.0-2041
  3. Save and close the file.
Edit hbase-site.xml

To edit the hbase-site.xml file, complete these steps.

  1. On the client, open the hbase-site.xml file in a text editor.  
  2. Delete the hbase.temp.dir property.
  3. Save, then close the file.
Test Authentication from Within Spoon

To test the authentication from within Spoon, run a transformation that contains a step that connects to a Hadoop cluster. For these instructions to work properly, you should have read and write access to your home directory on the Hadoop cluster.

  1. Start Spoon.
  2. Open an existing transformation that contains a step to connect to the Hadoop cluster. If you don't have one, consider creating something like this.
    1. Create a new transformation.
    2. Drag the Generate Rows step to the canvas, open the step, indicate a limit (the number of rows you want to generate), then enter field information, such as the field name, type, and a value.
    3. Click Preview to ensure that data generates, then click the Close button to save the step.
    4. Drag a Hadoop File Output step onto the canvas, then draw a hop between the Generate Rows and Hadoop File Output steps.
    5. In the Filename field, indicate the path to the file that will contain the output of the Generate Rows step. The path should be on the Hadoop cluster. Make sure that you specify an extension such as txt, select the option to create the parent directory, and select the option to add filenames to the result.
    6. Click the OK button then save the transformation.
  3. Run the transformation. If there are errors, correct them.
  4. When complete, open a Terminal window and view the results of the output file on the Hadoop filesystem. For example, if you saved your file to a file named test.txt, you could type a command like this:
    hadoop fs -cat /user/pentaho-user/test/test.txt