
Use Kerberos Authentication to Provide Spoon Users Access to Hadoop Cluster

If you use Kerberos to authenticate access to your Hadoop cluster, with a little extra configuration you can also use Kerberos to authenticate Spoon users who attempt to access the cluster through a step in a transformation. When a user runs a transformation that contains a step that connects to a Hadoop cluster, the user's account credential is matched against the credentials in the Kerberos administrative database on the Hadoop cluster. If the credentials match, the Kerberos Key Distribution Center (KDC) grants an authorization ticket and access is granted. If not, the user is not authenticated and the step does not run.

To set up Kerberos authentication to provide Spoon users with access to the Hadoop cluster, you will need to perform the sets of tasks described in the following sections.

Complete Cluster and Client-Node Prerequisites

Make sure that you have completed the following tasks before you move to the next section.

  • Install a Hadoop cluster on one or more Linux servers. The cluster should be running one of the versions of Hadoop listed in the Configuring Pentaho for your Hadoop Distro and Version section of the Pentaho Big Data wiki.
  • Configure the Hadoop cluster with a Kerberos Realm, Kerberos KDC, and Kerberos Administrative Server.
  • Make sure that the Hadoop cluster nodes, including the name node, data nodes, secondary name node, job tracker, and task tracker nodes, have been configured to accept remote connection requests.
  • Make sure the Kerberos clients have been set up for all data, task tracker, name, and job tracker nodes if you have deployed Hadoop using an enterprise-level program.
  • Install the current version of Spoon on each client machine.
  • Make sure each client machine can use a hostname to access the Hadoop cluster. You should also test to ensure that IP addresses resolve to hostnames using both forward and reverse lookups.

Add Users to Kerberos Database on Hadoop Cluster

Add a user account credential to the Kerberos database for each Spoon user who should have access to the Hadoop cluster. You only need to do this once.

  1. Log in as root (or a privileged user), to the server that hosts the Kerberos database.
  2. Make sure there is an operating system user account on each node in the Hadoop cluster for each user that you want to add to the Kerberos database. Add operating system user accounts if necessary. Note that the user account UIDs must be greater than the minimum user ID value (min.user.id), which is usually set to 1000. (A sketch for checking and adding an account appears after these steps.)
  3. Add user identification to the Kerberos database by completing these steps.
    1. Open a Terminal window, then add the account username to the Kerberos database, like this. The name should match the operating system user account that you verified (or added) in the previous step. If successful, a message appears indicating that the user has been created.
      root@kdc1:~# kadmin.local -q "addprinc <username>"
      ...
      Principal "<user name>@DEV.LOCAL" created.
    2. Repeat for each user you want to add to the database.
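
The details of creating operating system accounts vary by distribution. On a typical Linux node, checking for an existing account and creating one with a UID above min.user.id might look like the following sketch; the username and UID are placeholders for your environment.

    # Check whether the account already exists on this node
    getent passwd <username>

    # If it does not, create it with a UID above min.user.id (usually 1000)
    useradd -u 1500 <username>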

Set Up Kerberos Administrative Server and KDC to Start When Server Starts

It is a good practice to start the Kerberos Administrative Server and the KDC when the server boots. One way to do this is to set them up to run as services. This step is optional, but recommended.

  1. If you have not done so already, log into the server that contains the Kerberos Administrative Server and the KDC.
  2. Set the Kerberos Administrative Server to run as a service when the system starts. By default, the name of the Kerberos Administrative Server is kadmin. If you do not know how to do this, check the documentation for your operating system.
  3. Set the KDC to run as a service when the system starts. By default, the name of the KDC is krb5kdc.
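
On systems that use systemd, enabling both services might look like the following sketch. The unit names below match the defaults mentioned above, but package and service names vary by distribution (for example, krb5-admin-server and krb5-kdc on Debian-based systems), so treat them as assumptions to verify.

    # Enable both services at boot and start them now (unit names vary by distribution)
    systemctl enable krb5kdc kadmin
    systemctl start krb5kdc kadmin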

Configure Spoon Client-Side Nodes

After you have added users to the database and configured the Kerberos admin and KDC to start when the server starts, you are ready to configure each client-side node from which a user might access the Hadoop cluster. Client-side nodes should each have a copy of Spoon already installed. Client-side configuration differs based on your operating system.

Configure Linux and Mac Client Nodes

Install JCE on Linux and Mac Clients

This step is optional. The KDC configuration includes an AES-256 encryption setting. If you want to use this encryption strength, you will need to install the Java Cryptographic Extension (JCE) files.
  1. Download the Java Cryptographic Extension (JCE) for the currently supported version of Java from the Oracle site.
  2. Read the installation instructions that are included with the download.
  3. Copy the JCE jars to the java/lib/security directory where PDI is installed on the Linux client machine.
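
Assuming PDI is installed under /opt/pentaho, copying the two policy jars from the unpacked JCE download might look like this sketch; the paths are examples for your environment.

    # Copy the JCE policy jars into the JRE that PDI uses (example paths)
    cp local_policy.jar US_export_policy.jar \
       /opt/pentaho/design-tools/data-integration/java/lib/security/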

Configure PDI for Hadoop Distribution and Version on Linux and Mac Clients

To configure PDI to connect to the Hadoop cluster, you'll need to copy Hadoop configuration files from the cluster's name node to the appropriate place in the hadoop-configurations subdirectory.

  1. Back up the core-site.xml, hdfs-site.xml, and mapred-site.xml files that are in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.
  2. Copy the core-site.xml, hdfs-site.xml, and mapred-site.xml from the cluster's name node to this directory on each client: design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>. Note: If you made configuration changes to the core-site.xml, hdfs-site.xml, or mapred-site.xml files previously, you will need to make those changes again. Reference your backed up copies of the files if needed.
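
One way to perform the backup and copy is from a Terminal window with scp, as in this sketch; the user, host, remote configuration path, and shim directory are placeholders for your environment.

    # Back up the existing files, then pull the cluster's versions (example paths)
    cd design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<shim directory>
    cp core-site.xml core-site.xml.bak
    cp hdfs-site.xml hdfs-site.xml.bak
    cp mapred-site.xml mapred-site.xml.bak
    scp <user>@<name node>:/etc/hadoop/conf/core-site.xml .
    scp <user>@<name node>:/etc/hadoop/conf/hdfs-site.xml .
    scp <user>@<name node>:/etc/hadoop/conf/mapred-site.xml .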

Download and Install Kerberos Client on Linux and Mac Clients

Download and install a Kerberos client. Check your operating system's documentation for further details on how to do this.
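
As a starting point, the client package is typically krb5-user on Debian-based distributions and krb5-workstation on RHEL-based distributions, while macOS ships with a Kerberos client built in; verify the package name for your system.

    # Debian/Ubuntu
    sudo apt-get install krb5-user

    # RHEL/CentOS
    sudo yum install krb5-workstation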

Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Linux and Mac Clients

Modify the Kerberos configuration file to reflect your Realm, KDC, and Admin Server.

  1. Open the krb5.conf file. By default this file is located in /etc/krb5.conf, but it might appear somewhere else on your system.
  2. Add your Realm, KDC, and Admin Server information. The text between the angle brackets (< >) indicates where you should modify the file to match your specific environment.
    [libdefaults]
        default_realm = <correct default realm name>
        clockskew = 300
        v4_instance_resolve = false
        v4_name_convert = {
            host = {
                rcmd = host
                ftp = ftp
            }
            plain = {
                something = something-else
            }
        }

    [realms]
        <correct default realm name> = {
            kdc = <KDC IP address or resolvable hostname>
            admin_server = <Admin Server IP address or resolvable hostname>
        }
        MY.REALM = {
            kdc = MY.COMPUTER
        }
        OTHER.REALM = {
            v4_instance_convert = {
                kerberos = kerberos
                computer = computer.some.other.domain
            }
        }

    [domain_realm]
        .my.domain = MY.REALM
  3. Save and close the configuration file.
  4. Restart the computer.

Synchronize Clock on Linux and Mac Clients

Synchronize the clock on the Linux or Mac client with the clock on the Hadoop cluster. This is important because if the clocks are too far apart, Kerberos will not consider the tickets that are granted to be valid, and the user will not be authenticated. The difference between the client clock and the Hadoop cluster clock must not exceed the clockskew value (in seconds) that you set in the krb5.conf file in the Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Linux and Mac Clients task.

Consult your operating system's documentation for information on how to properly set your clock.
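
One common approach is to sync against the same NTP server that the cluster uses. Assuming ntpdate or chrony is installed, a one-time sync might look like this sketch; the server name is a placeholder.

    # One-time sync against an NTP server that the cluster also uses
    sudo ntpdate <ntp server>

    # Or, if the client runs chrony
    sudo chronyc makestep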

Obtain Kerberos Ticket on Linux and Mac Clients

To obtain a Kerberos ticket, complete these steps.

  1. Open a Terminal window and type kinit at the prompt.
  2. When prompted for a password, enter it.
  3. The prompt appears again. To ensure that the Kerberos ticket was granted, type klist at the prompt.
  4. Authentication information appears.
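
A successful session might look like the following sketch; the principal, cache location, and ticket lifetimes will differ in your environment.

    $ kinit <username>@DEV.LOCAL
    Password for <username>@DEV.LOCAL:
    $ klist
    Ticket cache: FILE:/tmp/krb5cc_1000
    Default principal: <username>@DEV.LOCAL

    Valid starting     Expires            Service principal
    01/01/15 09:00:00  01/01/15 19:00:00  krbtgt/DEV.LOCAL@DEV.LOCAL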

Configure Windows Client Nodes

Install JCE on Windows Client

This step is optional. The KDC configuration includes an AES-256 encryption setting. If you want to use this encryption strength, you will need to install the Java Cryptographic Extension (JCE) files.
  1. Download the Java Cryptographic Extension (JCE) for the currently supported version of Java from the Oracle site.
  2. Read the installation instructions that are included with the download.
  3. Copy the JCE jars to the java\lib\security directory where PDI is installed.
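
Assuming PDI is installed under C:\Pentaho, copying the two policy jars from the unpacked JCE download might look like this sketch; the paths are examples for your environment.

    rem Copy the JCE policy jars into the JRE that PDI uses (example paths)
    copy local_policy.jar "C:\Pentaho\design-tools\data-integration\java\lib\security"
    copy US_export_policy.jar "C:\Pentaho\design-tools\data-integration\java\lib\security"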

Configure PDI for Hadoop Distribution and Version on Windows Client

To configure PDI to connect to the Hadoop cluster, you'll need to copy Hadoop configuration files from the cluster's name node to the appropriate place in the hadoop-configurations subdirectory.

  1. Back up the core-site.xml, hdfs-site.xml, and mapred-site.xml files that are in the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>.
  2. Copy the core-site.xml, hdfs-site.xml, and mapred-site.xml from the cluster's name node to this directory on each client: design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/<directory of the shim that is in your plugin.properties file>. Note: If you made configuration changes to the core-site.xml, hdfs-site.xml, or mapred-site.xml files previously, you will need to make those changes again. Reference your backed up copies of the files if necessary.
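
If you have an scp client such as PuTTY's pscp available, pulling the three files from the name node might look like this sketch; the user, host, remote configuration path, and shim directory are placeholders.

    rem Copy the cluster's configuration files to the shim directory (example paths)
    pscp <user>@<name node>:/etc/hadoop/conf/core-site.xml "C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\<shim directory>"
    pscp <user>@<name node>:/etc/hadoop/conf/hdfs-site.xml "C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\<shim directory>"
    pscp <user>@<name node>:/etc/hadoop/conf/mapred-site.xml "C:\Pentaho\design-tools\data-integration\plugins\pentaho-big-data-plugin\hadoop-configurations\<shim directory>"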

Download and Install Kerberos Client on Windows Client

Download and install a Kerberos client. We recommend that you use the Heimdal implementation of Kerberos, which can be found here: https://www.secure-endpoints.com/heimdal/.

Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Windows Client

You will need to modify the Kerberos configuration file to reflect the appropriate realm, KDC, and Admin Server.
  1. Open the krb5.conf file. By default this file is located in C:\ProgramData\Kerberos, but the location might be different on your system.
  2. Add the appropriate realm, KDC, and Admin Server information. An example of where to add the data appears below.
    [libdefaults]
        default_realm = <correct default realm name>
        clockskew = 300
        v4_instance_resolve = false
        v4_name_convert = {
            host = {
                rcmd = host
                ftp = ftp
            }
            plain = {
                something = something-else
            }
        }

    [realms]
        <correct default realm name> = {
            kdc = <KDC IP address or resolvable hostname>
            admin_server = <Admin Server IP address or resolvable hostname>
        }
        MY.REALM = {
            kdc = MY.COMPUTER
        }
        OTHER.REALM = {
            v4_instance_convert = {
                kerberos = kerberos
                computer = computer.some.other.domain
            }
        }

    [domain_realm]
        .my.domain = MY.REALM
  3. Save and close the configuration file.
  4. Make a copy of the configuration file and place it in the C:\Windows directory. Rename the file krb5.ini. (A sketch of this copy appears after these steps.)
  5. Restart the computer.
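
The copy in step 4 can be done from an elevated Command Prompt, as in this sketch; the paths are the defaults mentioned above.

    rem Copy the Kerberos configuration to the name and location Windows applications expect
    copy "C:\ProgramData\Kerberos\krb5.conf" "C:\Windows\krb5.ini"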

Synchronize Clock on Windows Client

Synchronize the clock on the Windows client with the clock on the Hadoop cluster. This is important because if the clocks are too far apart, Kerberos will not consider the tickets that are granted to be valid, and the user will not be authenticated. The difference between the client clock and the Hadoop cluster clock must not exceed the clockskew value (in seconds) that you set in the krb5.conf file in the Modify Kerberos Configuration File to Reflect Realm, KDC, and Admin Server on Windows Client task.

Consult your operating system's documentation for information on how to properly set your clock.
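
On Windows, the built-in w32tm tool can point the client at the NTP server that the cluster uses and force a sync, as in this sketch; the server name is a placeholder, and the commands should be run from an elevated Command Prompt.

    rem Point the Windows time service at your NTP server and resync
    w32tm /config /manualpeerlist:<ntp server> /syncfromflags:manual /update
    w32tm /resync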

Obtain Kerberos Ticket on Windows Client

To obtain a Kerberos ticket, complete these steps.

  1. Open a Command Prompt window and type kinit at the prompt.
  2. When prompted for a password, enter it.
  3. The prompt appears again. To ensure that the Kerberos ticket was granted, type klist at the prompt.
  4. Authentication information appears.
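
A successful session looks much like the Linux example; the commands below are a sketch with a placeholder principal. Note that Windows also ships its own klist.exe for logon tickets, so make sure the klist you run is the one from your Kerberos client.

    kinit <username>@DEV.LOCAL
    klist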

Test Authentication from Within Spoon

To test the authentication from within Spoon, run a transformation that contains a step that connects to a Hadoop cluster. For these instructions to work properly, you should have read and write access to your home directory on the Hadoop cluster.

  1. Start Spoon.
  2. Open an existing transformation that contains a step to connect to the Hadoop cluster. If you don't have one, consider creating something like this.
    1. Create a new transformation.
    2. Drag the Generate Rows step to the canvas, open the step, specify a limit (the number of rows you want to generate), then enter field information, such as the field name, type, and a value.
    3. Click Preview to ensure that data generates, then click the Close button to save the step.
    4. Drag a Hadoop File Output step onto the canvas, then draw a hop between the Generate Rows and Hadoop File Output steps.
    5. In the Filename field, indicate the path to the file that will contain the output of the Generate Rows step. The path should be on the Hadoop cluster. Specify an extension such as txt, and select the options to create the parent directory and to add filenames to the result.
    6. Click the OK button then save the transformation.
  3. Run the transformation. If there are errors, correct them.
  4. When the transformation completes, open a Terminal window and view the output file on the Hadoop file system. For example, if you named your output file test.txt, you could type a command like this:
    hadoop fs -cat /user/pentaho-user/test/test.txt