Pentaho Documentation

Use Knox to access Hortonworks

Apache Knox is a gateway security tool that provides perimeter security for the Hortonworks Distribution (HDP) of Hadoop services. Knox provides secure access to the Hadoop components on a cluster. Connecting to a cluster using Knox provides you with a single point of access to connect to Hadoop services, eliminating the need to map to each service separately. If your system administrator has implemented Apache Ranger on the cluster, Pentaho will respect the policies your system administrator has set up.

Here is an example of a Knox deployment:

The PDI client connects to Knox using a user ID and password that is registered in LDAP. Knox then authenticates to the Kerberos Key Distribution Center (KDC) with the PDI client user ID and password. Lastly, Knox authorizes with Ranger and submits the request to the Hadoop cluster.

Figure: Knox environment
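As a rough illustration of the first hop in this flow, the sketch below builds (but does not send) an HTTPS request to a Knox gateway that carries the LDAP user ID and password as HTTP Basic credentials. It assumes the Knox topology accepts HTTP Basic authentication and exposes WebHDFS under the gateway URL. The host name, port, user name, and password are hypothetical placeholders, and this is not necessarily how the PDI client itself talks to Knox.

```python
import base64
import urllib.request


def build_knox_request(gateway_url: str, username: str, password: str,
                       hdfs_path: str = "/") -> urllib.request.Request:
    """Build (but do not send) a WebHDFS LISTSTATUS request routed through Knox."""
    # Assumption: WebHDFS is proxied under <gateway URL>/webhdfs/v1/<path>.
    url = f"{gateway_url.rstrip('/')}/webhdfs/v1{hdfs_path}?op=LISTSTATUS"
    # The LDAP user ID and password travel as HTTP Basic credentials over HTTPS.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical values; substitute the ones from your system administrator.
request = build_knox_request(
    "https://knox.example.com:8443/gateway/MyHDPCluster", "pdi_user", "secret"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. That call succeeds only if the gateway's SSL certificate is trusted by the machine running the script.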

Before you begin

Obtain the following items from your system administrator:

  • Credentials

    Includes the cluster name, gateway URL, username, and password.

  • SSL certificate

    The SSL certificate must be installed by your system administrator. The Knox URL is a secure URL, so an SSL certificate is needed to successfully perform operations using a Knox gateway. See Configure SSL (HTTPS) in the Pentaho User Console and Server for information on SSL.

  • LDAP directory server

    Authentication with Knox is provided by an LDAP directory server, so you must be able to authenticate to an LDAP server. For more information, review the articles Switch to LDAP and LDAP Properties.

Setup

Complete the following processes to set up the Knox server for use with Pentaho:

  1. Display the Knox gateway option
  2. Register a Hadoop cluster
  3. Access cluster resources using a Knox gateway URL

Display the Knox gateway option

You must modify the KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION environment variable in the kettle.properties file to display the Use gateway to connect to the cluster option. This option must be selected to set up the gateway connection to the cluster.

Perform the following steps to set the environment variable:

Procedure

  1. In the PDI client, choose Edit > Edit the kettle.properties file to open the Kettle properties dialog box.

  2. Locate the KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION variable.

  3. Change the KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION value to true, and click OK.

Results

The Use gateway to connect to the cluster option displays in the Hadoop Cluster dialog box, as shown below.

Figure: Hadoop Cluster dialog box
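If you prefer to script this change instead of using the dialog box, the following minimal sketch sets a key in the text of a properties file. It assumes simple `key=value` lines. The helper name `set_property` is illustrative and not part of Pentaho. The file is typically found at `.kettle/kettle.properties` under your home directory, but confirm the location for your installation before writing to it.

```python
def set_property(properties_text: str, key: str, value: str) -> str:
    """Return properties_text with key set to value, replacing or appending it."""
    lines = properties_text.splitlines()
    updated = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and comment lines untouched.
        if not stripped or stripped.startswith(("#", "!")):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            updated = True
    if not updated:
        # The key was not present, so add it at the end of the file.
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


sample = "# Kettle properties\nKETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION=false\n"
print(set_property(sample, "KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION", "true"))
```

The function works on text rather than on a file path, so you can preview the result before saving it. Restart the PDI client after editing the file outside the dialog box so that the new value is read.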

Register a Hadoop cluster

To connect to a cluster through the Knox gateway, you must register a Hadoop cluster in the PDI client (Spoon). Perform the following steps to register the cluster and configure its gateway connection:

Procedure

  1. In the PDI client, double-click your cluster name to open the Hadoop Cluster dialog box.

    Figure: Select Hadoop Cluster dialog box

  2. Select the Use gateway to connect to the cluster option in the Hadoop Cluster dialog box. The dialog box will change to display the gateway connection options.

  3. Enter the gateway URL, username, and password in the Gateway area, then click OK.

    Figure: Gateway connection options

    Note: ZooKeeper is not supported with Knox.

Access cluster resources using a Knox gateway URL

Knox uses a virtual filesystem (VFS) to connect to the cluster, where the cluster resources are accessed through a URL. When you set up Pentaho to connect to a Hortonworks cluster, you created a name for your cluster. Pentaho uses that cluster name in a URL to run your transformations and jobs with Knox. You can use the PDI client to generate this URL for your Hadoop cluster.

Complete the following steps to create the URL to connect to the resources on the cluster:

Procedure

  1. In the PDI client, click File > Open File URL.

    The Open File dialog box displays.
  2. Click the Location drop-down menu and select Hadoop Cluster from the list.

  3. Click the Hadoop Cluster drop-down menu and select your cluster name.

  4. Verify that the Open from Folder text box displays the gateway URL for the cluster in the format hc://<cluster name>.

Results

The files and folders on the cluster display in the Name panel.

Figure: Open File dialog box for Hadoop Cluster connection
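The URL format from step 4 can be sketched as a small helper. Only the `hc://<cluster name>` prefix comes from this article. Appending a file path after the cluster name is an assumption about how resources beneath the root are addressed. The function name and the sample path are hypothetical.

```python
def cluster_vfs_url(cluster_name: str, path: str = "") -> str:
    """Build an hc:// URL for a named Hadoop cluster, optionally with a path."""
    if not cluster_name or "/" in cluster_name:
        raise ValueError("cluster_name must be a non-empty name without slashes")
    # Assumption: resources under the cluster root are addressed by appending the path.
    suffix = "/" + path.lstrip("/") if path else ""
    return f"hc://{cluster_name}{suffix}"


print(cluster_vfs_url("MyHDPCluster"))
print(cluster_vfs_url("MyHDPCluster", "/user/pdi/input.csv"))
```

The cluster name passed in must match the name you gave the cluster when you registered it in the PDI client.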

Hive configuration with Knox

You can configure your Hive database with Knox.

Procedure

  1. Open the connection to your Hive database, or review the article Set Up a Database Connection for instructions on setting up a connection.

  2. In the Database Connection dialog box, select Options in the page panel on the left to display the Parameters panel.

  3. Enter the following parameters and values in the Options section and click OK.

    Parameter      Definition                  Value
    httpPath       Path to database            gateway/MyHDPCluster/hive
    knox           Option to use Knox          true
    transportMode  Connection protocol to use  http
    ssl            Option to use SSL           true

    Figure: Database Connection dialog box

Results

You are now ready to use this connection for any Hive steps.
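For orientation, the sketch below shows how options such as these commonly appear as semicolon-separated settings in a HiveServer2 JDBC URL. How Pentaho assembles its own connection string is not documented here, so treat the result as an illustration only. The host, port, and database name are hypothetical. The `knox` option is left out because it is assumed to be read by Pentaho rather than passed to the Hive driver.

```python
def hive_knox_jdbc_url(host: str, port: int, database: str, options: dict) -> str:
    """Join connection options into a HiveServer2-style JDBC URL."""
    # Session settings follow the database name, separated by semicolons.
    settings = ";".join(f"{name}={value}" for name, value in options.items())
    return f"jdbc:hive2://{host}:{port}/{database};{settings}"


# Values taken from the parameter table in this procedure.
options = {
    "ssl": "true",
    "transportMode": "http",
    "httpPath": "gateway/MyHDPCluster/hive",
}
url = hive_knox_jdbc_url("knox.example.com", 8443, "default", options)
print(url)
```

Replace MyHDPCluster in httpPath with the name your administrator gave the cluster in the Knox configuration.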