Pentaho Documentation

Advanced settings for connecting to Google Dataproc

This article explains advanced settings for configuring the Pentaho Server to connect to Google Dataproc.

Before you begin

Before you begin setting up Pentaho to connect to a Google Dataproc cluster, you must perform the following tasks.

Procedure

  1. Check the Components Reference to verify that your Pentaho version supports your version of Google Dataproc.

  2. Prepare to use Google Dataproc by performing the following tasks:

    1. Obtain the required credentials for a Google account and access to the Google Cloud Console.

    2. Obtain the required credentials for Google Cloud Platform, Compute Engine, and Google Dataproc from your system administrator.

  3. Contact your Hadoop administrator to obtain the connection information for the cluster and services that you intend to use. Some of this information may be available from a cluster management tool. You must also supply some of this information to users after you are finished.
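If the Google Cloud SDK is available to you, much of this connection information can also be read directly from the cluster. The sketch below assumes placeholder cluster and region names; `gcloud dataproc clusters describe` prints the cluster's configuration, including the master instance name and zone. The command is echoed here so the sketch has no side effects:

```shell
# Placeholder values; substitute your own cluster name and region.
CLUSTER=my-dataproc-cluster
REGION=us-central1

# Compose the describe command; remove the echo to actually run it.
echo gcloud dataproc clusters describe "$CLUSTER" --region="$REGION"
```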

Create a Dataproc cluster

You can create a Dataproc cluster using several different methods. For more information on setting up your cluster, see the Google Cloud Documentation.
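As one hedged example, a small cluster can be created from the command line with `gcloud dataproc clusters create`. The cluster name, region, and worker count below are placeholder values, and your environment will likely need additional flags; the command is echoed here so the sketch has no side effects:

```shell
# Placeholder values shown for illustration only.
CLUSTER=pentaho-cluster
REGION=us-central1
WORKERS=2

# Compose the create command; remove the echo to actually run it.
echo gcloud dataproc clusters create "$CLUSTER" --region="$REGION" --num-workers="$WORKERS"
```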

Install the Google Cloud SDK on your local machine

Use the Google Cloud Documentation to learn how to install the Google Cloud SDK on your supported platform.
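After installing, you can confirm the SDK is on your PATH and authenticate. A minimal sketch (the commands are echoed here so the example has no side effects):

```shell
# Verify the SDK is installed, then authenticate and pick a default project.
# Remove the echoes to actually run the commands.
echo gcloud version
echo gcloud init
```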

Set command variables

Setting these commonly used command variables makes it easier to run command-line examples on your local machine or in Cloud Shell.

Perform the following steps to set command variables.

Procedure

  1. Export the project using the following example:

    $ export PROJECT=project;export HOSTNAME=hostname;export ZONE=zone
    1. Set the PROJECT variable to your Google Cloud project ID.

    2. Set the HOSTNAME variable to the name of the master node in your Dataproc cluster.

      Note: The master node's name ends with an -m suffix.
    3. Set the ZONE variable to the zone of the instances in your Dataproc cluster.
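Putting the steps above together, a hedged example with placeholder values; the variables can then be substituted into later gcloud commands:

```shell
# Example values only; replace with your own project, master node, and zone.
export PROJECT=my-gcp-project
export HOSTNAME=my-cluster-m    # master node names end with the -m suffix
export ZONE=us-central1-a

# Reuse the variables in later commands, for example
# (echoed here; remove the echo to run it):
echo gcloud compute ssh "$HOSTNAME" --project="$PROJECT" --zone="$ZONE"
```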

Set up a Google Compute Engine instance for PDI

The PDI client must be run from within Google Compute Engine (GCE). Users must be able to connect remotely to the instance using a Virtual Network Computing (VNC) service to see the Gnome desktop and run the PDI client. Because VM instances running on GCE do not publicly expose the ports required to establish a remote desktop connection, you must also create an SSH (Secure Shell) tunnel between the remote PDI client and the local machine.

Perform the following procedures to set up a PDI client instance in the Google Compute Engine and use it as a client instance for Dataproc.

Procedure

  1. In the GCP platform dashboard, navigate to the Compute Engine console.

  2. Navigate from the menu to Compute Engine VM Instances.

    1. Click Create Instance.

    2. Click Advanced options, and then click the Networking tab.

    3. In the Network Tags text box, enter vnc-server.

  3. Install and update a working VNC service for the remote user interface.

  4. Log in to the instance using SSH.

    1. Use a locally installed SSH client command line to access the remote client instance using its external IP address.

      Note: The console displays the external IP address.
    2. Alternatively, in the Compute Engine list of active virtual machines, select SSH from the list next to the virtual machine you want to use.

  5. Update the operating system on the virtual machine.

  6. Install Gnome and VNC.

  7. Create an SSH tunnel from your VNC client machine.

  8. Connect to the VNC.

  9. (Optional) Configure and log in to Kerberos on your client instance.

    If you are using Kerberos, the VM instance running PDI in GCE must be configured with Kerberos to work with a Kerberos-enabled Dataproc cluster. Kerberos must be properly configured and the client machine must be authenticated with the Kerberos controller.
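Steps 7 and 8 above can be sketched as follows. The instance name, zone, and port are assumptions: a VNC server on display :1 conventionally listens on port 5901, and `gcloud compute ssh` can forward that port through the tunnel. The command is echoed here so the sketch has no side effects:

```shell
# Placeholder instance name and zone; port 5901 assumes VNC display :1.
INSTANCE=pdi-client
ZONE=us-central1-a
VNC_PORT=5901

# Forward the VNC port over SSH, then point your VNC viewer at localhost:5901.
# (Echoed here; remove the echo to actually open the tunnel.)
echo gcloud compute ssh "$INSTANCE" --zone="$ZONE" -- -L "$VNC_PORT:localhost:$VNC_PORT"
```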

Results

When successful, you can see a remote desktop with PDI running in the Compute Engine instance. You can use PDI to design and launch jobs and transformations on a cluster created in Google Dataproc.

Edit configuration files for users

Your cluster administrator must download configuration files from the cluster for the applications your teams are using, and then edit them to include Pentaho-specific and user-specific parameters. After editing, provide these modified files to the applicable users, who must copy them into their <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory.

When creating a named connection, the <user-defined connection name> directory is also created. When you set up the named connection, PDI copies these configuration files into that directory. The cluster administrator must provide users with the name to assign the named connection, so that PDI can copy these modified files into that directory.

The following files must be provided to your users:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
  • hive-site.xml
Note: You can obtain the site files from the Dataproc cluster by using SCP (Secure Copy Protocol) to copy them locally.
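As a hedged sketch of that SCP step: on a Dataproc master node the Hadoop site files are typically under /etc/hadoop/conf (hive-site.xml is usually under /etc/hive/conf instead), but verify the paths on your cluster. The command is echoed here so the sketch has no side effects:

```shell
# Placeholder master node name and zone.
HOSTNAME=my-cluster-m
ZONE=us-central1-a

# Copy one site file to the current directory; repeat for the other files.
# (Echoed here; remove the echo to actually copy the file.)
echo gcloud compute scp "$HOSTNAME:/etc/hadoop/conf/core-site.xml" . --zone="$ZONE"
```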

Edit the XML file for MapReduce

If you are using MapReduce, you will need to edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Perform the following steps to edit the mapred-site.xml file.

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.

  2. Add the following value for the mapreduce.app-submission.cross-platform parameter. This property is needed only to run MapReduce jobs on Windows platforms:

    <property>
      <name>mapreduce.app-submission.cross-platform</name>
      <value>true</value>
    </property>
  3. Save and close the file.
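If you also need to set the job history location mentioned in the introduction, that is typically done with the mapreduce.jobhistory.address property. The hostname below is a placeholder (on Dataproc the job history server usually runs on the master node, and 10020 is the default port), so confirm the correct value with your cluster administrator:

```xml
<!-- Placeholder hostname; substitute your cluster's master node name. -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>my-cluster-m:10020</value>
</property>
```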

Connect to a Hadoop cluster with the PDI client

After you have set up the Pentaho Server to connect to a cluster, you must configure and test the connection to the cluster. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.

Connect other Pentaho components to Dataproc

The following sections explain how to create and test a connection to the cluster in the Pentaho Server, Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME). Creating and testing these connections includes the following tasks:

Create and test connections

For each Pentaho component, create and test the connection as described in the following list.

  • Pentaho Server for DI

    Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA

    Create a connection to the cluster in the Data Source Wizard.

  • PME

    Create a connection to the cluster in PME.

  • PRD

    Create a connection to the cluster in PRD.

After you have connected to the cluster and its services properly, provide the connection information to users who need access to the cluster and its services. Those users can only access the cluster on machines that are properly configured to connect to the cluster.

To connect, users need the following information:

  • Hadoop distribution and version of the cluster
  • Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
  • Oozie URL (if used)

Users also require permissions to access the directories they need on HDFS, such as their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use. For a detailed list of information that your users need to use supported Hadoop services, see Hadoop connection and access information list.