Pentaho Documentation

Advanced settings for connecting to an Amazon EMR cluster

This article explains advanced settings for configuring the Pentaho Server to connect to a working Amazon EMR cluster.

Before you begin

Before you begin setting up Pentaho to connect to an Amazon EMR cluster, you must perform the following tasks.

Procedure

  1. Check the Components Reference to verify that your Pentaho version supports your version of the Amazon EMR cluster.

  2. Prepare your Amazon EMR cluster by performing the following tasks:

    1. Configure an Amazon EC2 cluster.

      See Amazon's documentation if you need help.
    2. Install any required services and service client tools.

    3. Test the cluster.

  3. Install PDI on an Amazon EC2 instance that is within the same Amazon Virtual Private Cloud (VPC) as the Amazon EMR cluster.

    Note: As a best practice, install PDI on your Amazon EC2 instance. Otherwise, you might not be able to read files from or write files to the cluster. To resolve this issue, see Unable to read or write files to HDFS on the Amazon EMR cluster.
  4. Get the connection information for the cluster and services that you intend to use from your Hadoop administrator. Some of this information may be available from a cluster management tool. You also need to supply some of this information to users after you are finished.

  5. Add the YARN user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find this property in the hdfs-site.xml file on your cluster or in the cluster management application.

Edit configuration files for users

Your cluster administrator must download configuration files from the cluster for the applications your teams are using, and then edit them to include Pentaho-specific and user-specific parameters. These modified files must be provided to the users and must be copied to the user's directory: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>.

When the user creates a named connection, this <user-defined connection name> directory is created. When the user sets up the named connection, PDI copies these configuration files into that directory. The cluster administrator must tell the user which name to assign to the named connection, so that PDI can copy the modified files into the matching directory.

The following files must be provided to your users:

  • core-site.xml
  • mapred-site.xml
  • hdfs-site.xml
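
If you stage the modified files by hand, the layout described above can be sketched in a few lines of Python. This is a minimal sketch, not part of Pentaho: the function names, the source directory, and the idea of passing the user's base directory explicitly are assumptions made for illustration. Only the three file names and the .pentaho/metastore/pentaho/NamedCluster/Configs path come from this article.

```python
import shutil
from pathlib import Path

# The three files this article says must be provided to users.
REQUIRED_FILES = ("core-site.xml", "mapred-site.xml", "hdfs-site.xml")


def named_cluster_config_dir(user_dir, connection_name):
    """Build <user_dir>/.pentaho/metastore/pentaho/NamedCluster/Configs/<name>."""
    return (Path(user_dir) / ".pentaho" / "metastore" / "pentaho"
            / "NamedCluster" / "Configs" / connection_name)


def copy_cluster_configs(source_dir, user_dir, connection_name):
    """Copy the edited files into the named connection directory.

    source_dir is a hypothetical folder holding the files the cluster
    administrator has already edited. Raises FileNotFoundError if any
    of the required files is absent, before anything is copied.
    """
    source_dir = Path(source_dir)
    missing = [n for n in REQUIRED_FILES if not (source_dir / n).is_file()]
    if missing:
        raise FileNotFoundError("Missing configuration files: " + ", ".join(missing))

    target = named_cluster_config_dir(user_dir, connection_name)
    target.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(source_dir / n, target / n)) for n in REQUIRED_FILES]
```

For example, copy_cluster_configs("edited-configs", Path.home(), "emr-test") would place the files under the current user's home directory, assuming that is where <username> resolves on your system and that "emr-test" is the name agreed on for the named connection.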

Verify or edit core-site XML file

Note: If you plan to run MapReduce jobs on an Amazon EMR cluster, make sure you have read, write, and execute access to the S3 Buffer directories specified in the core-site.xml file on the EMR cluster.

You must edit the core-site.xml file to add your AWS Access Key ID, your AWS secret access key, and your LZO compression setting.

Perform the following steps to edit your core-site.xml:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the core-site.xml file.

  2. Add the following values:

    Parameter                    Values
    fs.s3.awsAccessKeyId         Value of your S3 AWS Access Key ID.
    <property>
       <name>fs.s3.awsAccessKeyId</name>
       <value>[INSERT YOUR VALUE HERE]</value>
    </property>
    fs.s3.awsSecretAccessKey     Value of your AWS secret access key.
    <property>
       <name>fs.s3.awsSecretAccessKey</name>
       <value>[INSERT YOUR VALUE HERE]</value>
    </property>
  3. If needed, enter the AWS Access Key ID and secret access key for S3N:

    Parameter                    Values
    fs.s3n.awsAccessKeyId        Value of your S3N AWS Access Key ID.
    <property>
       <name>fs.s3n.awsAccessKeyId</name>
       <value>[INSERT YOUR VALUE HERE]</value>
    </property>
    fs.s3n.awsSecretAccessKey    Value of your S3N AWS secret access key.
    <property>
       <name>fs.s3n.awsSecretAccessKey</name>
       <value>[INSERT YOUR VALUE HERE]</value>
    </property>
  4. Add the following values:

    Parameter                    Values
    fs.s3n.impl
    <property>
       <name>fs.s3n.impl</name>
       <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
    </property>
    fs.s3.impl
    <property>
       <name>fs.s3.impl</name>
       <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
    </property>
  5. LZO is a compression format that Amazon EMR supports. If you want to configure for LZO compression, you need to download a JAR file. If you do not, you need to remove a parameter from the core-site.xml file.

    • If you are not using LZO compression, remove any references to the iocompression parameter in the core-site.xml file: com.hadoop.compression.lzo.LzoCodec
    • If you are using LZO compression, download the LZO JAR and add it to the pentaho-big-data-plugin/hadoop-configurations/emr3x/lib directory. The LZO JAR can be found here: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/.
  6. Save and close the file.
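
Before handing the file to users, it can help to confirm that the edits above actually landed. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function names are hypothetical, and the choice of which properties count as required is an assumption based on steps 2 and 4 of this procedure (the S3N credentials in step 3 are optional, so they are not checked).

```python
import xml.etree.ElementTree as ET

# Assumption: steps 2 and 4 are mandatory, step 3 (S3N credentials) is optional.
REQUIRED_PROPERTIES = (
    "fs.s3.awsAccessKeyId",
    "fs.s3.awsSecretAccessKey",
    "fs.s3n.impl",
    "fs.s3.impl",
)
# Placeholder text used in the snippets in this article.
PLACEHOLDER = "[INSERT YOUR VALUE HERE]"


def load_properties(path):
    """Return {name: value} for every <property> element in a Hadoop-style XML file."""
    root = ET.parse(str(path)).getroot()
    properties = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            properties[name] = (prop.findtext("value") or "").strip()
    return properties


def find_problems(properties):
    """List required properties that are absent, empty, or still a placeholder."""
    problems = []
    for name in REQUIRED_PROPERTIES:
        value = properties.get(name)
        if value is None:
            problems.append(name + ": missing")
        elif value in ("", PLACEHOLDER):
            problems.append(name + ": value not set")
    return problems
```

Calling find_problems(load_properties("core-site.xml")) returns an empty list when all four properties are present with real values. The check only looks at property names and values. It does not validate the credentials against AWS.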

Edit mapred-site XML file

If you are using MapReduce, you must edit the mapred-site.xml file to indicate where the job history logs are stored and to allow MapReduce jobs to run across platforms.

Perform the following steps to edit the mapred-site.xml file:

Procedure

  1. Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.

  2. Add the following values:

    Parameter                                  Value
    mapreduce.app-submission.cross-platform    This property allows MapReduce jobs to run on either Windows client or Linux server platforms.
    <property>
      <name>mapreduce.app-submission.cross-platform</name>
      <value>true</value>
    </property>
  3. Save and close the file.
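
If you maintain this file for many users, the same edit can be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper names are hypothetical, and only the property name and value come from this article. Note that ElementTree's default parser discards XML comments, so any comments in the original file are lost when it is rewritten. Work on a copy if that matters.

```python
import xml.etree.ElementTree as ET


def set_property(root, name, value):
    """Set <property><name>name</name><value>value</value></property> under root.

    Updates the existing entry if the name is already present, so running
    the script twice does not create duplicates.
    """
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return prop
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def enable_cross_platform(path):
    """Add or update mapreduce.app-submission.cross-platform=true in the given file."""
    tree = ET.parse(path)
    set_property(tree.getroot(), "mapreduce.app-submission.cross-platform", "true")
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Run enable_cross_platform against the mapred-site.xml file in the named connection directory from step 1.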

Connect to a Hadoop cluster with the PDI client

After you have set up the Pentaho Server to connect to a cluster, you must configure and test the connection to the cluster. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.

Connect other Pentaho components to the Amazon EMR cluster

The following sections explain how to create and test a connection to the cluster in the Pentaho Server, PRD, and PME. Creating and testing a connection to the cluster in the PDI client involves two tasks:

Create and test connections

For each Pentaho component, create and test the connection as described in the following list.

  • Pentaho Server for DI

    Create a transformation in the PDI client and run it remotely.

  • Pentaho Server for BA

    Create a connection to the cluster in the Data Source Wizard.

  • PME

    Create a connection to the cluster in PME.

  • PRD

    Create a connection to the cluster in PRD.

After you have connected to the cluster and its services properly, provide the connection information to users who need access to the cluster and its services. Those users can only access the cluster on machines that are properly configured to connect to the cluster.

To connect, users need the following information:

  • Hadoop distribution and version of the cluster
  • HDFS, JobTracker, ZooKeeper, and Hive2/Impala hostnames, IP addresses, and port numbers
  • Oozie URL (if used)

Users also require permissions to access the directories they need on HDFS, such as their home directory and any other required directories.

They might also need more information depending on the job entries, transformation steps, and services they use. For a detailed list of information that your users need to use supported Hadoop services, see Hadoop connection and access information list.