Pentaho Documentation

Pentaho MapReduce

This job entry executes transformations as part of a Hadoop MapReduce job in place of a traditional Hadoop Java class. A Hadoop MapReduce job is made up of any combination of the following types of transformations:

  • The Mapper transformation takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). It performs filtering and sorting (such as sorting students by first name into queues, one queue for each name). It applies a given function to each element of a list, returning a list of results in the same order.

  • The Combiner transformation summarizes the map output records with the same key, which helps to reduce the amount of data written to disk, and transmitted over the network.
  • The Reducer transformation performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). It analyzes a recursive data structure and, through a given combining operation, recombines the results of recursively processing its constituent parts, building up a return value.
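In PDI these roles are implemented as transformations rather than code, but the classic word-count example makes the division of labor concrete. A minimal Python sketch (illustrative only; the function names are hypothetical):

```python
from collections import defaultdict

def mapper(line):
    # Break input down into (key, value) tuples -- here, (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def combiner_or_reducer(pairs):
    # Summarize records that share a key -- here, summing word counts.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [kv for line in ["the quick fox", "the fox"] for kv in mapper(line)]
print(combiner_or_reducer(pairs))  # {'the': 2, 'quick': 1, 'fox': 2}
```

A combiner runs the same summarizing logic on each node's local map output before it crosses the network, which is why it can reduce the amount of data written to disk and transmitted.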

This entry was formerly known as Hadoop Transformation Job Executor.

With the Pentaho MapReduce entry, you specify PDI transformations to use for the mapper, combiner, and/or reducer through their related tabs. The mapper transformation is required. The combiner and reducer transformations are optional. See Working with Big Data and Hadoop in PDI for details on how PDI works with Hadoop clusters.

The Hadoop job name field in the Cluster tab is required and must be specified for the Pentaho MapReduce entry to work.

General

Use the Entry Name field to specify the unique name of the job entry on the canvas. The Entry Name is set to Pentaho MapReduce by default.

Options

The Pentaho MapReduce job entry features several tabs to define your transformations and set up the connection with the Hadoop cluster. Each tab is described below.

Mapper Tab

Pentaho MapReduce - Mapper Tab

The following table describes the options for defining a mapper transformation, which is required by this entry:

Option Definition

Transformation

Specify the transformation that will perform the mapping functions for this job by entering its path or clicking Browse.

If you select a transformation that has the same root path as the current transformation, the variable ${Internal.Entry.Current.Directory} will automatically be inserted in place of the common root path. For example, if the current transformation's path is /home/admin/transformation.ktr and you select the transformation /home/admin/path/sub.ktr, then the path will automatically be converted to ${Internal.Entry.Current.Directory}/path/sub.ktr.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name Specify the name of the step that receives mapping data from Hadoop. It must be a MapReduce Input step.
Output step name Specify the name of the step that passes mapping output back to Hadoop. It must be a MapReduce Output step.
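The path substitution described for the Transformation option can be sketched as follows (a hypothetical helper, not the actual PDI implementation):

```python
import os

def substitute_current_dir(current_path, selected_path):
    # Replace the root shared with the current transformation by the
    # ${Internal.Entry.Current.Directory} variable (illustrative sketch).
    root = os.path.dirname(current_path)          # e.g. /home/admin
    if selected_path.startswith(root + "/"):
        return "${Internal.Entry.Current.Directory}" + selected_path[len(root):]
    return selected_path

print(substitute_current_dir("/home/admin/transformation.ktr",
                             "/home/admin/path/sub.ktr"))
# ${Internal.Entry.Current.Directory}/path/sub.ktr
```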

Combiner Tab

Pentaho MapReduce Entry - Combiner Tab

The following table describes the options for defining a combiner transformation:

Option Definition
Transformation

Specify the transformation that will perform the combiner functions for this job by entering its path or clicking Browse.

You can use any internal variable to specify the path. For example, if you select a transformation that is located in the same folder as the current transformation, you can use the ${Internal.Entry.Current.Directory} internal variable to define the path.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name

Specify the name of the step that receives combiner data from Hadoop. It must be a MapReduce Input step.
Output step name Specify the name of the step that passes combiner output back to Hadoop. It must be a MapReduce Output step.
Use single threaded transformation engine Select to run the combiner transformation with the Single Threaded transformation execution engine. If not selected, the normal multi-threaded transformation engine is used. The Single Threaded engine reduces overhead when processing many small groups of output.

Reducer Tab

Pentaho MapReduce Entry - Reducer Tab

The following table describes the options for defining a reducer transformation:

Option Definition
Transformation

Specify the transformation that will perform the reducer functions for this job by entering its path or clicking Browse.

You can use any internal variable to specify the path. For example, if you select a transformation that is located in the same folder as the current transformation, you can use the ${Internal.Entry.Current.Directory} internal variable to define the path.

If you are working with a repository, specify the name of the transformation in your repository. If you are not working with a repository, specify the XML file name of the transformation on your system.

Transformations previously specified by reference are automatically converted to be specified by name within the Pentaho Repository.

Input step name Specify the name of the step that receives reducing data from Hadoop. It must be a MapReduce Input step.
Output step name Specify the name of the step that passes reducing output back to Hadoop. It must be a MapReduce Output step.
Use single threaded transformation engine Select to run the reducer transformation with the Single Threaded transformation execution engine. If not selected, the normal multi-threaded transformation engine is used. The Single Threaded engine reduces overhead when processing many small groups of output.

Job Setup Tab

Pentaho MapReduce Entry - Job Setup Tab

The following table describes the options for setting up the inputs and outputs of the job:

Option Definition
Input path

Enter the path of the input directory, such as /wordcount/input, from your Hadoop cluster where the source data for the MapReduce job is stored. A comma-separated list can be used for multiple input directories.

Output path

Enter the path of the directory, such as /wordcount/output, on your Hadoop cluster where you want the output from the MapReduce job to be stored.

The output directory cannot exist prior to running the MapReduce job.

Remove output path before job Select to remove the specified output path before the MapReduce job is scheduled.

Input format

Enter the Apache Hadoop class name that describes the input specification for the MapReduce job. See InputFormat for more information.

Output format

Enter the Apache Hadoop class name that describes the output specification for the MapReduce job. See OutputFormat for more information.
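For example, a job reading and writing plain text commonly uses classes such as the following (from the classic Hadoop mapred API; confirm the exact class names against your cluster's Hadoop version):

```
Input format:  org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.mapred.TextOutputFormat
```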

Ignore output of map key

Select to ignore the key output from the mapper transformation and replace it with NullWritable.

Ignore output of map value

Select to ignore the value output from the mapper transformation and replace it with NullWritable.

Ignore output of reduce key

Select to ignore the key output from the combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation to be used, not the Identity Reducer.

Ignore output of reduce value

Select to ignore the value output from the combiner and/or reducer transformations and replace it with NullWritable. This requires a reducer transformation to be used, not the Identity Reducer.
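The four Ignore output options share one behavior: the selected half of each key/value pair is discarded and replaced with NullWritable. A rough Python model (the emit helper and NULL placeholder are hypothetical, for illustration only):

```python
NULL = "<NullWritable>"

def emit(key, value, ignore_key=False, ignore_value=False):
    # Model of the "Ignore output" options: the dropped half of each
    # key/value pair is replaced by a NullWritable stand-in.
    return (NULL if ignore_key else key, NULL if ignore_value else value)

print(emit("fox", 2, ignore_key=True))  # ('<NullWritable>', 2)
```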

Cluster Tab

Pentaho MapReduce Entry - Cluster Tab

The following table describes the options for setting up configurations for the Hadoop cluster connection:

Option Definition
Hadoop job name Enter the name of the Hadoop job you are running. It is required for the Pentaho MapReduce entry to work.

Hadoop Cluster

Specify the configuration of your Hadoop cluster through the following options:

  • Select an existing configuration. If your configuration does not appear in this list, create it with the New button.
  • Click Edit to use the Hadoop cluster dialog box to modify an existing configuration. See the Hadoop Cluster Configuration section for further details on this dialog box.
  • Click New to use the Hadoop cluster dialog box to create a new configuration. See the Hadoop Cluster Configuration section for further details on this dialog box.

See Work with Big Data for general information on Hadoop cluster configurations.

Number of Mapper Tasks

Enter the number of mapper tasks you want to assign to this job. The size of the inputs should determine the number of mapper tasks. Typically, there should be between 10 and 100 map tasks per node, though you can specify a higher number for map tasks that are not CPU-intensive.

Number of Reducer Tasks

Enter the number of reducer tasks you want to assign to this job. Lower numbers mean that the reduce operations can launch immediately and start transferring map outputs as the maps finish. The higher the number, the quicker the nodes will finish their first round of reduces and launch a second round. Increasing the number of reduce operations increases the Hadoop framework overhead, but improves load balancing.

If this is set to 0, then no reduce operation is performed, and the output of the mapper becomes the output of the entire job. Combiner operations will also not be performed.

Logging Interval

Enter the number of seconds between log messages.

Enable Blocking

Select to force the job entry to wait until each step completes before continuing to the next step. This is the only way for PDI to be aware of a Hadoop job's status.

If this option is not selected, the Hadoop job executes blindly and PDI moves on to the next job entry. Error handling and routing will not work unless this option is selected.

Hadoop Cluster Configuration

When you click the Hadoop Cluster Edit or New button, the Hadoop cluster dialog box appears. Use this dialog box to specify configuration details such as host names and ports for HDFS, Job Tracker, and other big data cluster components. These configuration options are reused in the related transformation steps and job entries that support big data features.

Option Definition

Cluster Name

Enter the name to assign to this cluster configuration.

Use MapR Client Select to indicate that this configuration is for a MapR cluster. If this option is selected, the fields in the HDFS and JobTracker sections are disabled because those parameters are not needed to configure MapR.
Hostname (in HDFS section) Enter the hostname for the HDFS node in your Hadoop cluster. 
Port (in HDFS section) Enter the port for the HDFS node in your Hadoop cluster.  
Username (in HDFS section) Enter the username for the HDFS node.
Password (in HDFS section) Enter the password for the HDFS node.
Hostname (in JobTracker section) Enter the hostname for the JobTracker node in your Hadoop cluster. If you have a separate job tracker node, type in the hostname here. Otherwise, use the HDFS hostname. 

Port (in JobTracker section)

Enter the port for the JobTracker in your Hadoop cluster. Job tracker port number cannot be the same as the HDFS port number. 
Hostname (in ZooKeeper section) Enter the hostname for the ZooKeeper node in your Hadoop cluster.

Port (in ZooKeeper section)

Enter the port for the ZooKeeper node in your Hadoop cluster.
URL (in Oozie section) Enter a URL of a valid Oozie location.

After you have finished setting these configuration options, perform the following steps:

  1. Click Test to try your configurations on the Hadoop cluster. If you are unable to connect, see Connect to a Hadoop Cluster in the PDI client for further details on Hadoop cluster connections.
  2. Click OK to return to the Cluster tab.

User Defined Tab

Pentaho MapReduce Entry - User Defined Tab

The following table describes the options for defining user-defined parameters and variables:

Column Definition
Name

Enter the name of the user-defined parameter or variable that you want to set. To set a Java system property, prefix the variable name with java.system. (java.system.SAMPLE_VARIABLE, for example).

Kettle variables that are set here override the Kettle variables set in the kettle.properties file. For more information on how to set a Kettle variable, see Set Kettle Variables.

Value Enter the value of the user-defined parameter or variable that you want to set.
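The java.system prefix convention can be modeled as follows (a hypothetical helper sketching the documented behavior, not the actual PDI code):

```python
def classify(name, value):
    # Names prefixed with "java.system." are set as Java system properties;
    # everything else is treated as an ordinary Kettle variable or parameter.
    prefix = "java.system."
    if name.startswith(prefix):
        return ("system-property", name[len(prefix):], value)
    return ("variable", name, value)

print(classify("java.system.SAMPLE_VARIABLE", "1"))
# ('system-property', 'SAMPLE_VARIABLE', '1')
```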