Pentaho Documentation

Managing Reusable Hadoop Cluster Configuration Settings

Overview

Explains how to use the Hadoop Clusters feature to store your configuration data for reuse.

When you configure a job or transformation to use a Hadoop cluster, you can store some of the cluster configuration settings, such as hostnames and port numbers, so that they can be reused. This saves time because you do not have to enter the same configuration information again.

This feature is not available for all steps and entries.  Check the step and entry documentation to see whether a specific step or entry supports reusable Hadoop Cluster configurations.

Specify New Hadoop Cluster Configurations

You can specify a new Hadoop cluster configuration once and reuse it in other places. If you are connected to a repository when you create the configuration, it is available to other users of that repository. If you are not connected to a repository when you create it, the configuration is still available for use in your own steps and entries that support this feature. You can specify new Hadoop cluster configurations in three places:

  • Individual transformation steps and job entries, such as the Pentaho MapReduce job entry
  • Transformation or job View tab
  • Repository Explorer window

Specify Hadoop Cluster Configurations in a Step or Entry

To specify Hadoop cluster configurations in a step or entry, do the following.   

  1. In Spoon, create a new job or transformation or open an existing one.
  2. Drag a step or entry that supports named Hadoop cluster configurations to the Spoon canvas.  
  3. Click the New button that is next to the Hadoop Cluster field.  Since the Hadoop Cluster field location varies, see the step or entry documentation for the location of the field.  The following screenshot shows the Hadoop Cluster field in the Oozie Job Executor entry.

OozieJobExecutor.png

  4. The Hadoop cluster window appears.  Enter a name for the configuration, then enter the rest of the information for the cluster configuration.

HadoopClusterWindow.png

  5. When complete, click the OK button.  The new configuration appears in the drop-down list.

Specify Hadoop Cluster Configurations in the View Tab

To specify Hadoop cluster configurations in the transformation or job View tab, complete these steps.

  1. In Spoon, create a new job or transformation or open an existing one.
  2. Click the View tab.

view_and_hadoop_clusters.png

  3. Right-click the Hadoop clusters folder, then click New.  The Hadoop cluster window appears.
  4. Enter a name for the configuration, then enter the rest of the information for the cluster.
  5. When complete, click the OK button.  The new Hadoop cluster configuration appears under the Hadoop clusters folder.

Specify Hadoop Cluster Configurations in the Repository Explorer

To specify Hadoop cluster configurations in the Repository Explorer window, do the following.

  1. In Spoon, connect to the repository where you want to store the transformation or job.
  2. Select Tools > Repository > Explore to open the Repository Explorer window.
  3. Click the Hadoop clusters tab.
  4. Click the New button. The Hadoop Cluster window appears.
  5. Enter a name for the configuration, then enter the rest of the configuration information.
  6. When complete, click the OK button.  The new Hadoop cluster appears in the list.

Edit Hadoop Cluster Configurations

You can edit Hadoop cluster configurations in three places:

  • Individual transformation steps and job entries, such as the Pentaho MapReduce job entry
  • Transformation or job View tab
  • Repository Explorer window

How updates occur depends on whether you are connected to the repository.

  • If you are connected to a repository, Hadoop Cluster configuration changes are picked up by all transformations and jobs in the repository.  The Hadoop Cluster configuration is loaded during execution unless it cannot be found.  If the configuration cannot be found, the configuration values that were stored when the transformation or job was saved are used instead.
  • If you are not connected to a repository, Hadoop Cluster configuration changes are picked up only by your local (file system) transformations and jobs.  If you run these transformations and jobs outside of Kettle, they do not have access to the Hadoop Cluster configuration, so a copy of the configuration is saved with them as a fallback.  Changes to the Hadoop Cluster configuration do not reach the fallback copy in a transformation or job until you re-save that transformation or job.
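
The lookup order described above can be sketched as follows. This is a minimal illustration in Python, not Pentaho's implementation: `ClusterConfig`, `resolve_cluster`, the field names, and the host names are all hypothetical, and the shared store is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical stand-in for a named Hadoop cluster configuration."""
    name: str
    hdfs_host: str
    hdfs_port: str


def resolve_cluster(
    name: str,
    shared_store: Optional[Mapping[str, ClusterConfig]],
    embedded_copy: ClusterConfig,
) -> ClusterConfig:
    """Prefer the shared configuration; fall back to the saved copy.

    shared_store is None when no shared store is reachable at all.
    embedded_copy is the snapshot saved with the transformation or job.
    """
    if shared_store is not None:
        shared = shared_store.get(name)
        if shared is not None:
            return shared
    return embedded_copy


# Snapshot stored when the job was last saved (hypothetical values).
embedded = ClusterConfig("prod", "old-namenode.example.com", "8020")

# Shared store that was edited after the job was saved.
store = {"prod": ClusterConfig("prod", "new-namenode.example.com", "8020")}

found = resolve_cluster("prod", store, embedded)      # uses the edited values
missing = resolve_cluster("prod", {}, embedded)       # falls back to the snapshot
offline = resolve_cluster("prod", None, embedded)     # falls back to the snapshot
```

The sketch also shows why a stale fallback is possible: the snapshot only changes when the transformation or job is saved again.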

We recommend that you use Kettle variables for each value in the Hadoop Cluster configuration to mitigate some of the risk associated with running jobs and transformations in environments that are disconnected from the repository. 
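
As an illustration of why variables help, the sketch below resolves `${NAME}` placeholders, the syntax Kettle uses for variables, against a per-environment set of values. The function `substitute`, the variable names `HDFS_HOSTNAME` and `HDFS_PORT`, and the host names are assumptions made for this example. Leaving unknown variables untouched is a choice made in this sketch, not a statement about Kettle's behavior.

```python
import re
from typing import Mapping

# Matches ${NAME} placeholders.
_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def substitute(value: str, variables: Mapping[str, str]) -> str:
    """Replace each ${NAME} in value with variables[NAME], if defined."""
    return _VARIABLE.sub(lambda m: variables.get(m.group(1), m.group(0)), value)


# Hypothetical cluster fields stored once, with variables instead of literals.
cluster_fields = {"hostname": "${HDFS_HOSTNAME}", "port": "${HDFS_PORT}"}

# Hypothetical per-environment values (for example, from kettle.properties).
dev = {"HDFS_HOSTNAME": "dev-namenode.example.com", "HDFS_PORT": "8020"}
prod = {"HDFS_HOSTNAME": "prod-namenode.example.com", "HDFS_PORT": "8020"}

resolved_dev = {key: substitute(val, dev) for key, val in cluster_fields.items()}
resolved_prod = {key: substitute(val, prod) for key, val in cluster_fields.items()}
```

Because the saved fallback copy then holds variable names, not literal hostnames, the same transformation or job can pick up the right values in each environment even when it runs from its fallback copy.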

Edit Hadoop Cluster Configuration in a Step or Entry

To edit Hadoop cluster configurations in a step or entry, complete these steps.  

  1. In Spoon, open the step or entry that has the Hadoop cluster configuration you want to edit. 
  2. In the Hadoop Cluster field, select the configuration from the drop-down menu, then click the Edit button.  Since the Hadoop Cluster field location varies, see the step or entry documentation for the location of the field.
  3. The Hadoop cluster window appears.  Make changes as needed.
  4. When finished, click the OK button.

Edit Hadoop Cluster Configurations in the View Tab

To edit Hadoop cluster configurations from the transformation or job View tab, complete these steps.  

  1. Open the transformation or job in Spoon.
  2. Click the View tab.
  3. Click the Hadoop Clusters folder to open it.
  4. Right-click the configuration you want to edit, then select Edit.  The Hadoop cluster window appears.  Make changes as needed.
  5. When finished, click the OK button.

Edit Hadoop Cluster Configurations in the Repository Explorer

To edit Hadoop cluster configurations from the Repository Explorer window, do the following.

  1. In Spoon, connect to the repository where you stored the transformation or job.
  2. Select Tools > Repository > Explore to open the Repository Explorer window.
  3. Click the Hadoop Clusters tab.
  4. Select the configuration you want to edit, then click the Edit button.
  5. The Hadoop cluster window appears.  Make changes as needed. 
  6. When finished, click the OK button.

Duplicate a Hadoop Cluster Configuration

To duplicate or clone a Hadoop Cluster configuration, do the following.

  1. Open a transformation or job in Spoon.
  2. Click the View tab. 
  3. Click the Hadoop clusters folder to see its contents.
  4. Right-click the Hadoop cluster you want to duplicate and select Duplicate.
  5. The Hadoop cluster window appears.  Enter a different name in the Cluster Name field.
  6. Click OK.

Delete Hadoop Cluster Configuration

You can delete Hadoop cluster configurations as needed.  Once you delete a configuration, it cannot be restored, but you can specify a new Hadoop cluster configuration at any time.

Note that you can still run transformations and jobs that reference deleted named Hadoop cluster configurations because configuration details are stored in the transformation and job metadata files.

Delete Hadoop Cluster Configurations in the View Tab

To delete Hadoop cluster configurations from the transformation or job View tab, complete these steps.

  1. Open a transformation or job in Spoon.
  2. Click the View tab. 
  3. Click the Hadoop clusters folder to see its contents.
  4. Right-click the Hadoop cluster you want to delete and select Delete.
  5. A message appears asking if you really want to delete the configuration.  Click Yes.

Delete Hadoop Cluster Configuration in the Repository Explorer

To delete Hadoop cluster configurations from the Repository Explorer window, do the following.

  1. In Spoon, connect to a repository, then select Tools > Repository > Explore.
  2. Click the Hadoop Clusters tab.
  3. Click the Hadoop cluster configuration you want to delete and click the Delete button.
  4. A message appears asking if you really want to delete the Hadoop cluster configuration.  Click Yes.