Pentaho Documentation

Using the YARN Workspace Folder to Copy Files to the YARN Cluster

Overview

Explains how to use the YARN Workspace folder.

If you start a job that runs on a YARN cluster and the job needs other files to execute, such as variables from your local copy of kettle.properties, those files must be copied to the YARN cluster. An easy way to do this is to add those files to the YARN Workspace folder. At runtime, PDI copies all of the files in the YARN Workspace folder to the YARN cluster. This feature is well suited to jobs that move through the development, testing, and staging lifecycle, because the job uses the configuration files in the KETTLE_HOME directory of the environment in which it runs.

Files in the YARN Workspace folder are copied to the YARN cluster every time you run a job that starts the YARN Kettle Cluster. If you don't want to overwrite files of the same name that are already on the YARN Kettle Cluster, delete those files from the YARN Workspace folder. Then, in the Start a YARN Kettle Cluster entry window, deselect the corresponding checkboxes in the Copy Local Resource Files to YARN section.

Add Files to the YARN Workspace Folder 

These instructions explain how to configure the Start a YARN Kettle Cluster entry so that the following files are copied at runtime to the YARN Workspace folder and then to the YARN cluster: kettle.properties, shared.xml, and repositories.xml. These instructions also explain how to manually copy additional files to the folder.

If the job is run from your local installation, the configuration files from your KETTLE_HOME directory are copied to the YARN Workspace folder.  If the job is scheduled or is run on a Pentaho DI Server, the configuration files from the server's configured KETTLE_HOME are copied to the YARN Workspace folder.  
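To check which configuration files are available to be copied before you run the job, a small script along the following lines may help. This is a minimal sketch, not part of PDI: it assumes the usual Kettle layout in which the files live in a .kettle subdirectory of KETTLE_HOME (or of the user's home directory when KETTLE_HOME is not set), and the function names are hypothetical.

```python
import os
from pathlib import Path

# The three files the Start a YARN Kettle Cluster entry can copy.
CONFIG_FILES = ("kettle.properties", "shared.xml", "repositories.xml")


def kettle_config_dir(env=None):
    """Return the directory assumed to hold the Kettle configuration files.

    Assumption: the files sit in <KETTLE_HOME>/.kettle, falling back to
    <user home>/.kettle when KETTLE_HOME is not set.
    """
    env = os.environ if env is None else env
    base = env.get("KETTLE_HOME")
    root = Path(base) if base else Path.home()
    return root / ".kettle"


def existing_config_files(config_dir):
    """List the configuration files that are actually present in config_dir."""
    config_dir = Path(config_dir)
    return [config_dir / name for name in CONFIG_FILES
            if (config_dir / name).is_file()]


if __name__ == "__main__":
    for path in existing_config_files(kettle_config_dir()):
        print(path)
```

Run on the machine that starts the job (your workstation or the DI Server host), the script prints the configuration files found for that environment.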

Complete these steps.

  1. Set the active YARN Hadoop cluster using the instructions found in Configuring Pentaho for Your Hadoop Distro and Version.
  2. Complete the instructions in the Additional Configuration for YARN shims article.
  3. In Spoon, create or open a job that contains the Start a YARN Kettle Cluster entry. 
  4. Open the Start a YARN Kettle Cluster entry.
  5. Select any combination of the kettle.properties, shared.xml, and repositories.xml checkboxes in the Copy Local Resource Files to YARN section of the window.
  6. Save and close the Start a YARN Kettle Cluster entry.
  7. If you want to copy other files to the cluster, manually copy them to the YARN Workspace folder here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.
  8. Save and run the job.

At runtime, the kettle.properties, shared.xml, and repositories.xml files (whatever was selected) are copied to the YARN Workspace folder and then to the YARN cluster.
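The manual copy in step 7 can also be scripted. The following is a minimal sketch under stated assumptions: the workspace path is the relative path given above, the directory it is resolved against (plugins_root) is something you must supply for your installation, and copy_to_workspace is a hypothetical helper, not a PDI API.

```python
import shutil
from pathlib import Path

# Relative location of the YARN Workspace folder, as given in the steps above.
WORKSPACE_SUBPATH = Path(
    "pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace"
)


def copy_to_workspace(files, plugins_root):
    """Copy each file into the YARN Workspace folder under plugins_root.

    plugins_root is an assumption: the directory of your installation that
    contains pentaho-big-data-plugin. Files with the same name that are
    already in the workspace are overwritten.
    """
    workspace = Path(plugins_root) / WORKSPACE_SUBPATH
    if not workspace.is_dir():
        raise FileNotFoundError(f"YARN Workspace folder not found: {workspace}")

    copied = []
    for name in files:
        source = Path(name)
        destination = workspace / source.name
        shutil.copy2(source, destination)  # keeps timestamps and permissions
        copied.append(destination)
    return copied
```

For example, copy_to_workspace(["lookup.csv"], "/path/to/plugins") would place lookup.csv in the workspace folder; both the file name and the root path here are placeholders.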

Delete Files from the YARN Workspace Folder

To delete files from the YARN Workspace folder, remove them manually. The YARN Workspace folder is located here: pentaho-big-data-plugin/plugins/pentaho-kettle-yarn-plugin/workspace.
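If you clear the folder often, the removal can be scripted. This is a minimal sketch, assuming you pass the full path to the workspace folder for your installation; remove_from_workspace is a hypothetical helper, not a PDI API. It deletes only regular files directly inside the folder and leaves subdirectories alone.

```python
from pathlib import Path


def remove_from_workspace(workspace_dir, names=None):
    """Delete files from the YARN Workspace folder.

    names: optional collection of file names to delete. When omitted,
    every regular file directly inside the folder is deleted.
    Returns the names of the files that were removed.
    """
    workspace = Path(workspace_dir)
    removed = []
    for entry in sorted(workspace.iterdir()):
        if entry.is_file() and (names is None or entry.name in names):
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Passing names, for example remove_from_workspace(workspace, {"kettle.properties"}), removes only the listed files, which suits the case where you want to stop a single file on the cluster from being overwritten.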