Pentaho Documentation

Using Spark with PDI

Overview

These instructions explain how to use the Spark Submit job entry to run the Word Count sample on a text file that you supply. 
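
As a rough illustration of what the Word Count sample computes, the following is a minimal plain-Python sketch, not the Spark code itself. It assumes words are separated by single spaces; the function name `word_count` and the sample lines are hypothetical, and the sample's exact tokenization and output format may differ.

```python
from collections import Counter


def word_count(lines):
    """Count occurrences of each space-separated token across all lines."""
    counts = Counter()
    for line in lines:
        # Split on single spaces and skip empty tokens left by repeated spaces.
        counts.update(token for token in line.split(" ") if token)
    return counts


# Hypothetical sample input, for illustration only.
sample = ["to be or not to be", "that is the question"]
for word, count in sorted(word_count(sample).items()):
    print(f"{word}: {count}")
```

Running the Spark sample on your own file should produce a comparable list of words with their counts.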

Install the Spark Client

Before you start, install and configure the Spark client as described in the documentation for the Spark Submit job entry (see Spark Submit).

Modify the Spark Sample

The following example demonstrates how to use PDI to submit a Spark job.

Open and Rename the Job

To copy files in these instructions, use either the Hadoop Copy Files job entry or the Hadoop command line tools. For an example of how to do this with PDI, see the tutorial at http://wiki.pentaho.com/display/BAD/Loading+Data+into+HDFS.

  1. Copy a text file that contains words that you would like to count to the HDFS on your cluster.
  2. Start Spoon.
  3. Open the Spark Submit.kjb job, which is in <pentaho-home>/design-tools/data-integration/samples/jobs.
  4. Select File > Save As, then save the file as Spark Submit Sample.kjb.
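
If you use the Hadoop command line tools for step 1, the copy can be sketched as below. This is a minimal example, assuming the `hdfs` command is on your PATH; the helper name `hdfs_put_command` and both paths are hypothetical placeholders.

```python
import shlex


def hdfs_put_command(local_path, hdfs_dir):
    """Build the argument list that copies a local file into an HDFS directory."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


# Hypothetical paths, for illustration only; substitute your own.
cmd = hdfs_put_command("/tmp/words.txt", "/user/pentaho/input")

# Print the command so it can be reviewed before it is run.
print(shlex.join(cmd))
```

To execute the command from Python rather than just print it, pass the list to `subprocess.run(cmd, check=True)` on a machine that has the Hadoop client configured for your cluster.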

Submit the Spark Job

To submit the Spark job, complete the following steps.

  1. Open the Spark PI job entry.  Spark PI is the name given to the Spark Submit entry in the sample.
  2. In the Job Setup tab, enter the path to the spark-submit utility in the Spark Submit Utility field. The utility is located in the directory where you installed the Spark client.
  3. Indicate the path to your Spark examples jar (either the local version or the one on the cluster in the HDFS) in the Application Jar field.  The Word Count example is in this jar.
  4. In the Class Name field, add the following: org.apache.spark.examples.JavaWordCount.
  5. We recommend that you set the Master URL to yarn-client.  To read more about other execution modes, see https://spark.apache.org/docs/1.2.1/submitting-applications.html
  6. In the Arguments field, indicate the path to the file you want to run Word Count on. 
  7. Click the OK button. 
  8. Save the job.
  9. Run the job. As the job runs, the results of the Word Count program appear in the Execution Results pane.
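
For reference, the fields set in the steps above correspond to the arguments of a standard spark-submit command line. The sketch below assembles that command, assuming the job entry passes the field values through in the usual spark-submit form. The helper name `spark_submit_command` and all paths are hypothetical; only the class name and master URL come from the steps above.

```python
import shlex


def spark_submit_command(utility, app_jar, class_name, master, args):
    """Assemble a spark-submit command line from the job entry's field values."""
    return [
        utility,                # Spark Submit Utility field
        "--class", class_name,  # Class Name field
        "--master", master,     # Master URL field
        app_jar,                # Application Jar field
        *args,                  # Arguments field
    ]


# Hypothetical paths, for illustration only; substitute your own.
cmd = spark_submit_command(
    utility="/opt/spark/bin/spark-submit",
    app_jar="/opt/spark/lib/spark-examples.jar",
    class_name="org.apache.spark.examples.JavaWordCount",
    master="yarn-client",
    args=["hdfs:///user/pentaho/input/words.txt"],
)
print(shlex.join(cmd))
```

Running the printed command in a terminal can help confirm that the Spark client works independently of PDI before you troubleshoot the job entry.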