
Spark Submit

Description

Apache Spark is an open-source cluster computing framework. With the Spark Submit entry, you can submit Spark jobs to CDH clusters (version 5.3 and later), HDP clusters (2.3 and later), and EMR clusters (3.10 and later). The Spark job you submit can be written in Java, Scala, or Python.

Install and Configure Spark Client for PDI Use

Before you use this entry, you will need to install and configure a Spark client on any node from which you will run Spark jobs.  

Before You Begin

Configuring the Spark Client

You will need to configure the Spark client to work with the cluster on every machine from which Spark jobs will be run. Complete these steps:

  1. On the client, download a Spark distribution of the same or a higher version than the one used on the cluster.
  2. Set the HADOOP_CONF_DIR environment variable to the following: pentaho-big-data-plugin/hadoop-configurations/<shim directory>
  3. Navigate to <SPARK_HOME>/conf/ and create the spark-defaults.conf file using the instructions here: https://spark.apache.org/docs/latest/configuration.html
  4. In the spark-defaults.conf file, add the following line. If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar in your environment (a complete example configuration appears after these steps). Here are a couple of examples:
  • CDH Example: spark.yarn.jar hdfs://nn1.example.com/user/spark/share/lib/spark-assembly.jar 
  • HDP Example: spark.yarn.jar hdfs://nn1.example.com/user/spark/hadoop27/spark-assembly.jar
  5. Create home folders with write permissions for each user who will be running the Spark job. For example:
  • hadoop fs -mkdir /user/<user name>
  • hadoop fs -chown <user name> /user/<user name>
  6. If you are connecting to an HDP cluster, add the following lines in the spark-defaults.conf file:
  • spark.driver.extraJavaOptions -Dhdp.version=2.7.1.2.3.0.0-2557
  • spark.yarn.am.extraJavaOptions -Dhdp.version=2.7.1.2.3.0.0-2557

The Hadoop version in these lines should be the same as the Hadoop version used on the cluster.

  7. If you are connecting to a supported version of an HDP cluster, a CDH 5.5 cluster, or a CDH 5.7 cluster, open the core-site.xml file and comment out the net.topology.script.file.name property, like this:
<!--
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology_script.py</value>
</property>
-->
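
Putting these pieces together, a spark-defaults.conf for an HDP cluster might look like the following sketch. The name node host, jar location, and hdp.version string are the example values used above; substitute the values from your own environment.

spark.yarn.jar hdfs://nn1.example.com/user/spark/hadoop27/spark-assembly.jar
spark.driver.extraJavaOptions -Dhdp.version=2.7.1.2.3.0.0-2557
spark.yarn.am.extraJavaOptions -Dhdp.version=2.7.1.2.3.0.0-2557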

Troubleshooting

If you are connecting to a CDH 5.7 cluster while using Apache Spark 1.6.0 on your client node, an error may occur when you try to run a job containing a Spark Submit entry in yarn-client mode. The error will be similar to the following message:

  • Caused by: java.io.InvalidClassException: org.apache.spark.rdd.MapPartitionsRDD; local class incompatible: stream classdesc serialVersionUID = -1059539896677275380, local class serialVersionUID = 6732270565076291202

Perform one of the following tasks to resolve this error:

  • Install and configure CDH 5.7 Spark on the client machine where Pentaho is running instead of Apache Spark 1.6.0. See Cloudera documentation for Spark installation instructions.
  • If you want to use Apache Spark 1.6.0 on a client machine, then upload spark-assembly.jar from the client machine to your cluster in HDFS, and point the spark.yarn.jar property in the spark-defaults.conf file to this uploaded spark-assembly.jar file on HDFS.
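
For example, assuming Spark 1.6.0 is installed under /opt/spark on the client and that you reuse the HDFS location from the earlier CDH example (the paths and the exact assembly jar name are placeholders that depend on your installation), the upload and the matching spark-defaults.conf entry might look like this:

On the client machine:
  hadoop fs -mkdir -p /user/spark/share/lib
  hadoop fs -put /opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar /user/spark/share/lib/

Then, in spark-defaults.conf:
  spark.yarn.jar hdfs://nn1.example.com/user/spark/share/lib/spark-assembly-1.6.0-hadoop2.6.0.jar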

Spark Submit Entry Properties

Both the yarn-cluster and yarn-client modes are supported. Descriptions of these modes can be found in the Apache Spark documentation on running Spark on YARN.

If you have configured your Hadoop Cluster and Spark for Kerberos, a valid Kerberos ticket must already be in the ticket cache area on your client machine before you launch and submit the Spark Submit job.
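
For example, on a Kerberos-enabled cluster you might obtain and verify a ticket from a terminal before starting PDI (the principal shown is a placeholder):

  kinit pdi_user@EXAMPLE.COM
  klist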

Job Setup

The Job Setup tab contains the following fields:

  • Entry Name: Name of the entry. You can customize this, or leave it as the default.
  • Spark Submit Utility: Script that launches the Spark job.
  • Master URL: The master URL for the cluster. Two options are supported:
      • Yarn-Cluster, which runs the driver program as a thread of the YARN application master, on one of the node managers in the cluster. This is very similar to the way MapReduce works.
      • Yarn-Client, which runs the driver program on the YARN client. Tasks are still executed in the node managers of the YARN cluster.
  • Type: The file type of your Spark job to be submitted. Your job can be written in Java, Scala, or Python. The fields displayed in the Files tab depend on the language option you select.
  • Enable Blocking: This option is enabled by default. If it is selected, the job entry waits until the Spark job finishes running. If it is not, the job entry proceeds with its execution once the Spark job is submitted for execution.

Python support on Windows requires Spark version 1.5.2 or higher.
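
To illustrate the two master URL options, these are roughly the equivalent standalone spark-submit invocations for a Spark 1.x client (the class name and jar paths are placeholders); the Spark Submit entry builds a comparable call for you through the configured Spark Submit Utility:

  spark-submit --master yarn-cluster --class com.example.SparkJob hdfs://nn1.example.com/jobs/spark-job.jar
  spark-submit --master yarn-client --class com.example.SparkJob /opt/jobs/spark-job.jar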

Files Tab

If you select Java or Scala as the file Type, the Files tab will contain the following fields:

  • Class: Optional entry point for your application.
  • Application Jar: The main file of the Spark job you are submitting. It is a path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • Dependencies: The Environment and Path of other packages, bundles, or libraries used as a part of your Spark job. Environment defines whether these dependencies are Local to your machine or Static on the HDFS or the web.

If you select Python as the file Type, the Files tab will contain the following fields:

  • Py File: The main Python file of the Spark job you are submitting.
  • Dependencies: The Environment and Path of other packages, bundles, or libraries used as a part of your Spark job. Environment defines whether these dependencies are Local to your machine or Static on the HDFS or the web.
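
As a point of reference, a standalone spark-submit for a Python job with extra dependencies might look like the following (the file names are placeholders):

  spark-submit --master yarn-client --py-files hdfs://nn1.example.com/jobs/deps.zip hdfs://nn1.example.com/jobs/main_job.py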

Arguments Tab

The Arguments tab contains the following field:

  • Arguments: Arguments passed to your main Java class, Scala class, or Python Py file, if any. Use this text box to specify these arguments.

Options Tab

The Options tab contains the following fields:

  • Executor Memory: Amount of memory to use per executor process. Use the JVM format (for example, 512m or 2g).
  • Driver Memory: Amount of memory to use for the driver process. Use the JVM format (for example, 512m or 2g).
  • Utility Parameters: Name and Value of optional Spark configuration parameters associated with the spark-defaults.conf file.
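
For comparison, the memory settings play the same role as the standard spark-submit memory flags, and Utility Parameters roughly correspond to --conf key-value pairs; the values below are purely illustrative:

  spark-submit --master yarn-cluster --executor-memory 2g --driver-memory 1g --conf spark.executor.instances=4 --class com.example.SparkJob spark-job.jar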