Spark Submit

Before you begin

Before you install Spark, we strongly recommend that you review the following Spark documentation, release notes, and known issues first:

You may also want to reference the https://spark.apache.org/docs/2.3.2/submitting-applications.html instructions about how to submit jobs for Spark.

Install and configure Spark client for PDI use

You will need to install and configure the Spark client on every machine where you want to run Spark jobs using PDI. The configuration of the client depends on your version of Spark. Pentaho supports Cloudera Distribution of Spark (CDS) versions 2.3.x and 2.4.x.

CautionPentaho does not support version 2.4.2 because it does not support Scala version 2.11.

Spark version 2.x.x

Complete the following steps to install and configure the Spark client:

Procedure

On the client, download the Spark distribution of the same or higher version as the one used on the cluster.
Set the HADOOP_CONF_DIR environment variable to a folder containing cluster configuration files as shown in the following sample for an already-configured driver:
- <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>
Navigate to <SPARK_HOME>/conf and create the spark-defaults.conf file using the instructions outlined in https://spark.apache.org/docs/latest/configuration.html.
Create a ZIP archive containing all the JAR files in the SPARK_HOME/jars directory.
Copy the ZIP file from the local file system to a world-readable location on the cluster.
Edit the spark-defaults.conf file to set the spark.yarn.archive property to the world-readable location of your ZIP file on the cluster as shown in the following examples:
- spark.yarn.archive hdfs://NameNode hostname:8020/user/spark/lib/your ZIP file
Add the following line of code to the spark-defaults.conf file:
- spark.hadoop.yarn.timeline-service.enabled false
If you are connecting to an HDP cluster, add the following lines in the spark-defaults.conf file:
- spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
- spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
NoteThe -Dhdp version should be the same as Hadoop version used on the cluster.
If you are connecting to an HDP cluster, also create a text file named java-opts in the <SPARK_HOME>/conf folder and add your HDP version to it as shown in the following example:
- -Dhdp.version=2.3.0.0-2557
NoteRun the hdp-select status Hadoop client command to determine your version of HDP.
If you are connecting to a supported version of the HDP or CDH cluster, open the core-site.xml file, then comment out the net.topology.script.file property as shown in the following code block:
```
 
```

Results

The Spark client is now ready for use with Spark Submit in PDI.

General

Set up information for the Spark Submit job entry is detailed below. For additional information about configuring the Spark-submit utility tools, see the Cloudera documentation.

The following table describes the fields for setting up your Spark job:

Field	Description
Entry Name	Specify the name of the entry. You can customize it or leave it as the default.
Spark Submit Utility	Specify the name of the script that launches the Spark job, which is the batch/shell file name of the underlying spark-submit tool. For example, `Spark2-submit`.
Master URL	Choose a master URL for the cluster from the drop-down: yarn-cluster: Runs the driver program as a thread of the YARN application master (one of the node managers in the cluster). This option is similar to the way MapReduce works. yarn-client: Runs the driver program on the YARN client. Tasks are still executing in the node managers of the YARN cluster.
Type	Select the file type of the Spark job you want to submit. Your job can be written in Java, Scala, or Python. The fields displayed in the Files tab will depend on what language option you select. Python support on Windows requires Spark version 2.3.x or higher.
Enable Blocking	Select Enable Blocking to have the Spark Submit entry wait until the Spark job finishes running. If this option is not selected, the Spark Submit entry proceeds with its execution once the Spark job is submitted for execution. Blocking is enabled by default.

We support the yarn-cluster and yarn-client modes. For descriptions of the modes, see the Spark documentation.

NoteIf you have configured your Hadoop cluster and Spark for Kerberos, a valid Kerberos ticket must already be in the ticket cache area on your client machine before you launch and submit the Spark Submit job.

Options

The Spark Submit entry features several tabs with fields. Each tab is described below.

Files tab

The fields of this tab depend on whether you set the Spark job Type to Java or Scala or Python.

Java or Scala

Files tab, Java or Scala, Spark Submit

If you select Java or Scala as the file Type, the Files tab will contain the following options:

Option	Description
Class	Optionally, specify the entry point for your application.
Application Jar	Specify the main file of the Spark job you are submitting. It is a path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
Dependencies	Specify the Environment and Path of other packages, bundles, or libraries used as a part of your Spark job. Environment defines whether these dependencies are Local to your machine or Static on the cluster or the web.

Python

Files tab, Python, Spark Submit

If you select Python as the file Type, the Files tab will contain the following options:

Option	Description
Py File	Specify the main Python file of the Spark job you are submitting.
Dependencies	Specify the Environment and Path of other packages, bundles, or libraries used as a part of your Spark job. Environment defines whether these dependencies are Local to your machine or Static on the cluster or the web.

Arguments tab

Arguments tab, Spark Submit

Enter the following information to pass arguments to the Spark job:

Option	Description
Arguments	Specify and arguments passed to your main Java class, Scala class, or Python Py file through the text box provided.

Options tab

Options tab, Spark Submit

Enter the following information to adjust the amount of memory used or define optional parameters:

Option	Description
Executor Memory	Specify the amount of memory to use per executor process. Use the JVM format (for example, `512m` or `2g`).
Driver Memory	Specify the amount of memory to use per driver. Use the JVM format (for example, `512m` or `2g`).
Utility Parameters	Specify the Name and Value of optional Spark configuration parameters associated with the spark-defaults.conf file. To set up job entries on secure clusters, see Use Kerberos with Spark Submit.

Troubleshooting your configuration

Errors may occur when you are trying to run a Spark Submit job entry:

If execution of your Spark application was unsuccessful within PDI, then verify and validate the application by running the Spark-submit command line tool in a Command Prompt or Terminal window on the same machine that is running PDI.
If you want to view and track the Spark jobs that you have submitted, you can use the YARN ResourceManager Web UI to review the resources that were used, the duration, and access to additional log info.

Running a Spark job from a Windows machine

The following errors may occur when running a Spark Submit job from a Windows machine:

ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver! - JobTracker's log
Stack trace: ExitCodeException exitCode=10 - Spoon log

To resolve these errors, create a new rule in the Windows firewall settings to enable inbound connections from the cluster.

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.