Pentaho uses an Adaptive Execution Layer (AEL) for running transformations with Spark. AEL adapts steps from the transformation you develop in PDI to use the native operator functions in Spark. This adaptation is necessary because the Spark engine runs big data transformations in the Hadoop cluster differently than the Pentaho engine. For example, the Spark engine may not require some fields in the PDI step, or it may require a specific value for an option. Null values must also be adjusted, because Spark processes null values differently than the Pentaho engine.
Some PDI steps commonly used in big data transformations are coded specifically to the Spark APIs for improved performance when using the Spark engine. To see whether the step you want to use has been optimized for distributed processing with Spark, refer to the documentation for that step. You can also view the list of Recommended PDI steps to use with Spark on AEL.
To decide whether the Spark engine or the Pentaho engine is the best choice for your transformation, you must know what cluster resources you have and the size of your data sets.
Set Up AEL
You must configure AEL before you can use the Spark engine in the run configuration of your transformation. Refer your Pentaho or IT administrator to Setting Up the Adaptive Execution Layer for more details.
Once AEL is configured, you can select the Spark engine for the transformation. See Run Configurations for more details.
Vendor-specific setups for Spark
The following PDI big data steps have vendor-specific setups or specific vendor versions that are required when running the steps on Spark:
The following topics extend your knowledge of the Adaptive Execution Layer beyond basic setup and use:
- Specify Additional Spark Properties
You can define additional Spark properties within the application.properties file or as run modification parameters within a transformation.
- Configuring AEL with Spark in a
If your AEL daemon server and your cluster machines are in a secure environment, such as a data center, you may want to configure a secure connection only between the PDI client and the AEL daemon server.
- AEL logging
Pentaho provides logging for transformations and jobs that are executed on the Adaptive Execution Layer.
- Spark Tuning
Spark tuning parameters are available for PDI steps where they are functionally applicable.
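
As a rough illustration of the Specify Additional Spark Properties topic above, the following Python sketch shows what additional Spark properties in a properties-style file might look like and how they can be separated from other settings. It is an assumption-laden example, not a description of how AEL reads the application.properties file: the sample file content, the someOtherSetting key, and the helper function names are all hypothetical. Only the property names spark.executor.memory and spark.executor.cores and the spark-submit --conf key=value form come from Spark itself.

```python
# Hypothetical properties fragment. The two spark.* keys are standard Spark
# settings; the values and the non-Spark key are placeholders for illustration.
SAMPLE_PROPERTIES = """\
# Additional Spark properties (illustrative values only)
spark.executor.memory=4g
spark.executor.cores=2
someOtherSetting=value
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            props[key.strip()] = value.strip()
    return props


def spark_properties(props):
    """Keep only the entries whose key starts with 'spark.'."""
    return {k: v for k, v in props.items() if k.startswith("spark.")}


def as_conf_args(spark_props):
    """Render properties in the generic spark-submit '--conf key=value' form."""
    args = []
    for key in sorted(spark_props):
        args += ["--conf", f"{key}={spark_props[key]}"]
    return args


if __name__ == "__main__":
    selected = spark_properties(parse_properties(SAMPLE_PROPERTIES))
    print(as_conf_args(selected))
```

Check the Specify Additional Spark Properties topic for the property names and locations that your Pentaho version actually supports before relying on any of the values shown here.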