Pentaho Documentation

Set up the Adaptive Execution Layer (AEL)


Pentaho uses the Adaptive Execution Layer for running transformations on the Spark distributed compute engine. AEL adapts steps from a transformation developed in PDI to Spark-native operators. The AEL daemon builds a transformation definition in Spark, which moves execution directly to the cluster.

Your installation of Pentaho includes the AEL daemon, which you can set up for production to run on your clusters. After you configure the AEL daemon, the PDI client communicates with both your Spark cluster and the AEL daemon, which lives on a node of your cluster, to launch and run transformations.

Before you can select the Spark engine through run configurations, you will need to configure AEL for your system and your workflow. Depending on your deployment, you may need to perform additional configuration tasks, such as setting up AEL in a secure cluster.

AEL runs PDI transformations in a Spark-centric manner, which is documented for each step that uses the Spark engine.

Before you begin

You must meet the following requirements for using the AEL daemon and operating the Spark engine for transformations:

Caution: Spark does not support having Hadoop 2.x JAR files and Hadoop 3.x JAR files on the same class path as your Spark application. Do not mix and match Hadoop versions.
Note: The dependency on ZooKeeper was removed in Pentaho 8.0. If you installed AEL for Pentaho 7.1, you must delete the adaptive-execution folder and follow the Pentaho 8.0 or later installation instructions to use AEL with Pentaho 8.0 or later.

Pentaho installation

When you install the Pentaho Server, the AEL daemon is installed in the folder data-integration/adaptive-execution. This folder will be referred to as PDI_AEL_DAEMON_HOME.

Spark client

The Spark client is required for the operation of the AEL daemon. The recommended versions of the Apache Spark client are 2.3 and later.

Verify if you already have a supported Spark client installed. Perform the following steps if you need to install a Spark client:

Procedure

  1. Download the Spark client for your environment from http://spark.apache.org/downloads.html.

    For example, download spark-2.3.0-bin-hadoop2.7.tgz if you are using Spark 2.3 on Hadoop 2.7.
  2. Extract the downloaded TGZ file to a folder on the edge node running the AEL daemon.

    This folder will be referred to as the SPARK_HOME variable.
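The two steps above can be sketched as a shell session. The version numbers and the /opt install location are assumptions; adjust them to your environment, and note that the download and extract commands are shown commented out because they require network and file system access on your edge node.

```shell
# Assumed Spark/Hadoop versions and install location; adjust as needed.
SPARK_VERSION=2.3.0
HADOOP_VERSION=2.7
SPARK_PACKAGE="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"

# 1. Download the Spark client for your environment:
# wget "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz"

# 2. Extract the TGZ file on the edge node running the AEL daemon:
# tar -xzf "${SPARK_PACKAGE}.tgz" -C /opt

# The extracted folder is what this article refers to as SPARK_HOME:
export SPARK_HOME="/opt/${SPARK_PACKAGE}"
echo "$SPARK_HOME"
```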

Pentaho Spark application

The Pentaho Spark application is built upon PDI's Pentaho execution engine, which allows you to develop Spark applications with familiar Pentaho tools. Some third-party plugins, such as those available in the Pentaho Marketplace, may not be included by default in the Pentaho Spark application. To address this, the Spark application builder tool lets you customize the Pentaho Spark application by adding or removing components to fit your needs.

After running the Spark application builder tool, copy and unzip the resulting pdi-spark-driver.zip file to an edge node in your Hadoop cluster. The unpacked contents consist of the data-integration folder and the pdi-spark-executor.zip file, which includes only the required libraries needed by the Spark nodes themselves to execute a transformation when the AEL daemon is configured to run in YARN mode. Since the pdi-spark-executor.zip file needs to be accessible by all nodes in the cluster, it must be copied into HDFS. Spark distributes this ZIP file to other nodes and then automatically extracts it.

Perform the following steps to run the Spark application build tool and manage the resulting files:

Procedure

  1. Ensure that you have configured your PDI client with all the plugins that you will use.

  2. Navigate to the design-tools/data-integration folder and locate the spark-app-builder.bat (Windows) or the spark-app-builder.sh (Linux).

  3. Execute the Spark application builder tool script.

    A console window displays, and the pdi-spark-driver.zip file is created in the data-integration folder (unless otherwise specified by the --outputLocation parameter described below).

    The following parameters can be used when running the script to build the pdi-spark-driver.zip:

    • -h or --help: Displays the help.
    • -e or --exclude-plugins: Specifies plugins from the data-integration/plugins folder to exclude from the assembly.
    • -o or --outputLocation: Specifies the output location.

    The pdi-spark-driver.zip file contains a data-integration folder and pdi-spark-executor.zip file.

  4. Copy the data-integration folder to the edge node where you want to run the AEL daemon.

  5. Copy the pdi-spark-executor.zip file to the HDFS node where you will run Spark.

    This folder will be referred to as HDFS_SPARK_EXECUTOR_LOCATION.
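The build-and-deploy flow above can be sketched as a shell session. The output location, edge-node user and host, and HDFS path are all hypothetical, and the commands that need your cluster are commented out:

```shell
# Hypothetical paths; adjust to your installation.
PDI_HOME=design-tools/data-integration
HDFS_SPARK_EXECUTOR_LOCATION=/opt/pentaho/pdi-spark-executor.zip

# 1. Build the Pentaho Spark application from the PDI client machine:
# "${PDI_HOME}/spark-app-builder.sh" --outputLocation /tmp/ael-build

# 2. Unzip the driver and copy the data-integration folder to the AEL edge node:
# unzip /tmp/ael-build/pdi-spark-driver.zip -d /tmp/ael-build
# scp -r /tmp/ael-build/data-integration aeluser@edge-node:/home/aeluser/

# 3. Put pdi-spark-executor.zip into HDFS so every cluster node can reach it:
# hdfs dfs -put /tmp/ael-build/pdi-spark-executor.zip "$HDFS_SPARK_EXECUTOR_LOCATION"
echo "$HDFS_SPARK_EXECUTOR_LOCATION"
```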

Next steps

Note: For the cluster nodes to use the functionality provided by PDI plugins when executing a transformation, the plugins must be installed in the PDI client before you generate the Pentaho Spark application. If you install other plugins later, you must regenerate the Pentaho Spark application.

Configure the AEL daemon for local mode

You can configure the AEL daemon to run in Spark local mode for development or demonstration purposes. In local mode, you can build and test a Spark application on your desktop with sample data, then reconfigure the application to run on your clusters.
Note: Configuring the AEL daemon to run in Spark local mode is not supported, but it can be useful for development and debugging.

To configure the AEL daemon for local mode, complete the following steps:

Procedure

  1. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.

  2. Set the following properties for your environment:

    1. Set the sparkHome property to the path of your Spark 2 client on your local machine.

    2. Set the sparkApp property to the data-integration directory.

    3. Set the hadoopConfDir property to the directory containing the *site.xml files.

  3. Save and close the file.

  4. Navigate to the data-integration/adaptive-execution folder and run the daemon.sh command from the command line interface.
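A minimal local-mode application.properties, based on the steps above, might contain lines like the following; all paths are placeholders for your environment:

```
# Spark local mode (development and debugging only)
sparkHome=/opt/spark-2.3.0-bin-hadoop2.7
sparkApp=/home/devuser/data-integration
hadoopConfDir=/etc/hadoop/conf
```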

Configure the AEL daemon for YARN mode

Typically, the AEL daemon is run in YARN mode for production purposes. In YARN mode, the driver application launches and delegates work to the YARN cluster. The pdi-spark-executor application must be installed on each of the YARN nodes.

The daemon.sh script is only supported in UNIX-based environments.

Note: Because of limitations for CDS Powered by Apache Spark in CDH 6.1, AEL does not support Hive or Impala in YARN mode. For specific information, see the Cloudera documentation.

To configure the AEL daemon for a YARN production environment, complete the following steps.

Procedure

  1. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.

  2. Set the following properties for your environment:

    • websocketURL: The fully qualified domain name of the node where the AEL daemon is installed. The following command is an example of how to obtain the fully qualified name:

    [devuser@hito31-n2 ~]$ hostname -f
    hito31-n2.cs1cloud.internal

      An example value is websocketURL=ws://localhost:${ael.unencrypted.port}.

    • sparkHome: The path to the Spark client folder on your cluster.
    • sparkApp: The data-integration directory.
    • hadoopConfDir: The directory containing the *site.xml files. This property value tells Spark which Hadoop/YARN cluster to use. You can download the directory containing the *site.xml files using the cluster management tool, or you can set the hadoopConfDir property to their location on the cluster.
    • hadoopUser: The user ID the Spark application will use. This user must have permission to access files in the Hadoop file system.
    • hbaseConfDir: The directory containing the hbase-site.xml file. This property value tells Spark how HBase is configured for your cluster. You can download the directory containing the hbase-site.xml file using the cluster management tool, or you can set the hbaseConfDir property to its location on the cluster.
    • sparkMaster: Set this value to yarn.
    • sparkDeployMode: Set this value to client.

      Note: The YARN cluster deployment mode is not supported by AEL.

    • assemblyZip: Set this value to hdfs:$HDFS_SPARK_EXECUTOR_LOCATION.
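Putting the properties above together, a hypothetical YARN-mode application.properties might look like the following; host names, user names, and paths are placeholders for your environment:

```
websocketURL=ws://ael-node.example.com:${ael.unencrypted.port}
sparkHome=/opt/spark-2.3.0-bin-hadoop2.7
sparkApp=/home/devuser/data-integration
hadoopConfDir=/etc/hadoop/conf
hadoopUser=devuser
hbaseConfDir=/etc/hbase/conf
sparkMaster=yarn
sparkDeployMode=client
assemblyZip=hdfs:/opt/pentaho/pdi-spark-executor.zip
```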
  3. Save and close the file.

  4. Copy the pdi-spark-executor.zip file to your HDFS cluster, as shown in the following example:

    $ hdfs dfs -put pdi-spark-executor.zip /opt/pentaho/pdi-spark-executor.zip
  5. Perform the following steps to start the AEL daemon.

    You can start the AEL daemon by running the daemon.sh script. By default, this startup script is installed in the data-integration/adaptive-execution folder, which is referred to as the variable PDI_AEL_DAEMON_HOME.
    1. Navigate to the data-integration/adaptive-execution directory.

    2. Run the daemon.sh script.

      The daemon.sh script supports the following commands:

      • daemon.sh: Starts the daemon as a foreground process.
      • daemon.sh start: Starts the daemon as a background process. Logs are written to the PDI_AEL_DAEMON_HOME/daemon.log file.
      • daemon.sh stop: Stops the daemon.
      • daemon.sh status: Reports the status of the daemon.

Configure event logging

Spark events can be captured in an event log that can be viewed with the Spark History Server. The Spark History Server is a web browser-based user interface to the event log. You can view either running or completed Spark transformations using the Spark History Server. Before you can use the Spark History Server, you must configure AEL to log the events.

Perform the following tasks to configure AEL to log events:

Procedure

  1. Have your cluster administrator enable the Spark History Server on your cluster and give you the location of the Spark event log directory.

  2. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.

  3. Set the sparkEventLogEnabled property to true.

    If this field is missing or set to false, Spark does not log events.
  4. Set the sparkEventLogDir property to a directory where you want to store the log.

    This location can either be a file system directory (for example, file:///users/home/spark-events) or an HDFS directory (for example, hdfs:/users/home/spark-events).
  5. Set the spark.history.fs.logDirectory property to point to the same event log directory you configured in the previous step.
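For example, the three event-logging properties from the steps above might be set as follows; the log directory is a placeholder for your environment:

```
sparkEventLogEnabled=true
sparkEventLogDir=hdfs:/user/devuser/spark-events
spark.history.fs.logDirectory=hdfs:/user/devuser/spark-events
```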

Results

You can now view Spark-specific information for your PDI transformations using the Spark History Server.

Vendor-supplied clients

Additional configuration steps may be required when using AEL with a vendor’s version of the Spark client.

Cloudera

If your Cloudera Spark client does not contain the Hadoop libraries, you must add the Hadoop libraries to the classpath using the SPARK_DIST_CLASSPATH environment variable, as shown in the following example command:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Note: Because of limitations for CDS Powered by Apache Spark in CDH 6.1, AEL does not support Hive or Impala in YARN mode. For specific information, see the Cloudera documentation.

Hortonworks

You can use multiple vendor versions of the Hortonworks Data Platform (HDP) with the PDI client. To use the vendor’s version of the Spark client with Hive Warehouse Connector (HWC) on HDP 3.x platforms, you must configure the AEL daemon for the Hive Warehouse Connector.

To use HBase with AEL and HDP, you must add copies of HBase JAR files to your PDI distribution.

Use HBase with AEL and HDP

To use HBase with AEL and HDP, you must add the HBase JAR files to PDI.

Perform the following steps to add the HBase JAR files:

Procedure

  1. Copy the following files from the /usr/hdp/3.0.0.0-1634/hbase/lib/ directory of your cluster.

    • hbase-client-2.0.0.3.0.0.0-1634.jar
    • hbase-common-2.0.0.3.0.0.0-1634.jar
    • hbase-hadoop-compat-2.0.0.3.0.0.0-1634.jar
    • hbase-mapreduce-2.0.0.3.0.0.0-1634.jar
    • hbase-protocol-2.0.0.3.0.0.0-1634.jar
    • hbase-protocol-shaded-2.0.0.3.0.0.0-1634.jar
    • hbase-server-2.0.0.3.0.0.0-1634.jar
    • hbase-shaded-miscellaneous-2.1.0.jar
    • hbase-shaded-netty-2.1.0.jar
    • hbase-shaded-protobuf-2.1.0.jar
  2. Follow the instructions in Set up the vendor-specified JARs to install the files.

Configuring the AEL daemon for the Hive Warehouse Connector

You can use PDI with the Hive Warehouse Connector (HWC) to access Hive managed tables in ORC format on HDP 3.x platforms. You can leverage the fine-grained access controls and the low-latency analytical processing (LLAP) queue by configuring the application.properties file of the AEL daemon.

Before you begin

You will need to perform the following tasks:

  1. Get connection information by downloading and installing Apache Ambari from https://ambari.apache.org/.
  2. Determine LLAP sizing and setup needed for your Hive LLAP daemon. See https://community.cloudera.com/t5/Community-Articles/Hive-LLAP-deep-dive/ta-p/248893 and https://community.cloudera.com/t5/Community-Articles/LLAP-sizing-and-setup/ta-p/247425 for instructions.
  3. Set up the Hive LLAP queue on your HDP cluster. See https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/performance-tuning/content/hive_setting_up_llap.html for instructions.

Configure the AEL daemon for the Hive Warehouse Connector

You will need to set up the AEL daemon for HWC.

Note: AEL allows you to configure either the Hive Warehouse Connector (HWC) or the JDBC driver option for cluster management. The JDBC driver option is the default configuration.

Perform the following steps.

Procedure

  1. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file with any text editor.

  2. Set the values for your environment as shown in the following table.

    • enableHiveConnection: Enables AEL access to Hive tables. Set this value to true.
    • hiveMetastoreUris: Identifies the location of the Hive metastore. Set this value to thrift://<fully qualified hostname>:9083.
    • spark.sql.hive.hiveserver2.jdbc.url: Identifies the location of the interactive service. Use the value found at Ambari Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.
    • spark.datasource.hive.warehouse.metastoreUri: Identifies the location of the Hive metastore. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > General > hive.metastore.uris.
    • spark.datasource.hive.warehouse.load.staging.dir: Determines the HDFS temporary directory used for batch writing to Hive. Set this value to /tmp.

      Note: Ensure that your HWC users have permissions for this directory.

    • spark.hadoop.hive.llap.daemon.service.hosts: Specifies the name of the LLAP queue. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > Advanced hive-interactive-site > hive.llap.daemon.service.hosts.
    • spark.hadoop.hive.zookeeper.quorum: Provides the Hive endpoint to access the Hive tables. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > Advanced hive-site > hive.zookeeper.quorum.
    • spark.driver.extraClassPath: Specifies the path to the directory containing the hive-site.xml file on the driver node. The hive-site.xml file is loaded as a resource in the driver. This resource defines the Hive endpoints and security settings required by AEL to access the Hive subsystem.
    • spark.executor.extraClassPath: Specifies the path to the directory containing the hive-site.xml file on the executor nodes. The hive-site.xml file is loaded as a resource on each executor. This resource defines the Hive endpoints and security settings required by AEL to access the Hive subsystem.
    The following lines of code show sample values for these parameters:
    # AEL Spark Hive Property Settings
    enableHiveConnection=true
    spark.driver.extraClassPath=/usr/hdp/current/spark2-client/conf/
    spark.executor.extraClassPath=/usr/hdp/current/spark2-client/conf/
    spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://hito31-n3.cs1cloud.internal:2181,hito31-n2.cs1cloud.internal:2181,hito31-n1.cs1cloud.internal:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
    spark.datasource.hive.warehouse.metastoreUri=thrift://hito31-n2.cs1cloud.internal:9083
    spark.datasource.hive.warehouse.load.staging.dir=/user/devuser/tmp
    spark.hadoop.hive.llap.daemon.service.hosts=@llap0
    spark.hadoop.hive.zookeeper.quorum=hito31-n3.cs1cloud.internal:2181,hito31-n2.cs1cloud.internal:2181,hito31-n1.cs1cloud.internal:2181
  3. Save and close the file.

  4. Create a symbolic link to the HWC JAR file in the /data-integration/adaptive-execution/extra directory. For example, if you are in the extra directory, the following command will create this link:

    ln -s /usr/hdp/current/hivewarehouseconnector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar /<user_name>/data-integration/adaptive-execution/extra/
  5. Restart the AEL daemon.

  6. Stop, then start the Pentaho Server.

Results

You can now use PDI with HWC to manage Hive tables on HDP 3.x clusters.

Amazon EMR

If you plan to use AEL with Amazon EMR, note the following conditions:

  • To use Amazon EMR with AEL, you must install the Linux LZO compression library. See LZO support for more information.
  • To use Amazon EMR with AEL and Hive, you must Configure the AEL daemon for Hive.
  • To use the HBase Input and HBase Output steps with AEL and Amazon EMR, see Using HBase steps with Amazon EMR 5.21.
  • Because of limitations in Amazon EMR 4.0 and later, Impala is not supported on Spark.
    Note: Impala is not available as a download on the EMR Cluster configuration menu.

LZO support

LZO is a compression format supported by Amazon EMR, and it is required for running AEL on EMR. To configure LZO compression, you will need to add several properties.

Procedure

  1. Follow the instructions available here to install the Linux LZO compression library from the command line: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_command-line-installation/content/install_compression_libraries.html

  2. Navigate to the data-integration/adaptive-execution/config/ directory and open the application.properties file.

  3. Add the following properties:

    • spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
    • spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
  4. Append -Djava.library.path=/usr/lib/hadoop-lzo/lib/native to the end of each of the following properties:

    • sparkExecutorExtraJavaOptions
    • sparkDriverExtraJavaOptions
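After steps 3 and 4, the LZO-related lines in application.properties would look similar to the following; any Java options already present on those lines are kept, with the library path appended:

```
spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
sparkExecutorExtraJavaOptions=-Djava.library.path=/usr/lib/hadoop-lzo/lib/native
sparkDriverExtraJavaOptions=-Djava.library.path=/usr/lib/hadoop-lzo/lib/native
```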
  5. Save and close the file.

Use HBase with AEL and Amazon EMR

To use HBase with AEL and Amazon EMR, you must add the HBase libraries to the classpath.

Perform the following steps to add the HBase libraries:

Procedure

  1. Stop the AEL daemon.

  2. From a command prompt (terminal window) on the cluster, run the following command:

    export SPARK_DIST_CLASSPATH=$(hbase classpath)
  3. Start the AEL daemon.

Configure the AEL daemon for Hive

You can use PDI with Hive to access Hive managed and unmanaged tables in ORC and Parquet formats on your Amazon EMR cluster. You can leverage Hive in the Table Input and Table Output steps by configuring the application.properties file of the AEL daemon.

Perform the following steps.

Procedure

  1. Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file with any text editor.

  2. Set the values for your environment as shown in the following table.

    • enableHiveConnection: Enables AEL access to Hive tables. Set this value to true.
    • spark.driver.extraClassPath: Specifies the path to the directory containing the hive-site.xml file on the driver node. The hive-site.xml file is loaded as a resource in the driver. This resource defines the Hive endpoints and security settings required by AEL to access the Hive subsystem.
    • spark.executor.extraClassPath: Specifies the path to the directory containing the hive-site.xml file on the executor nodes. The hive-site.xml file is loaded as a resource on each executor. This resource defines the Hive endpoints and security settings required by AEL to access the Hive subsystem.
    The following lines of code show sample values for these parameters:
    # AEL Spark Hive Property Settings
    enableHiveConnection=true
    spark.driver.extraClassPath=/etc/spark/conf.dist/
    spark.executor.extraClassPath=/etc/spark/conf.dist/
  3. Save and close the file.

  4. Restart the AEL daemon.

  5. Stop, then start the Pentaho Server.

Results

You can now use PDI with Hive to manage Hive tables on your EMR clusters.

Google Cloud Storage

This configuration task is intended for Pentaho administrators and Hadoop cluster administrators who want to set up access to Google Cloud Storage (GCS) for PDI transformations running on Spark.

This task assumes that you have obtained the settings for your site's Google Cloud Storage (GCS) configuration from your Hadoop cluster administrator.

Perform the following steps to set up Hadoop cluster access to GCS:

Procedure

  1. Log on to the cluster and stop the AEL daemon by running the shutdown script, daemon.sh stop, from the command line interface.

  2. Download the GCS Hadoop Connector JAR file and save it in a location where you can access it. You can use the following UNIX command to download the GCS Hadoop Connector JAR file:

    wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar
  3. Use the following command to add the GCS Hadoop Connector JAR file to the SPARK_DIST_CLASSPATH where /full/path/to is the location where you stored the JAR file:

    export SPARK_DIST_CLASSPATH=$(hadoop classpath):/full/path/to/gcs-connector-hadoop2-latest.jar
  4. Configure your clusters with the GCS connector with Hadoop/Spark using the instructions located in the Google Cloud Platform interoperability GitHub repository: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md

  5. Configure AEL to use the GCS Hadoop Connector in one of the following ways.

    • Adding the GCS properties to the /etc/hadoop/conf/core-site.xml file.
    • Adding JSON keyfile parameters for GCS to the AEL daemon application.properties file. Follow the instructions in Step 6.
      Note: The JSON keyfile for GCS must be present on all the nodes in the cluster.
  6. (Optional) If you choose to add a JSON keyfile to the application.properties file, follow these steps.

    1. Navigate to the data-integration/adaptive-execution/config/ directory and open the application.properties file with any text editor.

    2. Add the following lines of code:

      • spark.hadoop.google.cloud.auth.service.account.enable=true
      • spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile.json
    3. Save the file and close it.

  7. Restart the AEL daemon by running the startup script, daemon.sh, from the command line interface.

Advanced topics

The following topics help to extend your knowledge of the Adaptive Execution Layer beyond basic setup and use:

  • Spark Tuning

    You can customize PDI transformation and step parameters to improve the performance of running your PDI transformations on Spark. These parameters affect memory, cores, and instances used by the Spark engine. These Spark parameters include both application parameters and Spark tuning parameters.

  • Configuring AEL with Spark in a secure cluster

    If your AEL daemon server and your cluster machines are in a secure environment like a data center, you may only want to configure a secure connection between the PDI client and the AEL daemon server.

Troubleshooting

See our list of common problems and resolutions.