Set up the Adaptive Execution Layer (AEL)
Pentaho uses the Adaptive Execution Layer for running transformations in different engines. AEL adapts steps from a transformation developed in PDI to native operators in the engine you select for your environment, such as Spark in a Hadoop cluster. The AEL daemon builds a transformation definition in Spark, which moves execution directly to the cluster.
Your installation of Pentaho includes the AEL daemon, which you can set up for production to run on your clusters. After you configure the AEL daemon, the PDI client communicates with both your Spark cluster and the AEL daemon, which resides on a node of your cluster, to launch and run transformations.
Before you can select the Spark engine through run configurations, you will need to configure AEL for your system and your workflow. Depending on your deployment, you may need to perform additional configuration tasks, such as setting up AEL in a secure cluster.
Before you begin
You must meet the following requirements for using the AEL daemon and operating an alternative engine for transformations:
- Pentaho 8.3 or later installation. See Pentaho installation.
- Cloudera 5.13 or later or Hortonworks 2.6 or later distribution of Hadoop.
- Spark Client 2.2 or later.
- Pentaho Spark application.
- If you are configuring AEL for use with Cloudera, Hortonworks, MapR, or Amazon EMR, review Vendor-Supplied Clients.
Pentaho installation
When you install the Pentaho Server, the AEL daemon is installed in the folder data-integration/adaptive-execution. This folder will be referred to as PDI_AEL_DAEMON_HOME.
Spark client
Perform the following steps to install the Spark client:
Procedure
Download the Spark client, spark-2.2.0-bin-hadoop2.7.tgz, from http://spark.apache.org/downloads.html.
Extract it to a folder on the cluster where the daemon can access it. This folder will be referred to as the variable SPARK_HOME.
Pentaho Spark application
After running the Spark application builder tool, copy and unzip the resulting pdi-spark-driver.zip file to an edge node in your Hadoop cluster. The unpacked contents consist of the data-integration folder and the pdi-spark-executor.zip file, which includes only the libraries the Spark nodes need to execute a transformation when the AEL daemon is configured to run in YARN mode. Because the pdi-spark-executor.zip file must be accessible by all nodes in the cluster, copy it into HDFS. Spark distributes this ZIP file to the other nodes and then automatically extracts it.
Perform the following steps to run the Spark application build tool and manage the resulting files:
Procedure
Ensure that you have configured your PDI client with all the plugins that you will use.
Navigate to the design-tools/data-integration folder and locate the spark-app-builder.bat (Windows) or the spark-app-builder.sh (Linux).
Execute the Spark application builder tool script. A console window will display and the pdi-spark-driver.zip file will be created in the data-integration folder (unless otherwise specified by the -outputLocation parameter described below).
The following parameters can be used when running the script to build the pdi-spark-driver.zip file:

- -h or --help: Displays help for the script.
- -e or --exclude-plugins: Specifies plugins from the data-integration/plugins folder to exclude from the assembly.
- -o or --outputLocation: Specifies the output location.

The pdi-spark-driver.zip file contains a data-integration folder and the pdi-spark-executor.zip file. Copy the data-integration folder to the edge node where you want to run the AEL daemon.
Copy the pdi-spark-executor.zip file to the HDFS node where you will run Spark. This folder will be referred to as HDFS_SPARK_EXECUTOR_LOCATION.
Configure the AEL daemon for local mode
You can configure the AEL daemon to run in Spark local mode for development or demonstration purposes. This will let you build and test a Spark application on your desktop with sample data, then reconfigure the application to run on your clusters.
To configure the AEL daemon for local mode, complete the following steps:
Procedure
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.
Set the following properties for your environment:
Set the sparkHome property to the Spark 2 filepath on your local machine.
Set the sparkApp property to the data-integration directory.
Set the hadoopConfDir property to the directory containing the *site.xml files.
Save and close the file.
Navigate to the data-integration/adaptive-execution folder and run the daemon.sh command from the command line interface.
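Taken together, the local-mode entries described above might look like the following sketch in application.properties; every path here is illustrative and must be replaced with the actual locations on your machine:

```properties
# Illustrative local-mode settings; replace each path with your own.
sparkHome=/opt/spark-2.2.0-bin-hadoop2.7
sparkApp=/opt/pentaho/design-tools/data-integration
hadoopConfDir=/etc/hadoop/conf
```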
Configure the AEL daemon for YARN mode
Note: The daemon.sh script is supported only in UNIX-based environments.
To configure the AEL daemon for a YARN production environment, complete the following steps.
Procedure
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.
Set the following properties for your environment:
- websocketURL: The fully qualified domain name of the node where the AEL daemon is installed. For example, websocketURL=ws://localhost:${ael.unencrypted.port}
- sparkHome: The path to the Spark client folder on your cluster.
- sparkApp: The data-integration directory.
- hadoopConfDir: The directory containing the *site.xml files. This property value tells Spark which Hadoop/YARN cluster to use. You can download the directory containing the *site.xml files using the cluster management tool, or you can set the hadoopConfDir property to its location on the cluster.
- hadoopUser: The user ID the Spark application will use, if you are not using security.
- sparkMaster: yarn
- assemblyZip: hdfs:$HDFS_SPARK_EXECUTOR_LOCATION

Save and close the file.
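A complete YARN-mode fragment of application.properties might look like the following sketch; the hostname, paths, and user here are placeholders for your environment:

```properties
# Placeholder values; substitute your own hostname, paths, and user.
websocketURL=ws://ael-node.example.com:${ael.unencrypted.port}
sparkHome=/opt/spark-2.2.0-bin-hadoop2.7
sparkApp=/opt/pentaho/data-integration
hadoopConfDir=/etc/hadoop/conf
hadoopUser=devuser
sparkMaster=yarn
assemblyZip=hdfs:/opt/pentaho/pdi-spark-executor.zip
```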
Copy the pdi-spark-executor.zip file to your HDFS cluster, as in the example below.
$ hdfs dfs -put pdi-spark-executor.zip /opt/pentaho/pdi-spark-executor.zip
Run the pdi-daemon startup script, daemon.sh, from the command line interface.
(Optional) Perform the following steps to manually start the AEL daemon.
You can manually start the AEL daemon by running the daemon.sh script. By default, this startup script is installed in the data-integration/adaptive-execution folder, which is referred to as the variable PDI_AEL_DAEMON_HOME.
Navigate to the data-integration/adaptive-execution directory.
Run the daemon.sh script.
The daemon.sh script supports the following commands:

- daemon.sh: Starts the daemon as a foreground process.
- daemon.sh start: Starts the daemon as a background process. Logs are written to the PDI_AEL_DAEMON_HOME/daemon.log file.
- daemon.sh stop: Stops the daemon.
- daemon.sh status: Reports the status of the daemon.
Configure event logging
Perform the following tasks to configure AEL to log events:
Procedure
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.
Set the sparkEventLogEnabled property to true.
If this property is missing or set to false, Spark does not log events.
Set the sparkEventLogDir property to a directory where you want to store the log. This can be either a file system directory (for example, file:///users/home/spark-events) or an HDFS directory (for example, hdfs:/users/home/spark-events).
Set the spark.history.fs.logDirectory property to point to the same event log directory you configured in the previous step.
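Assembled in application.properties, the event-logging settings above might look like this sketch (the HDFS directory is an example):

```properties
# Example event-logging configuration; adjust the directory to your environment.
sparkEventLogEnabled=true
sparkEventLogDir=hdfs:/users/home/spark-events
spark.history.fs.logDirectory=hdfs:/users/home/spark-events
```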
Vendor-supplied clients
Additional configuration steps may be required when using AEL with a vendor’s version of the Spark client.
Cloudera
If your Cloudera Spark client does not contain the Hadoop libraries, you must add the Hadoop libraries to the classpath using the SPARK_DIST_CLASSPATH environment variable, as shown in the following example command:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Hortonworks
You can use multiple vendor versions of the Hortonworks Data Platform (HDP) with the PDI client. Note the following guidelines:
- To use the vendor’s version of the Spark client with Hive Warehouse Connector (HWC) on HDP 3.x platforms, you must configure the AEL daemon for the Hive Warehouse Connector.
- To use a vendor’s version of the Spark client with HDP, you must export the version as outlined in Export the Hortonworks Data Platform vendor version variable.
Configuring the AEL daemon for the Hive Warehouse Connector
You can use PDI with the Hive Warehouse Connector (HWC) to access Hive managed tables in ORC format on HDP 3.x platforms. You can leverage the fine-grained access controls and the low-latency analytical processing (LLAP) queue by configuring the application.properties file of the AEL daemon.
Before you begin
You will need to perform the following tasks:
- Get connection information by downloading and installing Apache Ambari from https://ambari.apache.org/.
- Determine LLAP sizing and setup needed for your Hive LLAP daemon. See https://community.cloudera.com/t5/Community-Articles/Hive-LLAP-deep-dive/ta-p/248893 and https://community.cloudera.com/t5/Community-Articles/LLAP-sizing-and-setup/ta-p/247425 for instructions.
- Set up the Hive LLAP queue on your HDP cluster. See https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/performance-tuning/content/hive_setting_up_llap.html for instructions.
Configure the AEL daemon for the Hive Warehouse Connector
Perform the following steps.
Procedure
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file with any text editor.
Set the values for your environment as shown in the following list:

- enableHiveConnection: Enables AEL access to Hive tables. Set this value to true.
- enableHiveWarehouseConnector: Enables the use of HWC by AEL. Set this value to true.
- hiveMetastoreUris: Identifies the location of the Hive metastore. Set this value to thrift://<fully qualified hostname>:9083.
- spark.sql.hive.hiveserver2.jdbc.url: Identifies the location of the interactive service. Use the value found at Ambari Services > Hive > Summary > HIVESERVER2 INTERACTIVE JDBC URL.
- spark.datasource.hive.warehouse.metastoreUri: Identifies the location of the Hive metastore. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > General > hive.metastore.uris.
- spark.datasource.hive.warehouse.load.staging.dir: Determines the HDFS temporary directory used for batch writing to Hive. Set this value to /tmp. Note: Ensure that your HWC users have permissions for this directory.
- spark.hadoop.hive.llap.daemon.service.hosts: Specifies the name of the LLAP queue. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > Advanced hive-interactive-site > hive.llap.daemon.service.hosts.
- spark.hadoop.hive.zookeeper.quorum: Provides the Hive endpoint to access the Hive tables. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > Advanced hive-site > hive.zookeeper.quorum.

If you are running on a YARN cluster with Kerberos, set the spark.sql.hive.hiveserver2.jdbc.url.principal property to the Hive principal of the cluster. Use the value found at Ambari Services > Hive > CONFIGS > ADVANCED > Advanced hive-site > hive.server2.authentication.kerberos.principal.
Save and close the file.
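Combined in application.properties, the HWC settings above might look like this sketch; the hostnames, JDBC URL, and LLAP queue name are placeholders that you must replace with the values taken from Ambari:

```properties
# Placeholder values; copy the real values from Ambari as described above.
enableHiveConnection=true
enableHiveWarehouseConnector=true
hiveMetastoreUris=thrift://hive-host.example.com:9083
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://hive-host.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
spark.datasource.hive.warehouse.metastoreUri=thrift://hive-host.example.com:9083
spark.datasource.hive.warehouse.load.staging.dir=/tmp
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=zk1.example.com:2181,zk2.example.com:2181
```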
Create a symbolic link to the HWC JAR file in the /data-integration/adaptive-execution/extra directory. For example, if you are in the extra directory, the following command creates this link:

ln -s /usr/hdp/current/hivewarehouseconnector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar /<user_name>/data-integration/adaptive-execution/extra/

Restart the AEL daemon.
Export the HDP_VERSION variable on the edge node where your Pentaho Server resides, as described in Export the Hortonworks Data Platform vendor version variable.
Export the Hortonworks Data Platform vendor version variable
The HDP version on the edge node where your Pentaho Server resides must be the same version used on your cluster. If the versions are different, then the AEL daemon and the PDI client will stop working. To prevent this from happening, you must export the HDP_VERSION variable. For example:
export HDP_VERSION=${HDP_VERSION:-2.6.0.3-8}
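The export command above uses the shell's ${VAR:-default} parameter expansion: a value of HDP_VERSION already set in the environment is kept, and the literal default (2.6.0.3-8 in this example) is used only when the variable is unset or empty:

```shell
# Keep any HDP_VERSION already set in the environment; otherwise fall back
# to the example default. Replace 2.6.0.3-8 with the version your cluster reports.
export HDP_VERSION=${HDP_VERSION:-2.6.0.3-8}
echo "$HDP_VERSION"   # prints 2.6.0.3-8 when HDP_VERSION was previously unset
```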
You can check the HDP version on your cluster using the following command:
hdp-select status hadoop-client
MapR
Procedure
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file.
Set the following properties for your MapR Spark 2.2 environment:

- hadoopConfDir: This property identifies the Hadoop cluster that Spark will use. Because MapR identifies the Hadoop cluster by default, set the property value to empty, as shown here:

  hadoopConfDir=""

- -Dhadoop.login: This property identifies the security environment that the Hadoop cluster will use. If you enable security, the value of the MAPR_ECOSYSTEM_LOGIN_OPTS environment variable will include the hybrid JVM option for the hadoop.login property. Set the property value to hybrid to specify a mixed security environment using Kerberos and internal MapR security technologies, as shown here:

  -Dhadoop.login=hybrid

- -Djava.security.auth.login.config: This property identifies the configuration file to use when you enable security. The MapR distribution for Hadoop uses the Java Authentication and Authorization Service (JAAS) to control security features. The /opt/mapr/conf/mapr.login.conf file specifies configuration parameters for JAAS. Set the property value to /opt/mapr/conf/mapr.login.conf, as shown here:

  -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf

Save and close the file.
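For reference, the three MapR entries described above might appear together in application.properties as in this sketch:

```properties
# MapR Spark 2.2 example entries, combined from the steps above.
hadoopConfDir=""
-Dhadoop.login=hybrid
-Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf
```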
Before running the daemon, add the Hadoop libraries to the classpath by running the following command from the command prompt (terminal window) on the cluster:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
Amazon EMR
If you plan to use AEL with Amazon EMR, note the following conditions:
- To use Amazon EMR with AEL, you must install the Linux LZO compression library. See LZO support for more information.
- Because of limitations in Amazon EMR 4.0 and later, Impala is not supported on Spark. Note: Impala is not available as a download on the EMR Cluster configuration menu.
LZO support
Procedure
Follow the instructions available here to install the Linux LZO compression library from the command line: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_command-line-installation/content/install_compression_libraries.html
Navigate to the data-integration/adaptive-execution/config/ directory and open the application.properties file.
Add the following properties:
- spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
- spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
Append -Djava.library.path=/usr/lib/hadoop-lzo/lib/native to the end of each of the following properties:
- sparkExecutorExtraJavaOptions
- sparkDriverExtraJavaOptions
Save and close the file.
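After the edits above, the LZO-related portion of application.properties might look like this sketch; the existing values of the two JavaOptions properties vary by installation, so they are shown as a labeled placeholder:

```properties
# LZO classpath entries plus the appended native-library path.
# <existing options> stands for whatever values these properties already contain.
spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
sparkExecutorExtraJavaOptions=<existing options> -Djava.library.path=/usr/lib/hadoop-lzo/lib/native
sparkDriverExtraJavaOptions=<existing options> -Djava.library.path=/usr/lib/hadoop-lzo/lib/native
```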
Advanced topics
The following topics help to extend your knowledge of the Adaptive Execution Layer beyond basic setup and use:
- Specify additional Spark properties
You can define additional Spark properties within the application.properties file or as run modification parameters within a transformation.
- Configuring AEL with Spark in a secure cluster
If your AEL daemon server and your cluster machines are in a secure environment such as a data center, you may want to configure only a secure connection between the PDI client and the AEL daemon server.
Troubleshooting
See our list of common problems and resolutions.