Setting Up the Adaptive Execution Layer (AEL)

Pentaho uses the Adaptive Execution Layer (AEL) for running transformations in different engines. AEL adapts steps from a transformation developed in PDI to native operators in the engine you select for your environment, such as Spark in a Hadoop cluster. The AEL daemon builds a transformation definition in Spark, which moves execution directly to the cluster.

Your installation of Pentaho 7.1 includes the AEL daemon, which you can set up to run on your clusters in production. After you configure the AEL daemon, the PDI client connects to the ZooKeeper server to communicate with both your Spark cluster and the AEL daemon, which lives on a node of your cluster and launches and runs transformations. For production, you will need to disable the embedded ZooKeeper server which ships with the product and set up AEL to use your own ZooKeeper server.

Before you can select the Spark engine through run configurations, you will need to configure AEL for your system and your workflow. Depending on your deployment, you may need to perform additional configuration tasks, such as pointing the AEL daemon to your own ZooKeeper server or setting up AEL in a secure cluster.

Before You Begin...

You must meet the following requirements for using the AEL daemon and operating the Spark engine for transformations:

Pentaho 7.1 Installation

When you install the Pentaho Server, the AEL daemon is installed in the folder data-integration/adaptive-execution. This folder will be referred to as 'PDI_AEL_DAEMON_HOME'.

Spark Client

The Spark client is required for the operation of the AEL daemon. Perform the following steps to install the Spark client.

  1. Download the Spark client, spark-2.1.0-bin-hadoop2.7.tgz, from http://spark.apache.org/downloads.html.
  2. Extract it to a folder where the AEL daemon can access it. This folder will be referred to as the variable 'SPARK_HOME'.
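
A minimal installation sketch on a Linux node, assuming the downloaded archive and /opt/pentaho as an illustrative target folder:

$ tar -xzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt/pentaho
$ export SPARK_HOME=/opt/pentaho/spark-2.1.0-bin-hadoop2.7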

Pentaho Spark Application

The Pentaho Spark application is built upon PDI's Kettle engine, which allows transformations to run unaltered within a Hadoop cluster. Some third-party plugins, such as those available in the Pentaho Marketplace, may not be included by default within the Pentaho Spark application. To address this, we include the Spark application builder tool so you can customize the Pentaho Spark application by adding or removing components to fit your needs.

After running the Spark application builder tool, copy the resulting pdi-spark-driver.zip file to an edge node in your Hadoop cluster and unzip it. The unpacked contents consist of the data-integration folder and the pdi-spark-executor.zip file, which includes only the libraries the Spark nodes need to execute a transformation when the AEL daemon is configured to run in YARN mode. Because the pdi-spark-executor.zip file must be accessible by all nodes in the cluster, copy it into HDFS.

Perform the following steps to run the Spark application builder tool and manage the resulting files.

  1. Ensure that you have configured your PDI client with all the plugins that you will use.
  2. Navigate to the design-tools/data-integration folder and locate the spark-app-builder.bat (Windows) or the spark-app-builder.sh (Linux) script.
  3. Execute the Spark application builder tool script. A console window will display and the pdi-spark-driver.zip file will be created in the data-integration folder (unless otherwise specified by the --outputLocation parameter described below).

    The following parameters can be used when running the script to build the pdi-spark-driver.zip.

    Parameter Action
    -h or --help Displays the help.
    -e or --exclude-plugins Specifies plugins from the data-integration/plugins folder to exclude from the assembly.
    -o or --outputLocation Specifies the output location.

  4. The pdi-spark-driver.zip file contains a data-integration folder and the pdi-spark-executor.zip file. Copy the data-integration folder to the edge node where you want to run the AEL daemon.
  5. Copy the pdi-spark-executor.zip file to the HDFS node where you will run Spark and extract the contents. This folder will be referred to as 'HDFS_SPARK_EXECUTOR_LOCATION'.
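
Put together, a typical build-and-deploy sequence might look like the following sketch; the output location, edge-node host, and HDFS path are illustrative:

$ cd design-tools/data-integration
$ ./spark-app-builder.sh -o /tmp/pdi-spark-build
$ scp /tmp/pdi-spark-build/pdi-spark-driver.zip user@edgenode:/opt/pentaho/

Then, on the edge node:

$ unzip /opt/pentaho/pdi-spark-driver.zip -d /opt/pentaho
$ hdfs dfs -put /opt/pentaho/pdi-spark-executor.zip /opt/pentaho/pdi-spark-executor.zip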

For the cluster nodes to use the functionality provided by PDI plugins when executing a transformation, the plugins must be installed into the PDI client before you generate the Pentaho Spark application. If you install other plugins later, you must regenerate the Pentaho Spark application.

Configuring the AEL Daemon for Production

To set up the AEL daemon for a production system, you will need to perform several configuration tasks.

In the setenv script, set the required environment property values. Since the pdi-daemon script calls the setenv script each time it runs, these values will always be available. You can also configure environment properties for the AEL daemon in the pdi-daemon script itself. Making configuration changes in each script file is described below.

The pdi-daemon script is only supported in UNIX-based environments.

To view a full list of properties you can set in the setenv and pdi-daemon scripts, see Configurable Properties.

Set Properties in the Setenv Script

The setenv script runs each time the AEL daemon is started and sets the values for the environment properties. In the setenv file, the SPARK_HOME and the SPARK_APP variable values must be manually entered. Perform the following steps to set these values.

  1. Navigate to the 'PDI_AEL_DAEMON_HOME' folder and open the setenv file.
  2. Set the following properties with the appropriate values for your system. These are required properties for running the AEL daemon.
Property Description
SPARK_HOME Location of your Apache Spark Client distribution.
SPARK_APP Location of your Pentaho Spark application.
  3. Save and close the file.
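
For example, assuming the setenv file uses standard shell export syntax, and using the illustrative locations from the earlier steps, the entries might look like this sketch:

export SPARK_HOME=/opt/pentaho/spark-2.1.0-bin-hadoop2.7
export SPARK_APP=/opt/pentaho/data-integration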

Set Properties in the PDI-Daemon Script

Pentaho supplies the pdi-daemon script file to configure the environment properties for the AEL daemon. You can modify the following three configuration files using the pdi-daemon script:

  • AEL daemon configuration file. Set optional environment properties for the AEL daemon.
  • ZooKeeper configuration file. Disable the embedded Pentaho ZooKeeper server shipped with AEL and point to your cluster's ZooKeeper server.
  • Java Authentication and Authorization Service (JAAS) configuration file. Configure the AEL daemon to run in a secure cluster. 

You can modify one or multiple files by using the config command and specifying the file to modify with the following options:

Option Description
--daemon <property name> <property value> Add or modify the named property with the specified value. To view a full list of properties, see Configurable Properties.
--jaas Specify the JAAS configuration file for listing/editing.
--zookeeper <client|server [enable|disable]> Specifies the ZooKeeper client or server configuration file to modify. Use server disable to turn off the embedded Pentaho ZooKeeper server shipped with AEL so that AEL uses your cluster's ZooKeeper.

The following options specify what action you can take with the file you specify:

Option Action
-l or --list Lists all the configuration options in the file.
-e or --edit Opens the file in an editor.

Example 1: Enter the following command to view the current AEL daemon configuration:

./pdi-daemon config --daemon --list

Example 2: Enter the following command to view the current ZooKeeper server configuration:

./pdi-daemon config --zookeeper server --list
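
Example 3: A sketch of setting a single daemon property; the property name follows the camelCase form used in the daemon configuration file (see Running in YARN below), and the path is illustrative:

./pdi-daemon config --daemon hadoopConfDir /etc/hadoop/conf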

Customizing Your AEL Configuration

The property values to use for the custom configuration of the daemon can exist as either environment variables or as shell variables. You can set an environment variable as a property value using the config command with the following options:

Option Action
-q or --quiet Reads existing environment variables and writes them to the appropriate configuration files.
-s or --secure [<keytab_path>] [<principal>] [<jaas_path>] Sets up a secure cluster.
--reset Restores the parameters to their default setting.
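
For example, the following sketch exports SPARK_MASTER (listed in Configurable Properties) as an environment variable and then uses the quiet option to write it to the appropriate configuration file:

$ export SPARK_MASTER=yarn
$ ./pdi-daemon config -q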

Running the AEL Daemon

You can manually start the AEL daemon by running the pdi-daemon startup script. By default, this startup script is installed in the folder data-integration/adaptive-execution, which is referred to as the variable 'PDI_AEL_DAEMON_HOME'.

Perform the following steps to manually start the AEL daemon. 

  1. Navigate to 'PDI_AEL_DAEMON_HOME'. 
  2. Run the pdi-daemon script.

The startup script supports the following commands:

Command Action
start Starts the daemon.
stop Stops the daemon.
status Reports the status of the daemon.

Additionally, the startup script supports the following options:

Option Result
-i or --interactive Starts the daemon in the foreground.

Daemon Startup Examples

The following table lists some examples of how to use the startup script:

Command Result
./pdi-daemon start Starts the AEL daemon with the default configuration.
./pdi-daemon -i start Starts the AEL daemon with the interactive option.
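
You can check on or shut down a running daemon with the other supported commands, for example:

./pdi-daemon status
./pdi-daemon stop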

Running in YARN

Perform the following steps to configure the AEL daemon to run with YARN.

  1. Copy the pdi-spark-executor.zip file into HDFS, as in the example below.
$ hdfs dfs -put pdi-spark-executor.zip /opt/pentaho/pdi-spark-executor.zip
  2. Navigate to the PDI_AEL_DAEMON_HOME/etc folder and open the org.pentaho.pdi.engine.daemon.cfg file. In this file, modify the following property values:

Property Value
sparkMaster yarn
hadoopConfDir Copy the yarn-clientconfig from the cluster and enter the path to the yarn-conf directory here.
assemblyZip hdfs:$HDFS_SPARK_EXECUTOR_LOCATION
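
With these values set, the relevant lines of org.pentaho.pdi.engine.daemon.cfg might look like the following sketch, assuming standard key=value syntax and illustrative paths:

sparkMaster=yarn
hadoopConfDir=/opt/pentaho/yarn-conf
assemblyZip=hdfs:/opt/pentaho/pdi-spark-executor.zip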

Running the AEL Daemon with Your Own ZooKeeper Server

To run the AEL daemon with your cluster's ZooKeeper server, you must disable the embedded ZooKeeper server shipped with the product and reconfigure the daemon to use your cluster's ZooKeeper server. You can accomplish these tasks by running the following pdi-daemon commands:

$ ./pdi-daemon config --zookeeper server disable
pdi-daemon: disabling feature aries-rsa-discovery-zookeeper-server in /Users/user/Pentaho/data-integration/adaptiveexecution/./etc/org.apache.karaf.features.cfg
$ ./pdi-daemon config --zookeeper client zookeeper.host 127.0.0.1
$ ./pdi-daemon config --zookeeper client zookeeper.port 2181
$ ./pdi-daemon config --zookeeper client zookeeper.timeout 3000
$ ./pdi-daemon config --zookeeper client -l
zookeeper.host=127.0.0.1
zookeeper.port=2181
zookeeper.timeout=3000

Running AEL with Spark in a Secure Cluster

By default, the AEL daemon works in an unsecured cluster. To enable security, configure the AEL daemon to work in a secure cluster using impersonation. This configuration requires a proxy user and modifications to the default ZooKeeper configuration. You must configure the AEL daemon to impersonate a proxy user when authenticating to your secure cluster, which is managed by Kerberos.

Complete the following steps to set up secure impersonation while running your transformations in the Spark engine.

  1. Set Up the Proxy User
  2. Modify the Daemon ZooKeeper Configuration
  3. Configure Java Authentication and Authorization Service (JAAS) for Kerberos

Set Up the Proxy User

The following steps associate a proxy user with the AEL daemon:

  1. Navigate to the PDI_AEL_DAEMON_HOME/etc folder and open the org.pentaho.pdi.engine.daemon.cfg file.
  2. Edit the file to set the parameters for your cluster as shown in the table below.
    Parameter Value
    keytabName Name of keytab used for the Kerberos principal.
    kerberosPrincipal Name of the Kerberos principal that has the authority to impersonate another user.
    disableProxyUser Optionally, set to true to disable the proxy user. The acting user will be the kerberosPrincipal. This value is set to false by default.
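
For example, the resulting entries in the configuration file might look like the following sketch; the keytab name and principal are illustrative:

keytabName=exampleUser.keytab
kerberosPrincipal=exampleUser@EXAMPLE.COM
disableProxyUser=false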

Modify the Daemon ZooKeeper Configuration

By default, the AEL daemon includes an embedded Pentaho ZooKeeper server shipped with the product. For secure impersonation, the AEL daemon must be redirected to your cluster's ZooKeeper server by updating the org.apache.aries.rsa.discovery.zookeeper.cfg file in two separate locations. After the AEL daemon has been redirected, the embedded ZooKeeper server must be disabled.

Perform the following steps to modify the ZooKeeper configuration and disable the embedded Pentaho ZooKeeper server.

  1. Navigate to the /data-integration/system/karaf/etc/org.apache.aries.rsa.discovery.zookeeper.cfg file on the Pentaho Server.
  2. Open the file and add the fully qualified host name and port number of your cluster's ZooKeeper server. 
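
Using the property names shown in the earlier ZooKeeper client example, the added entries might look like this sketch with an illustrative host name:

zookeeper.host=zookeeper.example.com
zookeeper.port=2181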

Configure Java Authentication and Authorization Service (JAAS) for Kerberos

The AEL daemon uses the Java Authentication and Authorization Service (JAAS) when run in a secure cluster. To enable secure impersonation, the AEL daemon and the JAAS must be configured for Kerberos authentication.

  1. Copy the keytab file to the PDI_AEL_DAEMON_HOME/keytab folder.
  2. In the PDI_AEL_DAEMON_HOME/jaas folder, create a jaas.conf file.
  3. Modify the jaas.conf file with the following Client entry:
Client { 
        com.sun.security.auth.module.Krb5LoginModule required debug = true 
        useKeyTab = true 
        keyTab = "<path to keytab file>" 
        storeKey = true 
        useTicketCache = false 
        principal = "exampleUser@EXAMPLE.COM"; 
}; 
  4. Ensure that the keyTab property in the jaas.conf file includes the location of the keytab file, as shown in the example above.
  5. In the PDI_AEL_DAEMON_HOME/bin/karaf file, add an environment property variable to indicate the location of the jaas.conf file, as shown in the following example:
-Djava.security.auth.login.config=<location of jaas.conf>
The modified karaf file with the environment variable:

DEFAULT_JAVA_OPTS="
        -Xms${JAVA_MIN_MEM}
        -Xmx${JAVA_MAX_MEM}
        -XX:+UnlockDiagnosticVMOptions
        -XX:+UnsyncloadClass
        -Djava.security.auth.login.config=<path to jaas.conf>"
  6. Ensure that the cluster hostname and IP address are accessible from the AEL daemon, or add them to the hosts file, if necessary.
  7. Ensure that the PDI client has the entry for Kerberos.
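
As an optional check, you can confirm that the keytab and principal are accepted by your Kerberos KDC before starting the daemon; the keytab name and principal here are illustrative:

$ kinit -kt PDI_AEL_DAEMON_HOME/keytab/exampleUser.keytab exampleUser@EXAMPLE.COM
$ klist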

Configurable Properties

The following table lists the properties that can be configured for the AEL daemon in the pdi-daemon script.

Property Description
ASSEMBLY_ZIP Location of the Spark assembly file; this can be an HDFS reference.
DAEMON_DEBUG Enables AEL daemon debug mode. The values are true and false.
DISABLE_PROXY_USER Disables the proxy user. The values are true and false.
HADOOP_CONF_DIR Location where the *-site.xml files reside for the Hadoop configuration.
HADOOP_USER Hadoop user name.
KARAF_DEBUG_PORT Port to use for remote debugging of the Karaf server.
KERBEROS_PRINCIPAL User that will authenticate to the cluster.
KEYTAB_NAME Path to the user's keytab.
SPARK_APP Location of the Spark Karaf assembly.
SPARK_DEPLOY_MODE Method of the Spark deployment. The values are server and client.
SPARK_DRIVER_DEBUG_PORT Port to use for remote debugging of the Spark driver.
SPARK_DRIVER_JAVA_OPTS JVM options to be used with the Spark driver.
SPARK_DRIVER_MEMORY Amount of memory to allocate to the Spark driver.
SPARK_EXECUTOR_DEBUG_PORT Port to use for remote debugging of the Spark executor.
SPARK_EXECUTOR_JAVA_OPTS JVM options to be used with the Spark executor.
SPARK_EXECUTOR_MEMORY Amount of memory to allocate to the Spark executor.
SPARK_HOME Location of the Apache Spark client distribution.
SPARK_MASTER Location where Spark will execute. The values are local and yarn.
SUSPEND_DEBUG Suspends the JVM to wait for a debugger to attach when DAEMON_DEBUG is enabled.

The properties that can be configured for the ZooKeeper server include the following:

  • ZOOKEEPER_CLIENT_PORT
  • ZOOKEEPER_TICK_TIME
  • ZOOKEEPER_INIT_LIMIT
  • ZOOKEEPER_SYNC_LIMIT
  • ZOOKEEPER_DATA_DIR

The properties that can be configured for the ZooKeeper client include the following:

  • ZOOKEEPER_HOST
  • ZOOKEEPER_PORT
  • ZOOKEEPER_TIMEOUT

The properties that can be configured for the JAAS configuration when using Kerberos authentication include the following:

  • KEYTAB_PATH
  • JAAS_CFG

Next Steps

You can now test your AEL configuration by creating a run configuration using the Spark engine. Refer to Run Configurations for more details.