Follow the suggestions in these topics to help resolve common issues with running transformations with the Adaptive Execution Layer.
- Steps cannot run in parallel
- Table Input step fails
- User ID below minimum allowed
- Hadoop version conflict
- Hadoop libraries are missing
- Spark libraries conflict with Hadoop libraries
- Failed to find AVRO files
- Unable to access Google Cloud Storage resources
- Unable to access AWS S3 resources
- Internet Address data type fails
- Message size exceeded
- Spark SQL catalyst errors using the Merge or Group By steps
- Performance or memory issues
- Multiple steps in a transformation cannot generate files to the same location
Steps cannot run in parallel
If you are using the Spark engine to run a transformation with a step that cannot run in parallel, it generates errors in the log.
Some steps cannot run in parallel (on multiple nodes in a cluster), and will produce unexpected results. However, these steps can run as a coalesced dataset on a single node in a cluster. To enable a step to run as a coalesced dataset, add the step ID as a property value in the configuration file for using the Spark engine.
Get the step ID
Each PDI step has a step ID, a globally unique identifier of the step. Use one of the following methods to get the ID of a step:
Method 1: Retrieve the ID from the PDI client
From the menu bar in the PDI client, open the Plugin browser.
Select Step in the Plugin type menu to filter by step name, and find your step name in the table to obtain the related ID.
Method 2: Retrieve the ID from the log
In the PDI client, create a new transformation and add the step to the transformation. For example, if you need to know the ID for the Select values step, add that step to the new transformation.
Set the log level to debug.
Execute the transformation using the Spark engine. The step ID appears in the Logging tab of the Execution Results pane. For example, the log displays Selected the SelectValues step to run in parallel as a GenericSparkOperation, where SelectValues is the step ID.
Method 3: Retrieve the ID from the PDI plugin registry
If you are a developer, you can retrieve the step ID from the PDI plugin registry as described in Dynamically build transformations.
Add the step ID to the configuration file
Perform the following steps to add another step ID to the configuration file:
Navigate to the data-integration/system/karaf/etc folder on the edge node running the AEL daemon and open the org.pentaho.pdi.engine.spark.cfg file.
Append your step ID to the forceCoalesceSteps property value list, using a pipe character separator between the step IDs.
Save and close the file.
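For illustration, the property might look like the following after appending a step ID. SampleRows here is an assumed existing entry, and MyStepId stands in for the ID you retrieved; your file's actual list may differ:

```
forceCoalesceSteps=SampleRows|MyStepId
```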
Force coalesce and Spark tuning
Any steps added to the force coalesce list in the org.pentaho.pdi.engine.spark.cfg configuration file are run as a coalesced dataset. If the corresponding Spark tuning setting in the application.properties file is set to true, then step tuning takes precedence over force coalesce.
Table Input step fails
If you run a transformation using the Table Input step with a large database, the step does not complete. Use one of the following methods to resolve the issue:
Method 1: Load the data to HDFS before running the transformation
Run a different transformation using the Pentaho engine to move the data to the HDFS cluster.
Then use HDFS Input to run the transformation using the Spark engine.
Method 2: Increase the driver side memory configuration
Navigate to the config/ folder and open the application.properties file.
Increase the value of the sparkDriverMemory parameter, then save and close the file.
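A minimal sketch of the change in application.properties, assuming 4 GB is sufficient for your dataset (the value is an assumption; size it to your data and available driver memory):

```
sparkDriverMemory=4g
```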
Method 3: Adjust JDBC tuning options
User ID below minimum allowed
If you are using the Spark engine in a secured cluster and an error about minimum user ID occurs, the user ID of the proxy user is below the minimum user ID required by the cluster. See Cloudera documentation for details.
To resolve, change the ID of the proxy user to be higher than the minimum user ID specified for the cluster.
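As a quick check, the following sketch compares a user's numeric UID against an assumed minimum of 1000; the real limit is the cluster's min.user.id setting (in container-executor.cfg on YARN clusters). Run it as the proxy user:

```shell
# Compare this user's UID with the cluster's minimum allowed UID.
# min_uid=1000 is an assumption; check your cluster's min.user.id.
uid=$(id -u)
min_uid=1000
if [ "$uid" -lt "$min_uid" ]; then
    echo "UID $uid is below the cluster minimum of $min_uid"
else
    echo "UID $uid meets the cluster minimum of $min_uid"
fi
```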
Hadoop version conflict
On an HDP cluster, if you receive the following message, your Hadoop libraries are in conflict, and the AEL daemon and the PDI client might stop working:
To resolve the issue, you must export the HDP_VERSION variable using a command like the following example:
The HDP version number should match the HDP version number of the distribution on the cluster. You can check your HDP version with the hdp-select status hadoop-client command.
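For example, assuming hdp-select status hadoop-client reported version 3.1.4.0-315 (a placeholder; substitute the version from your own cluster), you would export:

```shell
# Export the HDP version before starting the AEL daemon.
# 3.1.4.0-315 is an assumed version; use the value reported by
# `hdp-select status hadoop-client` on your cluster.
export HDP_VERSION=3.1.4.0-315
```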
Hadoop libraries are missing
The Spark libraries packaged with the EMR, Cloudera, and Hortonworks distributions do not include the Hadoop libraries, so you must add the Hadoop libraries to the classpath with the SPARK_DIST_CLASSPATH environment variable. For EMR, these libraries are required to access S3 resources.
Add the class path
The following command will add the libraries to the classpath:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
You can add this command to the daemon.sh file so you do not have to run it every time you start the AEL daemon.
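A sketch of appending that export to daemon.sh from the shell; the install path below is an assumption, so adjust it to your AEL daemon location:

```shell
# Append the classpath export to daemon.sh so it takes effect on every
# AEL daemon start. DAEMON_HOME is an assumed install location.
DAEMON_HOME=${DAEMON_HOME:-./data-integration/adaptive-execution}
mkdir -p "$DAEMON_HOME"    # already present on a real install
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> "$DAEMON_HOME/daemon.sh"
```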
Set Spark home variable
If you received this log error, you must also complete the following steps for your Hadoop distribution:
Download the Spark client for your Hadoop cluster distribution (Cloudera or Hortonworks).
Navigate to the adaptive-execution/config directory and open the application.properties file.
Set the sparkHome location to where Spark 2 is located on your machine.
Example for Cloudera:
sparkHome=/opt/cloudera/parcels/SPARK2/lib/spark2
Example for Hortonworks:
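The Hortonworks value depends on where the Spark 2 client is installed on your machine; the conventional HDP location shown below is an assumption:

```
sparkHome=/usr/hdp/current/spark2-client
```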
Spark libraries conflict with Hadoop libraries
In some cases, library versions contained in JARs from PDI, Spark, Hadoop, AEL, and/or Kettle plugins may conflict with one another, causing general problems where Spark libraries conflict with Hadoop libraries and potentially creating AEL-specific problems. To read more about this issue, including how to address it, see the article AEL and Spark Library Conflicts on the Pentaho Community Wiki.
Failed to find AVRO files
If you are using the Spark engine with an EMR cluster, you may receive the following error message when trying to access AVRO files:
Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
The libraries needed for accessing AVRO files on an EMR cluster are not included in the Spark default classpath. You must add them to the AEL daemon extra/ directory.
To resolve the issue, copy the vendor-supplied data source JAR libraries, such as spark-avro_2.11_2.4.2.jar, from the /usr/lib/spark/external/lib/ directory to the AEL extra/ directory on the daemon, as shown in the following example:
cp /usr/lib/spark/external/lib/spark-avro_2.11_2.4.2.jar $AEL_DAEMON_DIRECTORY/data-integration/adaptive-execution/extra/
Unable to access Google Cloud Storage resources
You might receive an error message when trying to access Google Cloud Storage (GCS) resources. URIs starting with gs://, such as gs://mybucket/myobject.parquet, require specific cluster configurations.
To resolve the issue, see Google Cloud Storage for instructions.
Unable to access AWS S3 resources
You might receive an error message when trying to access AWS S3 resources. URIs starting with s3a:// require specific cluster configurations.
To resolve the issue for an EMR cluster, see Hadoop libraries are missing for instructions.
To resolve the issue for a Cloudera or Hortonworks cluster, see the following vendor-specific cluster documentation for details:
JAR file conflict in Kafka Consumer step
When using the Kafka Consumer step with HDP 3.x on AEL Spark, there is a known conflict with the JAR file /usr/hdp/3.x/hadoop-mapreduce/kafka-clients-0.8.2.1.jar.
Use one of the following solutions to resolve the JAR conflict.
- On HDP 3.x, do not set the SPARK_DIST_CLASSPATH variable before running the Adaptive Execution Layer daemon. Note, however, that leaving this variable unset may cause issues in other AEL components.
- Exclude the JAR file from the path on SPARK_DIST_CLASSPATH with the spark-dist-classpath.sh script. Create the script with any text editor and include the following content:

#!/bin/sh
##
## Helper script for setting up SPARK_DIST_CLASSPATH for AEL.
## Removes conflicting JAR files existing in HDP 3.x.
## Usage: call this the same way you use the hadoop classpath command, i.e.:
##   export SPARK_DIST_CLASSPATH=$(spark-dist-classpath.sh)

# Grab the Hadoop classpath.
HCP=`hadoop classpath`

## Expand it to grab all JAR files.
(
for entry in `echo "$HCP" | sed -e 's/:/\n/g'` ; do
    ## Clean up directories ending with *.
    entryCleaned=`echo "$entry" | sed -e 's/\*$//'`
    ## If it is a directory, expand it.
    if test -d $entryCleaned ; then
        find $entryCleaned
    else
        echo "$entry"
    fi
done
) | grep -v kafka-clients-0.8.2.1.jar | paste -s -d:

exit
Internet Address data type fails
When running an AEL transformation using an input step with the data type 'Internet Address' selected for a URL field, your transformation may not complete properly.
When you are using the Spark engine to run an AEL transformation, do not use the data type 'Internet Address' when entering a URL in a step. Instead, use the data type 'String' for the URL.
Message size exceeded
Perform the following steps to increase the message buffer limit:
Navigate to the data-integration/adaptive-execution/config directory and open the application.properties file using a text editor.
Enter the following incoming WebSocket message buffer properties, setting the same value for each property:
daemon.websocket.maxMessageBufferSize
The maximum size (in bytes) for the message buffer on the AEL daemon. For example, to allocate a 4 MB limit, set the value to 4194304.
driver.websocket.maxMessageBufferSize
The maximum size (in bytes) for the message buffer on the AEL Spark driver. For example, to allocate a 4 MB limit, set the value to 4194304.
Save and close the file.
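A minimal sketch of the resulting entries in application.properties, assuming a 4 MB (4194304-byte) limit fits your message payloads:

```
daemon.websocket.maxMessageBufferSize=4194304
driver.websocket.maxMessageBufferSize=4194304
```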
Spark SQL catalyst errors using the Merge or Group By steps
If you are using the Spark engine with the Merge Rows (diff), Merge Join, or Group By step, you might receive an error similar to the following message:
Field names for join keys (values to compare or group) cannot contain special characters, such as whitespace or dashes.
To resolve the issue, remove the special characters from the field names within your transformation.
Performance or memory issues
If you experience performance or memory issues while running your PDI transformation on the Spark engine, your transformation may not be efficiently using Spark execution resources.
To resolve or minimize the issue, apply and adjust application and PDI step Spark tuning parameters. See About Spark tuning in PDI for details.
Multiple steps in a transformation cannot generate files to the same location
If your transformation contains multiple steps that generate output files to the same destination folder, files or data might be missing.
Spark requires unique names for the files and folders generated by each step. To resolve this issue, send the files from each step to unique folders with unique filenames.