HBase setup for Spark

The HBase Input and HBase Output steps can run on Spark with the Adaptive Execution Layer (AEL). These steps can be used with the supported versions of Cloudera Distribution for Hadoop (CDH) and Hortonworks Data Platform (HDP). To read data from or write data to HBase, you must have an HBase target table on the cluster. If one does not exist, you can create one using HBase shell commands.

Note: Due to Cloudera limitations, the HBase Input step fails when Spark runs in YARN mode with Kerberos.

This article explains how you can set up the Pentaho Server to run these steps.

Set up the application properties file

You must set up the application.properties file to permit Spark jobs on AEL to access the hbase-site.xml file from the HDFS cluster. This setup enables Spark jobs to connect to HBase from the Spark Executors. You must also specify the location of the vendor-specific JARs described below so they can be loaded on the classpath.

Perform the following steps to set up the application.properties file:

Procedure

  1. Navigate to the design-tools/data-integration/adaptive-execution/config folder and open the application.properties file with any text editor.

  2. Set the value of the hbaseConfDir property to the location of your hbase-site.xml file.

  3. Set the value of the extraLib property to the location of the vendor-specific JARs.

    The default value is ./extra.
  4. Save and close the file.
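
If you script this setup instead of editing the file by hand, the two property edits above can be sketched in Python. This is a minimal sketch, not a Pentaho utility: the helper name set_properties is hypothetical, and it assumes the file consists of plain key=value lines and # comments.

```python
from pathlib import Path


def set_properties(path, updates):
    """Set key=value pairs in a properties file, preserving other lines.

    Assumes simple `key=value` lines; keys not already present are appended.
    """
    remaining = dict(updates)
    lines = []
    for line in Path(path).read_text().splitlines():
        stripped = line.strip()
        # Leave comments and blank lines untouched.
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                line = f"{key}={remaining.pop(key)}"
        lines.append(line)
    # Append any properties that were not already in the file.
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    Path(path).write_text("\n".join(lines) + "\n")


# Example call (the hbaseConfDir value is a placeholder, not a real default):
# set_properties("application.properties",
#                {"hbaseConfDir": "/path/to/hbase/conf", "extraLib": "./extra"})
```

The helper rewrites only the named properties and leaves every other line as it was, so it can be run again safely after an upgrade replaces the file.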

Set up the vendor-specific JARs

Each vendor handles byte conversion for HBase differently, so you must use the JAR files for your Hadoop distribution.

Note: Vendor-specific JARs for HBase are not shipped with Spark or HDFS.

Perform the following steps to set up the vendor-specific JARs:

Procedure

  1. Navigate to the design-tools/data-integration/adaptive-execution/extra directory and delete the three HBase JAR files.

  2. Navigate to the design-tools/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations directory and locate your CDH or HDP distribution folder.

  3. Locate the lib/pmr directory in your distribution folder.

  4. Copy the six HBase files, along with the metrics-core file, to the design-tools/data-integration/adaptive-execution/extra folder.

  5. To complete your setup, you must restart the AEL daemon.
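
The delete-and-copy part of this procedure can also be scripted. The following Python sketch is one way to do it under stated assumptions: the function name replace_hbase_jars is hypothetical, and the file name patterns hbase*.jar and metrics-core*.jar are guesses at how the JARs are named, so check them against the actual contents of your lib/pmr directory before relying on it.

```python
import shutil
from pathlib import Path


def replace_hbase_jars(pmr_dir, extra_dir):
    """Swap the HBase JARs in extra_dir for the vendor-specific ones in pmr_dir.

    The glob patterns are assumptions about how the JAR files are named.
    Returns the sorted names of the files copied.
    """
    pmr, extra = Path(pmr_dir), Path(extra_dir)
    # Step 1: delete the HBase JARs currently in the extra directory.
    for old_jar in extra.glob("hbase*.jar"):
        old_jar.unlink()
    # Step 4: copy the vendor HBase JARs and the metrics-core JAR.
    copied = []
    for pattern in ("hbase*.jar", "metrics-core*.jar"):
        for jar in pmr.glob(pattern):
            shutil.copy2(jar, extra / jar.name)
            copied.append(jar.name)
    return sorted(copied)
```

The returned list lets you confirm that the expected files were copied. The script does not restart the AEL daemon, so that step still has to be done afterward.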

Using HBase steps with Amazon EMR 5.21

To use the HBase Input and HBase Output steps with EMR 5.21, you must add the following parameter:

spark.hadoop.validateOutputSpecs=false

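You can set the parameter in the properties file, in the Transformation properties, or as an environment variable in PDI. Each method is described in its own section of this article.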

For more information about the properties file and processing Spark parameters, see Specify additional Spark properties.

Specify the parameter in the properties file

Perform the following steps to edit the application.properties file.

Procedure

  1. Navigate to the design-tools/data-integration/adaptive-execution/config folder and open the application.properties file with any text editor.

  2. Find the section labeled #Base Configuration.

  3. Add the following parameter:

    spark.hadoop.validateOutputSpecs=false
  4. Save and close the file.

Next steps

For more information about the application.properties file, see Specify additional Spark properties.

Specify the parameter in Transformation properties

Perform the following steps to specify the parameter in PDI using the Transformation properties dialog box.

Procedure

  1. Double-click anywhere on the transformation canvas.

    The Transformation properties dialog box appears.
  2. Click the Parameters tab and enter the following information:

    1. In the Parameter column, type spark.hadoop.validateOutputSpecs.

    2. In the Default Value column, type false.

    3. (Optional) Add a descriptive note about why the parameter is included.

  3. Click OK to activate the parameter.

    You can verify it is active in the transformation logging.

Next steps

For more information about processing Spark parameters, see Specify additional Spark properties.

Specify the parameter as an environment variable in PDI

Perform the following steps to specify the parameter in PDI using an environment variable.

Procedure

  1. From the Edit menu, select Set Environment Variables.

    The Set Environment Variables table appears.
  2. Enter the following information:

    1. In the Name column, type spark.hadoop.validateOutputSpecs.

    2. In the Value column, type false.

  3. Click OK to activate the parameter.

    You can verify it is active in the transformation logging.

Next steps

For more information about processing Spark parameters, see Specify additional Spark properties.