Pentaho Documentation

Manage Hadoop configurations through PDI

Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, Sqoop, or Pig.

Hadoop configurations are defined in the plugin.properties file, which resides in the pentaho-big-data-plugin/ folder. To switch PDI to a different Hadoop configuration, change the value of the active.hadoop.configuration property in that file.
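As an illustration of that switch, here is a minimal Python sketch that rewrites the active.hadoop.configuration line in properties-style text. It assumes plain key=value lines and does not handle escapes or line continuations. The configuration names hadoop-a and hadoop-b and the helper names are hypothetical, not values shipped with PDI.

```python
KEY = "active.hadoop.configuration"


def read_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def set_active_configuration(text, name):
    """Return the text with KEY set to name, keeping all other lines as-is."""
    out, replaced = [], False
    for raw in text.splitlines():
        stripped = raw.strip()
        is_comment = stripped.startswith(("#", "!"))
        if not is_comment and stripped.partition("=")[0].strip() == KEY:
            out.append(f"{KEY}={name}")
            replaced = True
        else:
            out.append(raw)
    if not replaced:
        # The property was absent, so append it at the end.
        out.append(f"{KEY}={name}")
    return "\n".join(out) + "\n"


# Hypothetical file content; real files contain more properties and comments.
sample = "# example plugin.properties content\nactive.hadoop.configuration=hadoop-a\n"
updated = set_active_configuration(sample, "hadoop-b")
print(read_properties(updated)[KEY])
```

In practice you would usually edit the file by hand. The sketch only shows that a single property value selects the configuration.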

All Hadoop configurations share a basic structure. Each element of the structure is defined after the following listing:

configuration/
|-- lib/
|   |-- pmr/
|   `-- *.jar
|-- config.properties
|-- core-site.xml
`-- configuration-implementation.jar

Configuration Element             Definition
lib/                              Libraries specific to the version of Hadoop with which this configuration was created to communicate.
pmr/                              Jar files that contain libraries required for parsing data in input/output formats or otherwise outside of any PDI-based execution.
*.jar                             All other libraries required for Hadoop configuration that are not client-only or special PMR JAR files that need to be available to the entire JVM of Hadoop job tasks.
config.properties                 Contains metadata and configuration options for this Hadoop configuration. It provides a way to define a configuration name, additional classpath, and native libraries that the configuration requires. See the comments in this file for more details.
core-site.xml                     Configuration file that can be replaced to set a site-specific configuration. For example, hdfs-site.xml would be used to configure HDFS.
configuration-implementation.jar  File that must be replaced to communicate with this configuration.
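
The layout above can be checked mechanically. The following Python sketch reports which of the listed elements are absent from a configuration folder. The function name missing_elements is hypothetical, and treating any top-level .jar file as the implementation JAR is an assumption made for illustration, not a rule stated by PDI.

```python
import tempfile
from pathlib import Path


def missing_elements(config_dir):
    """List the structural elements that are absent from config_dir."""
    root = Path(config_dir)
    missing = []
    # Directories named in the listing.
    for rel in ("lib", "lib/pmr"):
        if not (root / rel).is_dir():
            missing.append(rel + "/")
    # Files named in the listing.
    for rel in ("config.properties", "core-site.xml"):
        if not (root / rel).is_file():
            missing.append(rel)
    # Assumption: any top-level .jar counts as the implementation JAR.
    if not any(root.glob("*.jar")):
        missing.append("*.jar (configuration implementation)")
    return missing


# Build a deliberately incomplete folder in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "lib" / "pmr").mkdir(parents=True)
    (demo / "config.properties").write_text("")
    print(missing_elements(demo))
```

The demo folder lacks core-site.xml and a top-level JAR, so both are reported as missing.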

Include or exclude classes or packages for a Hadoop configuration

You have the option to include or exclude classes or packages from loading with a Hadoop configuration.

Configure these options within the plugin.properties file located at plugins/pentaho-big-data-plugin. For additional information, see the comments within the plugin.properties file.

  • Include Additional Class Paths or Libraries

    To include additional class paths, native libraries, or a user-friendly configuration name, add the directory to the classpath property in the Big Data plugin's plugin.properties file.

  • Exclude Classes or Packages

    To exclude classes or packages from duplicate loading by a Hadoop configuration class loader, include them in the ignored.classes property within the plugin.properties file. This is necessary when a logging library expects a single class to be shared by all class loaders, as Apache Commons Logging does.
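
As a sketch of maintaining such an exclusion list, the Python helper below adds an entry to a property value without duplicating it. It assumes the value is a comma-separated list, so check the comments in plugin.properties for the actual format. The function name add_ignored and the existing entry org.example.logging are hypothetical.

```python
def add_ignored(value, entry):
    """Append entry to a comma-separated property value if it is absent."""
    # Assumption: entries are separated by commas, with optional whitespace.
    items = [item.strip() for item in value.split(",") if item.strip()]
    if entry not in items:
        items.append(entry)
    return ",".join(items)


# Hypothetical existing value of the ignored.classes property.
current = "org.example.logging"
updated_value = add_ignored(current, "org.apache.commons.logging")
print("ignored.classes=" + updated_value)
```

Calling the helper again with the same entry leaves the value unchanged, which keeps repeated edits from growing the list.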