Pentaho Documentation

Manage Hadoop Configurations through PDI

Provides information on Hadoop configurations.

Within PDI, a Hadoop configuration is the collection of Hadoop libraries required to communicate with a specific version of Hadoop and related tools, such as Hive, HBase, Sqoop, or Pig.

Hadoop configurations are selected in the plugin.properties file: to switch configurations within PDI, change the active.hadoop.configuration property. The plugin.properties file resides in the pentaho-big-data-plugin/ folder.
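As an illustration of the property described above, the following minimal Java sketch reads active.hadoop.configuration from the text of a plugin.properties file using the standard java.util.Properties class. The class name ActiveHadoopConfiguration, the method name, and the value my-hadoop-config are hypothetical and exist only for this example; this is not how PDI itself loads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ActiveHadoopConfiguration {

    // Property name taken from the text above.
    static final String ACTIVE_KEY = "active.hadoop.configuration";

    /** Parses plugin.properties content and returns the active configuration name, or null if unset. */
    static String activeConfiguration(String pluginPropertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(pluginPropertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked if it does.
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(ACTIVE_KEY);
        return value == null ? null : value.trim();
    }

    public static void main(String[] args) {
        // "my-hadoop-config" is a hypothetical configuration name used only for illustration.
        String sample = "# excerpt of plugin.properties\n"
                + "active.hadoop.configuration=my-hadoop-config\n";
        System.out.println(activeConfiguration(sample));
    }
}
```

To inspect a real installation, the same method could be given the contents of the plugin.properties file found in the pentaho-big-data-plugin/ folder.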

All Hadoop configurations share a basic structure. Elements of the structure are defined in the table following this code block.

configuration/
|-- lib/
|   |-- client/
|   |-- pmr/
|   `-- *.jar
|-- config.properties
|-- core-site.xml
`-- configuration-implementation.jar
| Configuration Element | Definition |
| --- | --- |
| lib/ | Libraries specific to the version of Hadoop this configuration was created to communicate with. |
| client/ | Libraries that are only required on a Hadoop client, for instance hadoop-core-* or hadoop-client-*. |
| pmr/ | JAR files that contain libraries required for parsing data in input/output formats or otherwise outside of any PDI-based execution. |
| *.jar | All other libraries required by the Hadoop configuration, that is, those that are neither client-only libraries nor the special pmr JAR files (which need to be available to the entire JVM of Hadoop job tasks). |
| config.properties | Contains metadata and configuration options for this Hadoop configuration. Provides a way to define a configuration name, additional classpath, and native libraries the configuration requires. See the comments in this file for more details. |
| core-site.xml | Configuration file that can be replaced to set a site-specific configuration. For example, hdfs-site.xml would be used to configure HDFS. |
| configuration-implementation.jar | File that must be replaced in order to communicate with this configuration. |
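The structure described above can be checked with a short sketch like the one below, which reports which of the named directories and files are missing from a configuration folder. The class and method names are hypothetical. The check skips configuration-implementation.jar on the assumption that its real file name differs from one configuration to another. It is an illustrative helper, not a validation that PDI itself performs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class HadoopConfigurationLayout {

    // Directories and files named in the structure above.
    static final String[] REQUIRED_DIRS = {"lib", "lib/client", "lib/pmr"};
    static final String[] REQUIRED_FILES = {"config.properties", "core-site.xml"};

    /** Returns the relative paths of expected elements that are missing under configDir. */
    static List<String> missingElements(Path configDir) {
        List<String> missing = new ArrayList<>();
        for (String dir : REQUIRED_DIRS) {
            if (!Files.isDirectory(configDir.resolve(dir))) {
                missing.add(dir + "/");
            }
        }
        for (String file : REQUIRED_FILES) {
            if (!Files.isRegularFile(configDir.resolve(file))) {
                missing.add(file);
            }
        }
        return missing;
    }

    /** Creates a throwaway, partially populated configuration folder for demonstration. */
    static Path demoFolder() {
        try {
            Path dir = Files.createTempDirectory("hadoop-config-demo");
            Files.createDirectories(dir.resolve("lib").resolve("client"));
            Files.createFile(dir.resolve("config.properties"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The demo folder deliberately lacks lib/pmr/ and core-site.xml.
        System.out.println("Missing: " + missingElements(demoFolder()));
    }
}
```

Passing the path of an actual configuration folder to missingElements would give a quick view of whether the folder follows the shared basic structure.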