Pentaho Documentation

Set Up Pentaho to Connect to a Hadoop Cluster

Overview

Learn how to connect Pentaho to CDH, HDP, EMR, or MapR

Connecting to a Hadoop cluster can help you accomplish many business objectives. For example, you can optimize a data warehouse by offloading less frequently used data to a Hadoop data lake, or you can mine Hadoop data, then enrich, access, process, and package it as a new service.

Pentaho can connect to one or more versions of these Hadoop distributions:

  • Cloudera Distribution for Hadoop (CDH)
  • Hortonworks Data Platform (HDP)
  • Amazon Elastic MapReduce (EMR)
  • MapR  

Pentaho also supports many related services such as HDFS, HBase, Oozie, ZooKeeper, and Spark. You can connect to clusters and services from these Pentaho components: Spoon, Pentaho Data Integration (DI) Server, Pentaho Business Analytics (BA) Server (including Analyzer and Pentaho Interactive Reporting), Pentaho Report Designer (PRD), and Pentaho Metadata Editor (PME).

You must configure Pentaho before it can connect to a cluster. The configuration process largely involves customizing shim files for each computer and component that you want to connect to the cluster. Shims are Pentaho-developed adapters that help Pentaho connect to the cluster. Pentaho regularly develops and releases shims, even between product releases, so that customers can keep up with new versions of the Hadoop distributions. To see which shims are supported for this version of Pentaho, see the Component Reference.
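
As a rough illustration of what checking a shim configuration can look like, the sketch below assumes that the active shim is selected by an `active.hadoop.configuration` entry in a `plugin.properties` file. The property name, the sample shim name `cdh55`, the sample file contents, and both helper functions are assumptions made for illustration and are not taken from this page; follow the distribution-specific instructions for the actual files and values.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            props[key.strip()] = value.strip()
    return props


def active_shim(text, key="active.hadoop.configuration"):
    """Return the configured shim name, or None if the key is unset or empty.

    The default key name is an assumption, not something this page confirms.
    """
    return parse_properties(text).get(key) or None


# Hypothetical file contents, used only to exercise the helpers above.
SAMPLE = """\
# Hypothetical excerpt of a big data plugin.properties file
active.hadoop.configuration=cdh55
hadoop.configurations.path=hadoop-configurations
"""

if __name__ == "__main__":
    print(active_shim(SAMPLE))
```

A check like this could be run on each computer and component that connects to the cluster to confirm that they all point at the same shim.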

If the Hadoop distribution that you want to use is not listed, visit Configuring Pentaho for your Hadoop Distro and Version. A previous version of Pentaho might support older Hadoop distributions.

To learn how to configure a distribution, click one of the following links: