Pentaho Documentation



Provides an overview of the guide.

Pentaho Data Integration (PDI) is a flexible tool that allows you to collect data from disparate sources such as databases, files, and applications, and turn the data into a unified format that is accessible and relevant to end users. PDI provides the Extraction, Transformation, and Loading (ETL) engine that facilitates the process of capturing the right data, cleansing the data, and storing the data using a uniform and consistent format.

PDI supports slowly changing dimensions and surrogate keys for data warehousing, allows data migration between databases and applications, is flexible enough to load very large data sets, and can take full advantage of cloud, clustered, and massively parallel processing environments. You can cleanse your data using transformation steps that range from very simple to very complex. Finally, you can leverage ETL as the data source for Pentaho Reporting.

Note: Dimension is a data warehousing term that refers to logical groupings of data such as product, customer, or geographical information. Slowly Changing Dimensions (SCD) are dimensions that contain data that changes slowly over time. For example, in most instances, employee job titles change slowly over time.
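To make the note concrete, the following Python sketch shows one common way to track a slowly changing dimension: close the current row and add a new row with a fresh surrogate key whenever an attribute changes (often called a "Type 2" approach, a term this guide does not use). It is illustrative only and is not how PDI implements the feature. The `DimRow` class, its field names, and the `apply_change` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    """One version of an employee in a hypothetical employee dimension."""
    surrogate_key: int                # warehouse-generated key, unique per version
    employee_id: str                  # natural (business) key from the source system
    job_title: str                    # the slowly changing attribute
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def apply_change(rows: List[DimRow], employee_id: str,
                 job_title: str, as_of: date) -> List[DimRow]:
    """Record a job title for an employee, keeping earlier versions as history."""
    current = next(
        (r for r in rows if r.employee_id == employee_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.job_title == job_title:
            return rows               # nothing changed, so no new version
        current.valid_to = as_of      # close the outgoing version
    # Assumption for this sketch: surrogate keys are simple increasing integers.
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    rows.append(DimRow(next_key, employee_id, job_title, as_of))
    return rows


dim: List[DimRow] = []
apply_change(dim, "E100", "Analyst", date(2020, 1, 1))
apply_change(dim, "E100", "Senior Analyst", date(2022, 6, 1))
for row in dim:
    print(row)
```

Because each version keeps its own surrogate key, fact records loaded while the employee was an Analyst can continue to point at that version after the title changes.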

Common Uses of Pentaho Data Integration Include:

  • Data migration between different databases and applications
  • Loading huge data sets into databases, taking full advantage of cloud, clustered, and massively parallel processing environments
  • Data cleansing with steps ranging from very simple to very complex transformations
  • Data integration, including the ability to leverage real-time ETL as a data source for Pentaho Reporting
  • Data warehouse population with built-in support for slowly changing dimensions and surrogate key creation (as described above)

Audience and Assumptions

This section is written for IT managers, database administrators, and Business Intelligence solution architects who have intermediate to advanced knowledge of ETL and Pentaho Data Integration Enterprise Edition features and functions.

You must have installed Pentaho Data Integration to examine some of the step-related information included in this document.

If you are a novice user, Pentaho recommends that you start by following the exercises in Getting Started with Pentaho Data Integration, available in the Pentaho InfoCenter. You can return to this document when you have mastered some of the basic skills required to work with Pentaho Data Integration.

What this Section Covers

This document provides information about the most commonly used steps. For more information about steps, see Matt Casters' blog and the Pentaho Data Integration wiki.

Refer to Administer DI Server for information about administering PDI and configuring security.