Pentaho Documentation

Parquet Input

The Parquet Input step decodes Parquet data and extracts fields using the structure defined in the source files. For big data users, the Parquet Input and Parquet Output transformation steps make it easier to gather raw data from various sources and move it into the Hadoop ecosystem to create a useful, summarized data set for analysis. Depending on your setup, you can execute the transformation within PDI or within the Adaptive Execution Layer (AEL), using Spark as the processing engine.

Before using the Parquet Input step, you must select and configure the shim for your distribution, even if your Location is set to 'Local'. The Parquet Input step requires the shim classes to read the data correctly. For information on configuring a shim for a specific distribution, see Set Up Pentaho to Connect to a Hadoop Cluster.

Options

 
 
The Parquet Input transformation step features the following options.
 
Option Description
Step Name

Specifies the unique name of the Parquet Input step on the canvas. You can customize the name or leave it as the default.

Location

Indicates the file system or cluster where the source file is located. Options are as follows:

  • Local: Specifies that the source file is on a file system local to the PDI client.
  • S3: Specifies that the source file is on the S3 file system.
  • HDFS: Specifies that the source file is on HDFS.
  • Named Cluster: Specifies that the source file is in the cluster indicated.
Folder/File name

Specifies the full name of the source file or folder that supplies the input fields.

  • When running on the Pentaho engine, specify a single Parquet file to read as input.
  • When running on the Spark engine, specify a folder; all the Parquet files in that folder are read as input.
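
The difference between the two engines can be sketched in a few lines of Python. This is only an illustration of the rule described above, not PDI code: the function name resolve_parquet_inputs, the engine labels "pentaho" and "spark", and the assumption that Parquet files end in .parquet are all hypothetical.

```python
from pathlib import Path
from typing import List


def resolve_parquet_inputs(location: str, engine: str) -> List[Path]:
    """Return the files a run would read (hypothetical helper, not a PDI API)."""
    path = Path(location)
    if engine == "pentaho":
        # Pentaho engine: the setting must name exactly one Parquet file.
        if not path.is_file():
            raise ValueError(f"expected a single Parquet file, got: {path}")
        return [path]
    if engine == "spark":
        # Spark engine: the setting names a folder; every Parquet file in it is read.
        # Assumption: Parquet files are recognized by a .parquet extension.
        if not path.is_dir():
            raise ValueError(f"expected a folder of Parquet files, got: {path}")
        return sorted(p for p in path.glob("*.parquet") if p.is_file())
    raise ValueError(f"unknown engine: {engine!r}")
```

Under these assumptions, pointing the Pentaho engine at a folder, or the Spark engine at a single file, raises an error rather than silently reading the wrong input.
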
Fields

Specify the following information for the input fields:

  • Path: The location of the source for this field.
  • Name: The name of the input field.
  • Type: The type of the input field, such as String or Date.
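
As a rough mental model, each row of the Fields table maps a source path to a named, typed output field. The Python sketch below illustrates that mapping on an in-memory record. It is not how PDI implements the step: FieldSpec, CONVERTERS, and apply_fields are hypothetical names, only the two types mentioned above are covered, and dates are assumed to arrive as ISO yyyy-mm-dd strings.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldSpec:
    path: str  # where the value comes from in the source record
    name: str  # name of the field in the output row
    type: str  # type name as shown in the Fields table, e.g. "String" or "Date"


# Hypothetical converters for the two example types only.
# Assumption: Date values are date objects or ISO yyyy-mm-dd strings.
CONVERTERS: Dict[str, Callable[[Any], Any]] = {
    "String": str,
    "Date": lambda v: v if isinstance(v, date) else date.fromisoformat(str(v)),
}


def apply_fields(record: Dict[str, Any], fields: List[FieldSpec]) -> Dict[str, Any]:
    """Build one output row: look up each path, convert it, store it under name."""
    return {f.name: CONVERTERS[f.type](record[f.path]) for f in fields}
```

For example, a spec of FieldSpec("signup", "signup_date", "Date") would take the source value at signup and emit it as a date under the name signup_date; source columns with no matching row in the table are simply not extracted.
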
Get Fields

Click this button to insert the list of fields from the input stream into the Fields table.

Preview

Click this button to preview the rows generated by this step.

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.