Pentaho Documentation

Parquet Output

The Parquet Output step allows you to map PDI fields to fields within data files and choose where you want to process those files, such as on HDFS. For big data users, the Parquet Output and the Parquet Input transformation steps ease the process of gathering raw data from various sources and moving that data into the Hadoop ecosystem to create a useful, summarized data set for analysis. Depending on your setup, you can execute the transformation within PDI or within the Adaptive Execution Layer (AEL), using Spark as the processing engine.

Before using the Parquet Output step, you will need to select and configure the shim for your distribution, even if your Location is set to 'Local'. The Parquet Output step requires the shim classes to read the correct data. For information on configuring a shim for a specific distribution, see Set Up Pentaho to Connect to a Hadoop Cluster.

General

[Image: the Parquet Output step dialog]

Enter the following information in the transformation step fields.

Option Description
Step Name Specifies the unique name of the Parquet Output step on the canvas. The Step Name is set to 'Parquet Output' by default.
Location

Indicates the file system or specific cluster on which the item you want to output can be found. Options are as follows:

  • Local: Specifies that the item entered in the File name field is in a file system that is local to the PDI client.
  • S3: Specifies that the item entered in the File name field is on the S3 file system.
  • HDFS: Specifies that the item entered in the File name field is on the HDFS.
  • Named Cluster: Specifies that the item specified in the File name field is in the cluster indicated.
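For illustration only (the host, port, and bucket names below are hypothetical), the value entered in the File name field typically takes a form that matches the selected Location:

```text
/data/output/sales.parquet                  Local file system
s3://my-bucket/output/sales.parquet         S3
hdfs://namenode:8020/output/sales.parquet   HDFS
```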
 
Folder/File name

Specifies the location and/or name of the file or folder to which to write. Click Browse to display the Open File window and navigate to the file or folder.

  • When running on the Pentaho engine, a single Parquet file is created.
  • When running on the Spark engine, a folder is created with Parquet files.
Overwrite existing output file Select to overwrite an existing file that has the same file name and extension as the one created here.

Options

The Parquet Output step features several tabs with fields for defining results. Each tab is described below.

Fields Tab

[Image: the Fields tab of the Parquet Output step]

In this tab, you can define properties for the fields being exported. The table below describes each of the options for configuring the field properties.

Option Description
Path The location of the data for this field.
Name The name of the output field.
Type The type of the output field, such as String or Date.
Default value Specify the value to be used for the field if the value read is Null. 
Null Indicates whether the field can be Null. If the field cannot be Null, then the Default Value must be specified. 
Get Fields Click this button to insert the list of fields from the input stream into the Fields table. 

Options Tab

[Image: the Options tab of the Parquet Output step]

In this tab, you can define properties for the file output. 

Option Description
Compression

Select one of the following compression types for this file:

  • None
  • Snappy
  • GZIP

The default value is 'None'.
Version

Select the version of Parquet you want to use. Options include:

  • Parquet 1.0
  • Parquet 2.0
Row group size (MB) Specify the maximum size, in megabytes, of each row group in the output file.
Data page size (KB) Specify the size, in kilobytes, of each data page in the output file.
Dictionary encoding and Page size (KB)

Select to specify dictionary encoding and dictionary page size. 

Selecting dictionary encoding builds a dictionary of the values encountered in each column. If the dictionary grows larger than this field value, whether in size or in number of distinct values, the encoding falls back to plain encoding. The dictionary page is written first, before the data pages of the column.

The default value for page size is '1024' KB.

Extension and options

Select the extension for your output file. The default value is 'parquet'.

Optionally, you can append the following date and time information to the output file name:

  • Include date in file name. Select to include the date the file was generated in the output file name.
  • Include time in file name. Select to include the time the file was generated in the output file name.
  • Specify date time format. Select to include the date and time the file was generated in the output file name. Use the drop-down menu to select a specific date/time format for this option.
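A date/time suffix on the output file name could look like the following sketch (the format pattern and base name are assumptions for illustration, not PDI's exact defaults):

```python
from datetime import datetime

base, ext = "sales", "parquet"
# Fixed timestamp so the example output is deterministic;
# a real run would use datetime.now().
stamp = datetime(2024, 1, 15, 9, 30).strftime("%Y%m%d_%H%M%S")
filename = f"{base}_{stamp}.{ext}"
print(filename)  # sales_20240115_093000.parquet
```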

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.