Parquet Output
The Parquet Output step allows you to map PDI fields to fields within data files and choose where you want to process those files, such as on HDFS. For big data users, the Parquet Output and the Parquet Input transformation steps ease the process of gathering raw data from various sources and moving that data into the Hadoop ecosystem to create a useful, summarized data set for analysis. Depending on your setup, you can execute the transformation within PDI or within the Adaptive Execution Layer (AEL), using Spark as the processing engine.
Before using the Parquet Output step, you must select and configure the shim for your distribution, even if your Location is set to 'Local'. The Parquet Output step requires the shim classes to write the data correctly. For information on configuring a shim for a specific distribution, see Set Up Pentaho to Connect to a Hadoop Cluster.
General
Enter the following information in the transformation step fields.
Option | Description |
---|---|
Step Name | Specifies the unique name of the Parquet Output step on the canvas. The Step Name is set to 'Parquet Output' by default. |
Location | Indicates the file system or specific cluster on which the item you want to output can be found, such as the local file system or HDFS. |
Folder/File name | Specifies the location and/or name of the file or folder to write to. Click Browse to display the Open File window and navigate to the file or folder. |
Overwrite existing output file | Select to overwrite an existing file that has the same file name and extension as the one created here. |
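Outside of PDI, the same destination choice can be sketched with pyarrow. This is only an illustration of writing Parquet to a local path versus an HDFS cluster, not the step's internal behavior; the host name, port, and paths below are placeholder assumptions:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# A few sample rows standing in for the incoming PDI stream.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["alpha", "beta", "gamma"],
})

# Location = Local: a plain folder/file path is enough.
pq.write_table(table, "/tmp/sales.parquet")

# Location = an HDFS cluster: pass an explicit filesystem instead.
# Host and port are placeholders; HadoopFileSystem also requires a
# local libhdfs/Hadoop client installation to connect.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
pq.write_table(table, "/data/sales.parquet", filesystem=hdfs)
```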
Options
The Parquet Output step features two tabs with fields for defining the output. Each tab is described below.
Fields Tab
In this tab, you can define properties for the fields being exported. The table below describes each of the options for configuring the field properties.
Option | Description |
---|---|
Path | The location of the data for this field. |
Name | The name of the output field. |
Type | The type of the output field, such as String or Date. |
Default value | Specify the value to be used for the field if the value read is Null. |
Null | Indicates whether the field can be Null. If the field cannot be Null, a Default value must be specified. |
Get Fields | Click this button to insert the list of fields from the input stream into the Fields table. |
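The Name, Type, Null, and Default value properties map onto standard Parquet schema concepts. As an analogy only (not what the step runs internally), the following pyarrow sketch shows a default substituted for Nulls and the field declared non-nullable; the field name and default are illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

# A column as read from the input stream, with one missing value.
amount = pa.array([10.5, None, 7.25], type=pa.float64())

# "Default value": substitute a default wherever the value read is Null.
amount_filled = pc.fill_null(amount, 0.0)

# "Null = No": declare the output field non-nullable, which is only
# valid once every Null has been replaced by the default.
schema = pa.schema([pa.field("amount", pa.float64(), nullable=False)])
table = pa.table({"amount": amount_filled}, schema=schema)
```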
Options Tab
In this tab, you can define properties for the file output.
Option | Description |
---|---|
Compression | Select the compression type to apply to this file, such as Snappy or GZIP, or None for no compression. |
Version | Select the version of Parquet you want to use. |
Row group size (MB) | Specify the size, in megabytes, of each row group. |
Data page size (KB) | Specify the size, in kilobytes, of each data page. |
Dictionary encoding and Page size (KB) | Select to enable dictionary encoding and specify the dictionary page size. Dictionary encoding builds a dictionary of the values encountered in a column; the dictionary page is written before the data pages of that column. If the dictionary grows larger than this field value, whether in size or in number of distinct values, the encoding falls back to plain encoding. The default page size is '1024' KB. |
Extension and options | Select the extension for your output file. The default value is 'parquet'. Optionally, you can append additional elements, such as the date or time, to the output file name. |
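These writer settings are not unique to PDI; most Parquet writers expose the same knobs. As a point of comparison only, here is how roughly equivalent options might be set with pyarrow. Note the unit differences: pyarrow sizes pages in bytes and row groups in rows, while this step uses MB and KB; the values below are illustrative:

```python
import pyarrow.parquet as pq

pq.write_table(
    table,                        # the table from the earlier sketches
    "/tmp/sales.parquet",
    compression="snappy",                   # Compression
    version="2.6",                          # Parquet format version
    row_group_size=100_000,                 # row group size, in rows here
    data_page_size=1024 * 1024,             # data page size, in bytes
    use_dictionary=True,                    # dictionary encoding on/off
    dictionary_pagesize_limit=1024 * 1024,  # dictionary page cap, in bytes
)
```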
Metadata Injection Support
All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.