Pentaho Documentation

ORC Output

The ORC Output step serializes data from the PDI data stream into the ORC file format and writes it to a file. ORC (Optimized Row Columnar) is a data format designed for fast columnar storage.

The fields written to the ORC output file are defined by the input fields. Incoming fields can be excluded from the output, or written to the output file with alternate field names or default values.

AEL Considerations

When using the ORC Output step with the Adaptive Execution Layer (AEL), the following factors affect performance and results:

  • Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
  • Metadata injection is not supported for steps running on AEL.

General

Enter the following information in the transformation step fields:

Step name
  Specifies the unique name of the ORC Output step on the canvas. You can customize the name or leave it as the default.

Location
  Indicates the file system or specific cluster on which the item you want to output can be found. For the supported file system types, see Virtual File System Browser.

Folder/File name
  Specifies the location and name of the file or folder to write. Click Browse to display the Open File window and navigate to the file or folder.
  • When running on the Pentaho engine, a single ORC file is created (file:///C:/orc-output-file, for example).
  • When running on the Spark engine, a folder is created that may contain multiple ORC files.

Overwrite existing output file
  Select to overwrite an existing file that has the same file name and extension.

Options

The ORC Output step features two tabs with fields. Each tab is described below.

Fields Tab

[Image: PDI_OrcOutputFields.png — the Fields tab of the ORC Output step]

In the Fields tab, you can define fields that make up the ORC Type description created by this step. The table below describes each of the options for configuring the ORC Type description.

ORC path
  Specify the name of the field as it will appear in the ORC data file or files.

Name
  Specify the name of the corresponding PDI field.

ORC type
  Specify the data type of the field.

Precision
  Specify the total number of digits in the number (applies only to the Decimal ORC type). The default value is 20.

Scale
  Specify the number of digits after the decimal point (applies only to the Decimal ORC type). The default value is 10.

Default value
  Specify the default value of the field if it is null or empty.

Null
  Specifies whether the field can contain null values.

To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
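The interaction between the Null and Default value settings can be sketched as follows. This is only an illustration of the rule described above, not the step's actual implementation; the field names and structure are invented for the example.

```python
# Hypothetical sketch of the Null / Default value rules: before a row is
# written, null values fall back to the field's default, and a missing
# default for a non-nullable field is a configuration error.
fields = [
    {"orc_path": "customer_id", "null_ok": False, "default": 0},
    {"orc_path": "nickname",    "null_ok": True,  "default": None},
]

def resolve_value(field, value):
    """Apply the Null / Default value rules to one incoming value."""
    if value is not None:
        return value
    if field["default"] is not None:
        return field["default"]
    if not field["null_ok"]:
        raise ValueError(
            f"Field '{field['orc_path']}' is non-nullable and has no default"
        )
    return None

row = {"customer_id": None, "nickname": None}
resolved = {f["orc_path"]: resolve_value(f, row[f["orc_path"]]) for f in fields}
# customer_id falls back to its default; nickname is allowed to stay null.
```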

You can define the fields manually, or click Get Fields to populate them from the incoming PDI data stream. During retrieval, each PDI type is converted to an appropriate ORC type, as shown in the table below. You can change the selected ORC type by using the ORC type drop-down or by entering the type manually.

PDI Type       ORC Type (non-AEL)
InetAddress    String
String         String
TimeStamp      TimeStamp
Binary         Binary
BigNumber      Decimal
Boolean        Boolean
Date           Date
Integer        Integer
Number         Double
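The PDI-to-ORC conversion table above can be expressed as a plain lookup. This is a sketch for reference only; the actual conversion happens inside the step when you click Get Fields.

```python
# The non-AEL PDI-to-ORC type conversion table, as a mapping.
PDI_TO_ORC = {
    "InetAddress": "String",
    "String": "String",
    "TimeStamp": "TimeStamp",
    "Binary": "Binary",
    "BigNumber": "Decimal",
    "Boolean": "Boolean",
    "Date": "Date",
    "Integer": "Integer",
    "Number": "Double",
}

def default_orc_type(pdi_type: str) -> str:
    """Return the ORC type that Get Fields would pre-select for a PDI type."""
    return PDI_TO_ORC[pdi_type]
```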

Options Tab

[Image: PDI_OrcOutputOptions.png — the Options tab of the ORC Output step]

The following options in the Options tab define how the ORC Output file will be created.

Compression
  Specifies the codec used to compress the ORC output file:
  • None: No compression is used (default).
  • Zlib: Writes the data blocks using the deflate algorithm specified in RFC 1951, typically implemented with the zlib library.
  • LZO: Writes the data blocks using LZO encoding, which works well for CHAR and VARCHAR columns that store very long character strings.
  • Snappy: Writes the data blocks using Google's Snappy compression library; each block is followed by the 4-byte, big-endian CRC32 checksum of its uncompressed data.

Stripe size (MB)
  Defines the stripe size in megabytes. An ORC file has one or more stripes. Each stripe is composed of rows of data, an index of the data, and a footer containing metadata about the stripe's contents. Large stripe sizes enable efficient reads from HDFS. The default is 64.
  See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC for additional information.

Compress size (KB)
  Defines the number of kilobytes in each compression chunk. The default is 256.

Inline indexes
  If checked, rows are indexed when written, enabling faster filtering and random access on read.

Rows between entries
  Defines the stride size, the number of rows between index entries (must be greater than or equal to 1000). The stride is the block of data that the ORC reader can skip during a read operation based on the indexes. The default is 10000.

Include date in file name
  Adds the system date to the file name in the format yyyyMMdd (20181231, for example).

Include time in file name
  Adds the system time to the file name in the format HHmmss (235959, for example).

Specify date time format
  Select to specify the date and time format using the drop-down list.
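The date and time filename options above can be sketched with standard strftime patterns. The function name and the separator between the base name and the stamp are invented for the example; only the yyyyMMdd and HHmmss formats come from the step's documentation.

```python
from datetime import datetime

# Sketch of the "Include date/time in file name" options: the system date
# (yyyyMMdd) and time (HHmmss) are appended to the base file name.
def output_file_name(base: str, include_date: bool, include_time: bool,
                     now: datetime) -> str:
    name = base
    if include_date:
        name += now.strftime("%Y%m%d")   # yyyyMMdd, e.g. 20181231
    if include_time:
        name += now.strftime("%H%M%S")   # HHmmss, e.g. 235959
    return name + ".orc"

stamp = datetime(2018, 12, 31, 23, 59, 59)
print(output_file_name("orc-output-file_", True, True, stamp))
# → orc-output-file_20181231235959.orc
```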

Important: Due to licensing constraints, ORC does not ship with LZO compression libraries; these must be manually installed on each node if you want to use LZO compression.
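To make the Zlib codec and the Compress size option concrete, the sketch below compresses a byte stream in fixed-size chunks with Python's zlib (an RFC 1951 deflate implementation). This is a simplification for illustration only: the real ORC writer adds chunk headers and container framing not shown here.

```python
import zlib

# Simplified illustration of the Zlib codec with "Compress size" chunking.
# The step's default compression chunk is 256 KB.
COMPRESS_SIZE = 256 * 1024  # bytes per compression chunk

def compress_chunks(data: bytes, chunk_size: int = COMPRESS_SIZE):
    """Deflate-compress the stream in independent fixed-size chunks."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def decompress_chunks(chunks):
    """Reassemble the original stream from its compressed chunks."""
    return b"".join(zlib.decompress(c) for c in chunks)

payload = b"orc " * 100_000           # ~400 KB of repetitive data
chunks = compress_chunks(payload)     # spans two 256 KB chunks
assert decompress_chunks(chunks) == payload   # round-trips losslessly
```

Chunked compression is what lets a reader decompress only the chunks it needs instead of the whole stream, which is why the chunk size interacts with the indexing options above.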

AEL Support

Depending on your setup, you can execute the ORC Output step within the Adaptive Execution Layer (AEL) using Spark as the processing engine. In AEL, the ORC Output step will automatically convert an incoming Spark SQL row to a row in the ORC output file, where the Spark types determine the ORC types that get written to the ORC file.

ORC Type     Spark Type Used
Boolean      Boolean
TinyInt      Unsupported*
SmallInt     Short
Integer      Integer
BigInt       Long
Binary       Binary
Float        Float
Double       Double
Decimal      BigNumber
Char         Unsupported*
VarChar      Unsupported*
TimeStamp    TimeStamp
Date         Date

* Some ORC types are not supported as there are no equivalent data types for conversion in Spark.
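The AEL conversion table above can likewise be expressed as a lookup, with unsupported ORC types mapped to None. This is a reference sketch, not the step's implementation.

```python
# The ORC-to-Spark type table for AEL; None marks ORC types with no
# Spark equivalent (TinyInt, Char, VarChar).
ORC_TO_SPARK = {
    "Boolean": "Boolean",
    "TinyInt": None,       # unsupported
    "SmallInt": "Short",
    "Integer": "Integer",
    "BigInt": "Long",
    "Binary": "Binary",
    "Float": "Float",
    "Double": "Double",
    "Decimal": "BigNumber",
    "Char": None,          # unsupported
    "VarChar": None,       # unsupported
    "TimeStamp": "TimeStamp",
    "Date": "Date",
}

unsupported = sorted(t for t, s in ORC_TO_SPARK.items() if s is None)
```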

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.

Metadata injection is not supported for steps running on the Adaptive Execution Layer (AEL).