Pentaho Documentation

Avro Output

The Avro Output step serializes data from the PDI data stream into Avro binary or JSON format, then writes it to a file. Apache Avro is a data serialization system that relies on a schema to decode binary data and extract its fields.

This output step creates the following files:

  • A file containing output data in the Avro format
  • An Avro schema file defined by the fields in this step

Fields can be defined manually or extracted from incoming steps.
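
To illustrate why a reader needs the schema, the following minimal sketch hand-encodes one record using the Avro binary encoding rules (zigzag variable-length integers for long values, length-prefixed UTF-8 for strings). The function names and the sample record are illustrative assumptions and are not part of PDI; the step performs the serialization for you.

```python
def encode_long(n: int) -> bytes:
    """Encode a 64-bit signed integer as an Avro long (zigzag, then varint)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(name: str, age: int) -> bytes:
    """Hypothetical record with fields (name: string, age: long), in schema order."""
    return encode_string(name) + encode_long(age)


payload = encode_record("Ann", 30)
print(payload)
```

The encoded bytes carry no field names or type tags, so they can only be interpreted with the schema that describes the field order and types.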

AEL Considerations

When using the Avro Output step with the Adaptive Execution Layer (AEL), the following factors affect performance and results:

  • Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
  • Metadata injection is not supported for steps running on AEL.

General

Enter the following information in the transformation step fields:

Field Description
Step name Specifies the unique name of the Avro Output step on the canvas. You can customize the name or leave it as the default.
Location

Indicates the file system type or specific cluster on which the item you want to output can be found. For the supported file system types, see Virtual File System Browser.

Folder/File name

Specifies the location and/or name of the file or folder to which to write. Click Browse to display the Open File window and navigate to the file or folder.

  • When running on the Pentaho engine, the Avro files are created.

  • When running on the Spark engine, a folder is created with Avro files.
Overwrite existing output file Select to overwrite an existing file that has the same file name and extension.

Options

The Avro Output transformation step features several tabs with fields. Each tab is described below.

Fields Tab

Fields Tab for the Avro Output Step

The table in the Fields tab defines the following fields that make up the Avro schema created by this step:

Field Description
Avro path The name of the field as it will appear in the Avro data and schema files. 
Name The name of the PDI field. 
Avro type Defines the Avro data type of the field. 
Precision Applies only to the Decimal Avro type. Specifies the total number of digits in the number. The default is 10.
Scale Applies only to the Decimal Avro type. Specifies the number of digits after the decimal point. The default is 0.
Default value The default value of the field if it is null or empty. 
Null Specifies if the field can contain null values.

To avoid a transformation failure, make sure the Default value field contains values for all fields where Null is set to No.
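
The rule above can be checked before a run with a small script. This is a minimal sketch that assumes the Fields table has been exported as a list of dictionaries; the keys `avro_path`, `nullable`, and `default` are hypothetical names chosen for the example and are not PDI identifiers.

```python
from typing import Dict, List


def fields_missing_defaults(rows: List[Dict[str, object]]) -> List[str]:
    """Return the Avro paths of rows where Null is No and no default is set."""
    missing = []
    for row in rows:
        nullable = row.get("nullable", True)
        default = row.get("default")
        if not nullable and default in (None, ""):
            missing.append(str(row["avro_path"]))
    return missing


# Hypothetical rows mirroring the Avro path, Null, and Default value columns
rows = [
    {"avro_path": "id", "nullable": False, "default": "0"},
    {"avro_path": "email", "nullable": False, "default": ""},
    {"avro_path": "nickname", "nullable": True, "default": None},
]
print(fields_missing_defaults(rows))
```

Any path the function returns identifies a field that needs either a default value or Null set to Yes.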

You can click Get Fields to populate the fields from the incoming PDI stream, or you can define the fields manually. When fields are retrieved, each PDI type is converted to an appropriate Avro type, as shown in the table below. If desired, you can change the converted field type to another Avro type.

PDI Type Avro Type (non AEL) Avro Type (AEL)
InetAddress String String
String String String
TimeStamp TimeStamp Long
Binary Bytes Bytes
BigNumber Decimal Not supported
Boolean Boolean Boolean
Date Date Integer
Integer Long Long
Number Double Double

Get Fields provides field conversion from BigNumber to Decimal. However, Decimal types are not supported when running a transformation in AEL, so you must convert the field to another appropriate Avro type.
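
The conversion table above can be expressed as a lookup, which is handy for spotting fields that need a manual type change before a transformation runs on AEL. This is a sketch only: the dictionary is copied from the table, while `avro_type` and `unsupported_on_ael` are hypothetical helper names, not PDI functions.

```python
from typing import Dict, List, Optional, Tuple

# PDI type -> (Avro type without AEL, Avro type on AEL), copied from the table.
# None marks a conversion that is not supported on AEL.
TYPE_MAP: Dict[str, Tuple[str, Optional[str]]] = {
    "InetAddress": ("String", "String"),
    "String": ("String", "String"),
    "TimeStamp": ("TimeStamp", "Long"),
    "Binary": ("Bytes", "Bytes"),
    "BigNumber": ("Decimal", None),
    "Boolean": ("Boolean", "Boolean"),
    "Date": ("Date", "Integer"),
    "Integer": ("Long", "Long"),
    "Number": ("Double", "Double"),
}


def avro_type(pdi_type: str, ael: bool = False) -> Optional[str]:
    """Look up the Avro type that Get Fields assigns to a PDI type."""
    non_ael, on_ael = TYPE_MAP[pdi_type]
    return on_ael if ael else non_ael


def unsupported_on_ael(pdi_types: List[str]) -> List[str]:
    """List the PDI types that have no supported Avro type on AEL."""
    return [t for t in pdi_types if avro_type(t, ael=True) is None]


print(avro_type("Date"), avro_type("Date", ael=True))
print(unsupported_on_ael(["String", "BigNumber", "Integer"]))
```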

Schema Tab

Schema Tab for the Avro Output Step

The following options in the Schema tab define how the Avro schema file will be created:

Option Description
File name

Specifies the fully qualified URL where the Avro schema file will be written. The URL may be in a different format depending on file system type (Location field). If a schema file already exists, it will be overwritten. If you do not specify a separate schema file for your output, PDI will write an embedded schema in your Avro data file.

Namespace Specifies the namespace that, together with the Record name field, defines the "full name" of the schema (for example, ‘example.avro’).
Record name Specifies the name of the Avro record (for example, ‘User’).
Doc value Specifies the documentation provided for the schema.
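
As a rough illustration of how these options relate to an Avro schema, the sketch below builds a record schema as JSON and derives its full name from the namespace and record name. The helper names and the two sample fields are assumptions made for the example; the exact layout of the schema file that PDI writes may differ.

```python
import json
from typing import Any, Dict, List


def build_schema(namespace: str, record_name: str, doc: str,
                 fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Assemble an Avro record schema from the Schema tab options."""
    return {
        "type": "record",
        "namespace": namespace,
        "name": record_name,
        "doc": doc,
        "fields": fields,
    }


def full_name(schema: Dict[str, Any]) -> str:
    """The full name is the namespace and the record name joined by a dot."""
    namespace = schema.get("namespace")
    return f"{namespace}.{schema['name']}" if namespace else schema["name"]


schema = build_schema(
    "example.avro",
    "User",
    "A hypothetical user record",
    [
        {"name": "name", "type": "string"},
        # A nullable field is a union with "null"; its default is null
        {"name": "favorite_number", "type": ["null", "long"], "default": None},
    ],
)
print(full_name(schema))
print(json.dumps(schema, indent=2))
```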

Options Tab

Options Tab for the Avro Output Step

Option Description
Compression

Specifies which of the following codecs is used to compress data blocks in the Avro output file:

  • None: No compression is used (default).   
  • Deflate: The data blocks are written using the deflate algorithm as specified in RFC 1951, and typically implemented using the zlib library.
  • Snappy: The data blocks are written using Google's Snappy compression library, and are followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in each block.

See https://avro.apache.org/docs/1.8.1/spec.html#Object+Container+Files for additional information on these codecs.
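
To make the codec descriptions concrete, the sketch below uses only the Python standard library to compress a block with raw deflate (RFC 1951, without the zlib header) and to compute the 4-byte, big-endian CRC32 that follows a Snappy-compressed block. The function names are hypothetical, and Snappy compression itself is omitted because it needs a third-party library; the step applies the selected codec for you.

```python
import struct
import zlib


def deflate_block(data: bytes) -> bytes:
    """Compress with raw deflate; negative wbits omits the zlib header and trailer."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_block(block: bytes) -> bytes:
    """Reverse deflate_block by reading a raw deflate stream."""
    return zlib.decompress(block, -15)


def snappy_crc_suffix(uncompressed: bytes) -> bytes:
    """CRC32 of the uncompressed data, packed as 4 big-endian bytes."""
    return struct.pack(">I", zlib.crc32(uncompressed) & 0xFFFFFFFF)


block = b"avro " * 1000
compressed = deflate_block(block)
assert inflate_block(compressed) == block
print(len(block), len(compressed), snappy_crc_suffix(block).hex())
```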

Include date in filename Adds the system date on which the file was generated to the output file name, using the default format yyyyMMdd (for example, 20181231).
Include time in filename Adds the system time at which the file was generated to the output file name, using the default format HHmmss (for example, 235959).
Specify date time format Specifies a different date-time format for the output file name, selected from the options available in the drop-down list.
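
The default formats map directly to common date formatting patterns: yyyyMMdd corresponds to `%Y%m%d` and HHmmss to `%H%M%S` in Python. The sketch below shows what a stamped file name could look like. Where the stamp is placed and the underscore separator are assumptions made for this example, and `stamped_name` is a hypothetical helper, so check the file names PDI actually produces in your environment.

```python
from datetime import datetime
from pathlib import PurePosixPath


def stamped_name(filename: str, now: datetime,
                 include_date: bool = False, include_time: bool = False) -> str:
    """Append date and/or time stamps to a file name, before its extension."""
    path = PurePosixPath(filename)
    parts = [path.stem]
    if include_date:
        parts.append(now.strftime("%Y%m%d"))  # yyyyMMdd
    if include_time:
        parts.append(now.strftime("%H%M%S"))  # HHmmss
    # Assumed layout: stem_date_time.ext
    return str(path.with_name("_".join(parts) + path.suffix))


now = datetime(2018, 12, 31, 23, 59, 59)
print(stamped_name("/tmp/users.avro", now, include_date=True, include_time=True))
```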

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.

Metadata injection is not supported for steps running on the Adaptive Execution Layer (AEL).