Pentaho Documentation

Avro Input

The Avro Input step decodes binary or JSON Avro data and extracts fields from the structure it defines. Apache Avro is a data serialization system. This step extracts the data from an Avro file to be used in the PDI stream.

AEL Considerations

When using the Avro Input step with the Adaptive Execution Layer (AEL), the following factors affect performance and results:

  • Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
  • Metadata injection is not supported for steps running on AEL.

General

The following fields and button are general to this transformation step:

Field Description
Step name Specifies the unique name of the Avro Input step on the canvas. You can customize the name or leave it as the default.
Location Indicates the file system or specific cluster where the source file is located. For the supported file system types, see Virtual File System Browser.
Folder/File name The fully qualified URL of the source file or folder for the input fields.
  • When running on the Pentaho engine, specify a single Avro file to read as input (for example, file:///C:/avro-input-file).
  • When running on the Spark engine, specify a folder. All the Avro files within that folder are read as input.
Preview Displays the rows generated by this step.

Options

The Avro Input transformation step features several tabs with fields. Each tab is described below.

Fields Tab

The table in the Fields tab defines the following input fields from the Avro source:

Field Description
Avro path (Avro type) The location of the Avro source (and its format type).
Name The name of the input field.
Type The type of the input field, such as ‘String’ or ‘Date’.

After you have provided a path to an Avro data file or Avro schema, click Get Fields to populate the fields. 

These fields represent the Avro schema. When a schema field is retrieved, its Avro type is converted to an appropriate PDI type, which you can change afterward. The following table lists the Avro-to-PDI data type conversions.

Avro Type PDI Type
String String
TimeStamp TimeStamp
Bytes Binary
Decimal BigNumber
Boolean Boolean
Date Date
Long Integer
Double Number
int Integer
float Number
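
As an illustration only, the conversion table above can be written as a lookup. This is a minimal Python sketch, not PDI code: the names AVRO_TO_PDI and pdi_type_for are hypothetical, and the case-insensitive matching is an assumption made because the table mixes capitalized and lowercase type names.

```python
# Avro-to-PDI default type conversions, transcribed from the table above.
# Keys are lowercased so that lookups can ignore case (an assumption of this sketch).
AVRO_TO_PDI = {
    "string": "String",
    "timestamp": "TimeStamp",
    "bytes": "Binary",
    "decimal": "BigNumber",
    "boolean": "Boolean",
    "date": "Date",
    "long": "Integer",
    "double": "Number",
    "int": "Integer",
    "float": "Number",
}


def pdi_type_for(avro_type: str) -> str:
    """Return the default PDI type listed for an Avro type name."""
    key = avro_type.strip().lower()
    if key not in AVRO_TO_PDI:
        # Types that the table does not list are rejected rather than guessed.
        raise ValueError(f"No default PDI type listed for Avro type {avro_type!r}")
    return AVRO_TO_PDI[key]
```

Both Avro integer types (int and long) map to the PDI Integer type, and both floating-point types (float and double) map to Number.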

The default format mask for the date type is yyyy-MM-dd. The default format mask for the timestamp type is yyyy-MM-dd HH:mm:ss.SSS. If a date or timestamp was stored as a string in any other format, the column data cannot be retrieved, and null is returned for that column.
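
To check string data against the default masks before running a transformation, a sketch like the following may help. It is written in Python rather than with PDI itself. The strptime patterns are assumed equivalents of the two masks, and parse_with_default_mask is a hypothetical helper that returns None where the step would return null.

```python
from datetime import datetime
from typing import Optional

# Assumed Python equivalents of the default PDI format masks.
DATE_FORMAT = "%Y-%m-%d"                    # yyyy-MM-dd
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"   # yyyy-MM-dd HH:mm:ss.SSS


def parse_with_default_mask(value: str, fmt: str) -> Optional[datetime]:
    """Parse value with fmt, returning None when the string does not match."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        # Mirrors the documented behavior: an unmatched format yields null.
        return None
```

For example, parse_with_default_mask("04/03/2021", DATE_FORMAT) returns None because the string does not follow yyyy-MM-dd. Note that Python's %f accepts one to six fractional digits, so this check is slightly more permissive than a strict three-digit SSS mask.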

Schema Tab

Schema Tab for the Avro Input Step

This tab includes the following field to define the source for your Avro schema:

  • File name: Specify the Avro schema file by entering its path as a fully qualified URL (for example, file:///C:/avro-output-schema) or by clicking Browse. A separate schema file is not required. If you do not specify a schema file, PDI attempts to retrieve the fields from the schema embedded in the Avro data file.
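
To see what an embedded schema looks like, the header of an Avro data file can be inspected outside of PDI. The sketch below is a standalone Python illustration, not how PDI itself is implemented. It follows the Apache Avro object container file layout (the 4-byte magic Obj\x01, then a metadata map whose avro.schema entry holds the schema as JSON). The function names are hypothetical.

```python
import json
from typing import BinaryIO

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def _read_long(stream: BinaryIO) -> int:
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    result = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of Avro header")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag encoding


def _read_bytes(stream: BinaryIO) -> bytes:
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_embedded_schema(stream: BinaryIO):
    """Return the schema embedded in an Avro data file as parsed JSON."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count ends the metadata map
            break
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    if "avro.schema" not in metadata:
        raise ValueError("no embedded schema found in the file header")
    return json.loads(metadata["avro.schema"].decode("utf-8"))
```

A typical call opens the data file in binary mode, for example with open(path, "rb") as f: schema = read_embedded_schema(f), where path is a local file of your choosing. A file that holds only a schema in JSON form, with no container header, fails the magic check, so this sketch applies only to Avro data files.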

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.

Metadata injection is not supported for steps running on the Adaptive Execution Layer (AEL).