Pentaho Documentation

Avro Input

Apache Avro is a data serialization system. The Avro Input step decodes binary or JSON Avro data from an Avro file and extracts fields from the structure it defines, so that the data can be used in the PDI stream.

General

The following fields and button are general to this transformation step:

Field Description
Step Name: Specifies the unique name of the Avro Input step on the canvas. You can customize the name or leave it as the default.
Location: Indicates the file system or specific cluster where the source file is located. Options are as follows:

  • Local: Specifies that the source file is in a file system that is local to the PDI client.
  • Hadoop Cluster: Specifies that the source file is in the cluster indicated.
  • S3: Specifies that the source file is on the S3 file system.
  • HDFS (default): Specifies that the source file is on any Hadoop distributed file system, except MapR.
  • MapRFS: Specifies that the source file is on the MapR file system.
Folder/File Name: The fully qualified URL of the source file for the input fields.

  • When running on the Pentaho engine, specify a single Avro file to read as input (for example, file:///C:/avro-input-file).
  • When running on the Spark engine, specify a folder. All the Avro files within that folder are read as input.
Preview: Displays the rows generated by this step.
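As a rough illustration of what a fully qualified URL in the Folder/File Name field looks like, the following Python sketch splits a source location into its scheme and path using only the standard library. The function name split_source_url and the hdfs:// host and folder are hypothetical, chosen for illustration. This is not how PDI itself resolves locations.

```python
from urllib.parse import urlparse


def split_source_url(url):
    """Return (scheme, path) for a fully qualified URL, or raise ValueError."""
    parsed = urlparse(url)
    if not parsed.scheme:
        # A bare path such as /data/input.avro has no scheme,
        # so it is not fully qualified.
        raise ValueError(f"not a fully qualified URL: {url!r}")
    return parsed.scheme, parsed.path


# The example from the text: a single local file (Pentaho engine).
print(split_source_url("file:///C:/avro-input-file"))

# Hypothetical cluster folder, as might be given when running on Spark.
print(split_source_url("hdfs://namenode.example.com:8020/data/avro-folder/"))
```

The scheme (file, hdfs, and so on) identifies the file system, and the path identifies the file or folder within it.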

Options

The Avro Input transformation step features several tabs with fields. Each tab is described below.

Fields Tab

Fields Tab for the Avro Input Step

The table in the Fields tab defines the following input fields from the Avro source:

Field Description
Path: The location of the Avro source.
Name: The name of the input field.
Type: The type of the input field, such as ‘String’ or ‘Date’.

The default format mask for the date type is yyyy-MM-dd. The default format mask for the timestamp type is yyyy-MM-dd HH:mm:ss.SSS. If the data was stored as a string data type in any other format, the column data cannot be retrieved, and null is returned for that column.
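The effect of the default masks can be sketched in Python. The example below is only an analogy under stated assumptions: it maps the two masks to what are assumed to be their closest strptime equivalents and returns None where the step would return null. The helper name parse_or_none and the sample values are hypothetical, and the step's actual parsing may differ in detail.

```python
from datetime import datetime

# Assumed Python equivalents of the default masks named in the text:
#   yyyy-MM-dd              -> %Y-%m-%d
#   yyyy-MM-dd HH:mm:ss.SSS -> %Y-%m-%d %H:%M:%S.%f
# Note: %f accepts 1 to 6 fractional digits, so it is more lenient than SSS.
DATE_MASK = "%Y-%m-%d"
TIMESTAMP_MASK = "%Y-%m-%d %H:%M:%S.%f"


def parse_or_none(text, mask):
    """Parse text with the given mask; return None if it does not match."""
    try:
        return datetime.strptime(text, mask)
    except ValueError:
        return None


# A string in the default date format parses successfully.
print(parse_or_none("2019-03-15", DATE_MASK))

# A string in the default timestamp format parses successfully.
print(parse_or_none("2019-03-15 08:30:00.250", TIMESTAMP_MASK))

# A string stored in any other format cannot be read and yields None.
print(parse_or_none("15/03/2019", DATE_MASK))
```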

You can manually define the fields in the table, or you can click Get Fields to populate them from the incoming PDI stream.

Schema Tab

Schema Tab for the Avro Input Step

This tab includes the following field to define the source for your Avro schema:

  • File name: Specify the Avro schema file by entering its path as a fully qualified URL (file:///C:/avro-output-schema for example) or by clicking Browse. A separate schema file is not required. If you do not specify the schema file, PDI will attempt to retrieve the fields from the embedded schema in the Avro data file.
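For orientation, an Avro schema is a JSON document, whether it is embedded in the data file or kept in a separate schema file. The Python sketch below parses a small, hypothetical record schema with the standard json module and lists its top-level fields. The record name Customer, its field names, and the helper list_fields are made up for illustration; the step itself does not use this code.

```python
import json

# Hypothetical contents of an Avro schema file. The record and field
# names are invented for this example.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "signup_date", "type": ["null", "string"], "default": null}
  ]
}
"""


def list_fields(schema_text):
    """Return (name, type) pairs for the top-level fields of a record schema."""
    schema = json.loads(schema_text)
    if schema.get("type") != "record":
        raise ValueError("expected a record schema")
    return [(field["name"], field["type"]) for field in schema["fields"]]


for name, avro_type in list_fields(SCHEMA_TEXT):
    print(name, avro_type)
```

The union type ["null", "string"] on signup_date marks the field as nullable, which is the usual way an Avro schema allows a missing value.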

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.