Skip to main content
Pentaho Documentation

CSV File Input

The CSV File Input step reads data from delimited text files into a PDI transformation. While this step is called ‘CSV File Input’, you can also use CSV File Input with many other separator types, such as pipes, tabs, and semicolons.

The semicolon (;) is set as the default separator type for this step.

The options for this step are a subset of the Text File Input step. This step differs from the Text File Input step in the following ways:

  • NIO – Non-blocking I/O is used for native system calls to read the file faster, but is limited to local files. It does not support VFS.
  • Parallel Running – If you configure this step to run in multiple copies (or in a clustered mode) and you enable parallel running, each copy will read a separate block of a single file. You can distribute the reading of a file to several threads or even several slave nodes in a clustered transformation.
  • Lazy Conversion – If you are reading many fields from a file and many of those fields will not be manipulated but merely passed through the transformation to land in some other text file or a database, lazy conversion can prevent PDI from performing unnecessary work on those fields (such as converting them into objects like strings, dates, or numbers).

An example of a simple CSV input transformation (CSV Input - Reading customer data.ktr) can be found in the data-integration/samples/transformations directory.

Options

Properties Dialog Box for the CSV File Input PDI Step

The CSV File Input step has the following options:

Option Description
Step name

Specify the unique name of the CSV File Input step on the canvas. You can customize the name or leave it as the default.

Filename (or The filename field)

Specify one of the following names:

  • The name (Filename) of the CSV source file
  • The name (The filename field) of the field that contains the name(s) of multiple source files if the CSV File Input step receives data from another step. If the CSV File Input step receives data, the Include the filename in output? option also appears.
Include the filename in the output?

If the CSV File Input receives data from another step, indicate the if the name of the input source file should be included in the output of the CSV File Input step.

Delimiter

Specify the file delimiter character used in the source file. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].

The default delimiter for the CSV File Input step is a semicolon (;).

Enclosure

Specify the enclosure character used in the source file. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].

NIO buffer size

Specify the size of the read buffer, the number of bytes that is read at one time from the source.

Lazy conversion

Indicate if the lazy conversion algorithm may be used to improve performance. The lazy conversion algorithm tries to avoid unnecessary data type conversions if possible. It can result in significant performance improvements. The typical example is reading from a text file and writing back to a text file.

Header row present?

Indicate if the source file contains a header row containing column names.

Add filename to result

Adds the CSV source filename(s) to the result of this transformation.

The row number field name (optional)

Specify the name of the field that will contain the row number in the output of this step.

Running in parallel?

Indicate if you will have multiple instances of this step running (step copies) and if you want each instance to read a separate part of the CSV file(s).

When reading multiple files, the total size of all files is taken into consideration to split the workload. In that specific case, make sure that ALL step copies receive all files that need to be read, otherwise, the parallel algorithm will not work correctly.

For technical reasons, parallel reading of CSV files is only supported on files that do not have fields with line breaks or carriage returns in them.

New line possible in fields?

Indicate if data fields may contain new line characters.

File encoding

Specify the encoding of the source file.

Fields

You can specify what fields to read from your CSV file through the Fields table. Click Get fields to have the step populate the table with fields derived from the source file based on the current specified settings (such as Delimiter or Enclosure). All fields identified by this step will be added to the table.

The table contains the following columns:

Column Description
Name Name of the field
Type Type of field (either String, Date, or Number)
Format

An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this step. 

Length

The length of the field depends on the following field types:

  • Number: Total number of significant figures in a number
  • String: Total length of string
  • Date: Length of printed output of the string (for example, four is a length for a year)
Precision

Number of floating point digits for number-type fields

Currency

Symbol used to represent currencies ($5,000.00 or €5.000,00 for example)

Decimal

A decimal point can be a "." or "," (5,000.00 or 5.000,00 for example)

Group

A grouping can be a "," or "." (5,000.00 or 5.000,00 for example)

Trim Type

The trimming method to apply to a string

Click Preview to view the data coming from the source file.

Metadata Injection Support

You can use the Metadata Injection supported fields with ETL Metadata Injection step to pass metadata to your transformation at runtime. The following Option and Value fields of the CSV File Input step support metadata injection:

  • Options: Filename, Delimiter, Enclosure, NIO Buffer Size, Lazy Conversion, Header Row Present?, Add Filename to Result, The Row Number Field Name, Running in Parallel?, and File Encoding.
  • Values: Name, Length, Decimal, Type, Precision, Group, Format, Currency, and Trim Type.