The CSV File Input step reads data from delimited text files into a PDI transformation. While this step is called CSV File Input, you can also use CSV File Input with many other separator types, such as pipes, tabs, and semicolons.
Non-blocking I/O is used for native system calls to read the file faster, but is limited to local files. It does not support VFS.
If you configure this step to run in multiple copies (or in a clustered mode) and you enable parallel running, each copy will read a separate block of a single file. You can distribute the reading of a file to several threads or even several slave nodes in a clustered transformation.
If you are reading many fields from a file and many of those fields will not be manipulated but merely passed through the transformation to land in some other text file or a database, lazy conversion can prevent PDI from performing unnecessary work on those fields (such as converting them into objects like strings, dates, or numbers).
An example of a simple CSV input transformation (CSV Input - Reading customer data.ktr) can be found in the data-integration/samples/transformations directory.
The CSV File Input step has the following options:
|Step name||Specify the unique name of the CSV File Input step on the canvas. You can customize the name or leave it as the default.|
|Filename||Specify the name of the input CSV file or navigate to the input
file by clicking Browse.|
If your source is from a previous step, the Browse button is hidden. Use the drop-down menu in the text box to select the field to use as the name or names of your CSV file(s).
|Include the filename in the output? (Only appears if your source is from a previous step)||If your source is from a previous step, select if you want the name of the input source file included in the output.|
|Delimiter||Specify the file delimiter character used in the source file.
Special characters (for example, CHAR HEX01) can be set with
the format $[value]. For example,
$ or $[6F,FF,00,1F]. |
The default delimiter for the CSV File Input step is a semicolon ;.
|Enclosure||Specify the enclosure character used in the source file. Special characters (for example, CHAR HEX01) can be set with the format $[value], such as $ or $[6F,FF,00,1F].|
|NIO buffer size||Specify the size of the read buffer, the number of bytes that is read at one time from the source.|
|Lazy conversion?||Indicate if the lazy conversion algorithm may be used to improve performance. The lazy conversion algorithm tries to avoid unnecessary data type conversions if possible. It can result in significant performance improvements. The typical example is reading from a text file and writing back to a text file.|
|Header row present?||Indicate if the source file contains a header row containing column names.|
|Add filename to result||Adds the CSV source filename(s) to the result of this transformation.|
|The row number field name (optional)||Specify the name of the field that will contain the row number in the output of this step.|
|Running in parallel?||Indicate if you will have multiple instances of this step running
(step copies) and if you want each instance to read a separate part of the CSV
When reading multiple files, the total size of all files is taken into consideration to split the workload. In that specific case, make sure that ALL step copies receive all files that need to be read, otherwise, the parallel algorithm will not work correctly.
Caution: For technical reasons, parallel reading of CSV files is only supported on files that do not have fields with line breaks or carriage returns in them.
|New line possible in fields?||Indicate if data fields may contain new line characters.|
|Format||Select the file format, which can be either DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done|
|File encoding||Specify the encoding of the source file.|
You can specify what fields to read from your CSV file through the Fields table. Click Get fields to have the step populate the table with fields derived from the source file based on the current specified settings (such as Delimiter or Enclosure). All fields identified by this step are added to the table.
The table contains the following columns:
|Name||Name of the field.|
|Type||Type of field (either String, Date, or Number).|
|Format||An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this step.|
The length of the field depends on the following field types:
|Precision||Number of floating point digits for number-type fields.|
|Currency||Symbol used to represent currencies ($5,000.00 or €5.000,00 for example).|
|Decimal||A decimal point can be a "." or "," (5,000.00 or 5.000,00 for example).|
|Group||A grouping can be a "," or "." (5,000.00 or 5.000,00 for example).|
|Trim Type||The trimming method to apply to a string.|
Click Preview to view the data coming from the source file.
See Understanding PDI data types and field metadata to maximize the efficiency of your transformation and job results.
Metadata injection support
You can use the metadata injection supported fields with ETL metadata injection step to pass metadata to your transformation at runtime. The following Option and Value fields of the CSV File Input step support metadata injection:
- Options: Filename, Delimiter, Enclosure, NIO Buffer Size, Lazy Conversion, Header Row Present?, Add Filename to Result, The Row Number Field Name, Running in Parallel?, and File Encoding.
- Values: Name, Length, Decimal, Type, Precision, Group, Format, Currency, and Trim Type.