The Text file input step reads data from a variety of text-file types, including formats generated by spreadsheets and fixed-width flat files. The features of the step allow you to read from a list of files or directories, use wildcards in the form of regular expressions, and accept filenames passed in from previous steps.
When using the Text file input step with the Adaptive Execution Layer, the following factors affect performance and results:
- Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
- If you are using this step to extract data from Amazon Simple Storage Service (S3), browse to the URI of the S3 system or specify the Uri field option in the Additional output fields tab. S3 and S3n are supported.
Enter the following information in the transformation step name field:
- Step name: Specify the unique name of the Text file input step on the canvas. You can customize the name or leave it as the default.
You can use Preview rows to display the rows generated by this step. The Text file input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.
The Text file input step features several tabs with fields. Each tab is described below.
Use the File tab to enter the following connection information for your source.
|File or directory||Specify the source location if the source is not defined in a field. Click Browse to navigate to your source file or directory. Click Add to include the source in the Selected files table. If the source location is defined in a field, use the Accept filenames from previous steps option to specify your file name.|
|Regular expression||Specify a regular expression to match filenames within a specified directory.|
|Exclude regular expression||Specify a regular expression to exclude filenames within a specified directory.|
Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.
|File Name||Regular Expression||Files Selected|
|/dirA/||.*userdata.*\.txt||Find all files in /dirA/ with names containing userdata and ending with .txt|
|/dirB/||AAA.*||Find all files in /dirB/ with names that start with AAA|
|/dirC/||[A-Z][0-9].*||Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)|
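As a rough illustration of how expressions like these select file names, the following sketch uses Java's `java.util.regex` package. It assumes that the expression must match the whole file name, which may differ from the step's exact behavior. The class name `WildcardDemo`, the helper `select`, and the sample file names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardDemo {
    // Returns the names that the regular expression matches in full.
    // Whole-name matching is an assumption made for this sketch.
    static List<String> select(String regex, List<String> names) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing.
        List<String> names = Arrays.asList(
            "2023_userdata_eu.txt", "userdata.csv",
            "AAA_report.txt", "B7_log.txt", "b7_log.txt");

        System.out.println(select(".*userdata.*\\.txt", names)); // [2023_userdata_eu.txt]
        System.out.println(select("AAA.*", names));              // [AAA_report.txt]
        System.out.println(select("[A-Z][0-9].*", names));       // [B7_log.txt]
    }
}
```

Note that the backslash in `\.txt` must be doubled inside a Java string literal, but not when the expression is typed into the step dialog.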
Selected files table
The Selected files table shows files or directories to use as source locations for input. This table is populated by clicking Add after you specify a File or directory. The input step tries to connect to the specified file or directory when you click Add to include it in the table.
The table contains the following columns:
|File/Directory||The source location indicated by clicking Add after specifying it in File or directory.|
|Wildcard (RegExp)||Specify a regular expression to match filenames within a specified directory.|
|Exclude wildcard||Specify a regular expression to exclude filenames within a specified directory.|
|Required||Required source location for input.|
|Include subfolders||Whether subfolders are included within the source location.|
Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it to the File or directory option.
Accept file names
You can specify your file name and pass it to the input step, which allows the file name to come from any source, such as a text file or database table.
|Accept filenames from previous step||Select to get file names from previous steps.|
|Pass through fields from previous step||Select to get field information from previous steps.|
|Step to read file names from||Enter the name of the step from which to read the file names.|
|Field in the input to use as filename||Enter the name of the field in the input step to determine which file name to use.|
Show action buttons
When you have entered information in the File tab fields, select an action button if you want to look at the source file names or data content.
|Show filename(s)||Select to display the file names of the sources connected to the step.|
|Show file content||Select to display the raw content of the selected file.|
|Show content from first data line||Select to display the content from the first data line for the selected file.|
Use the following options in the Content tab to specify the format of the source files.
|Filetype||Select either CSV or Fixed length. Depending on the file type you select, a corresponding interface appears when you click Get Fields in the Fields tab.|
|Separator||Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert Tab to place a tab in the Separator field. The default value is semicolon (;).|
|Enclosure||Specify an optional character used to enclose a field if that field contains a separator character. The default value is double quotation marks (").|
|Allow breaks in enclosed fields||Not implemented.|
|Escape||Specify one or more characters that mark the following character as regular text rather than as an enclosure or separator. For example, if a backslash (\) is the escape character and a single quote (') is an enclosure or separator character, then the text Not the nine o\'clock news is parsed as Not the nine o'clock news.|
|Header||Select if your text file has a header row (first lines in the file). You can use Number of header lines to specify the number of times the header line appears.|
|Footer||Select if your text file has a footer row (last lines in the file). You can use Number of footer lines to specify the number of times the footer row appears.|
|Wrapped lines||Select if you work with data lines that have wrapped beyond a specific page limit. You can use Number of times wrapped to specify the number of times the line is wrapped. Headers and footers are never considered wrapped.|
|Paged layout (printout)||Select when other text handling options (above) fail on a text file designed to be output to a line printer. You can use Document header lines to skip introductory texts and Number of lines per page to position the data lines.|
|Compression||Select if your text file is in a ZIP or GZip archive. Only the first file in the archive is read.|
|No empty rows||Select if you do not want to send empty rows to the next steps.|
|Include filename in output||Select if you want the file name to be part of the output, and use Filename fieldname to enter the name of the field that contains the file name.|
|Rownum in output||Select if you want the row number to be part of the output. You can use Rownum fieldname to enter the name of the field that contains the row number. Select Rownum by file if you want to allow the row number to be reset per file.|
|Format||Select the file format, which can be either DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.|
|Encoding||Select the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, the PDI client searches your system for available encodings.|
Select the length of the field according to its type:
|Limit||Specify a limit on the number of records generated from this step. Specify zero (0) for an unlimited number of records.|
|Be lenient when parsing dates?||Clear the check box if you want strict parsing of date fields. If selected, dates like Jan 32nd become Feb 1st.|
|The date format Locale||Specify the locale to use to parse dates written in full, such as February 2nd, 2006. For example, parsing February 2nd, 2006, on a system set to French (fr_FR) would not work because February is called Février in that locale.|
|Add filenames to result||Select to add file names to a resulting list of file names.|
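The lenient-parsing and date-locale options can be illustrated with Java's `SimpleDateFormat`, whose pattern letters match the date format table in this document. This is a sketch that assumes the step behaves like that class; the class name `DateParsingDemo`, the helper `parse`, and the sample patterns are invented for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class DateParsingDemo {
    // Parses the text with the given pattern and locale.
    // Returns the date as yyyy-MM-dd, or null when the text is rejected.
    static String parse(String text, String pattern, Locale locale, boolean lenient) {
        SimpleDateFormat input = new SimpleDateFormat(pattern, locale);
        input.setLenient(lenient);
        try {
            Date date = input.parse(text);
            return new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).format(date);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Lenient parsing rolls day 32 of January over into February.
        System.out.println(parse("Jan 32 2006", "MMM d yyyy", Locale.ENGLISH, true));   // 2006-02-01
        // Strict parsing rejects the same text.
        System.out.println(parse("Jan 32 2006", "MMM d yyyy", Locale.ENGLISH, false));  // null

        // A month written in full is only recognized in a matching locale.
        System.out.println(parse("February 2, 2006", "MMMM d, yyyy", Locale.ENGLISH, false)); // 2006-02-02
        System.out.println(parse("February 2, 2006", "MMMM d, yyyy", Locale.FRENCH, false));  // null
    }
}
```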
Error Handling tab
In the Error Handling tab, you can specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, wrong number of fields, and premature line ends. The following table contains options for error handling:
|Ignore errors?||Select if you want to ignore errors during parsing.|
|Skip error files?||Select if you want to skip those files that contain errors. You can generate a file that contains a listing of files where the errors occur. Otherwise, files with errors are not skipped, and the files that have parsing errors are empty (null).|
|Error file field name||Specify an error file name if you want to add field names where errors occurred.|
|File error message field name||Specify an error message field name if you want to add field names where errors occurred in the error file.|
|Skip error lines?||Select if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers where the errors occur. Otherwise, lines with errors are not skipped, and the fields that have parsing errors are empty (null).|
|Error count fieldname||Specify the field name if you want to add a field containing the number of errors on the line to the output rows.|
|Error fields fieldname||Specify the field name if you want to add a field containing the names of fields where errors occurred to the output rows.|
|Error text fieldname||Specify the field name if you want to add a field containing descriptions of the parsing error occurrences to the output rows.|
|Warning files directory||Specify the location of the directory where warnings are placed if they are generated. The name of the resulting file is <warning dir>/filename.<date_time>.<warning extension>.|
|Error files directory||Specify the location of the directory where errors are placed if they occur. The name of the resulting file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.|
|Failing line numbers files directory||Specify the location of the directory where parsing errors on a line are placed if they occur. The name of the resulting file is <errorline dir>/filename.<date_time>.<errorline extension>.|
The Filters tab contains a table in which you specify the lines you want to skip in the text file. The table has the following columns:
|Filter string||The string that you want to search for.|
|Filter position||The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero (0), the filter string is searched for in the entire string.|
|Stop on filter||Enter Y if you want to stop processing the current text file when the filter string is encountered. Enter N to continue processing after encountering the string.|
|Positive match||Enter Y if you want to process lines that match the filter string. Enter N to ignore matching lines.|
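One possible reading of the Filter string, Filter position, and Positive match columns is sketched below in Java. This is not the step's actual implementation: the class `LineFilterDemo` and its methods are hypothetical, and the sketch assumes that a line is kept only when its match result equals the Positive match setting.

```java
class LineFilterDemo {
    // True when the filter string is found at the given zero-based position,
    // or anywhere in the line when the position is below zero.
    static boolean matches(String line, String filter, int position) {
        if (position < 0) {
            return line.contains(filter);
        }
        return line.startsWith(filter, position);
    }

    // positiveMatch = true (Y) keeps matching lines; false (N) ignores them.
    static boolean keep(String line, String filter, int position, boolean positiveMatch) {
        return matches(line, filter, position) == positiveMatch;
    }

    public static void main(String[] args) {
        // Skip comment lines that start with '#'.
        System.out.println(keep("# comment", "#", 0, false));     // false
        System.out.println(keep("value # note", "#", 0, false));  // true
        // A negative position searches the entire line.
        System.out.println(keep("value # note", "#", -1, false)); // false
    }
}
```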
The Fields tab contains a table in which you specify information about the fields being read from the text file. The table has the following columns:
|Name||The name of the field.|
|Type||The data type of the field.|
|Format||See Number formats for a description of format symbols.|
|Position||The position is needed when processing the Fixed file type. It is zero-based, so the first character is at position 0.|
The value of this field depends on format:
The value of this field depends on format:
|Currency||Used to interpret numbers such as $10,000.00 or €5.000,00.|
|Decimal||A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.|
|Group||A grouping symbol can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.|
|Null if||Used to set as null (empty) if the string is equal to the specified value.|
|Default||Used to specify a default value to use in case the field in the text file was not specified (empty).|
|Trim type||The trimming method to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified. You can specify one of the following options:|
|Repeat||If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).|
|Get Fields (button)||Click to retrieve a list of fields from the input stream.|
|Minimal width (button)||Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.|
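The effect of the Decimal and Group settings can be sketched with Java's `DecimalFormat` and `DecimalFormatSymbols`. The sketch assumes the step interprets these symbols the way those classes do; the class `SeparatorDemo`, the helper `parse`, and the pattern `#,##0.00` are chosen for illustration only.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class SeparatorDemo {
    // Parses text using the given decimal and grouping symbols.
    static double parse(String text, char decimal, char group) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(decimal);
        symbols.setGroupingSeparator(group);
        try {
            return new DecimalFormat("#,##0.00", symbols).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Period as decimal symbol, comma as grouping symbol.
        System.out.println(parse("10,000.00", '.', ','));  // 10000.0
        // Comma as decimal symbol, period as grouping symbol.
        System.out.println(parse("5.000,25", ',', '.'));   // 5000.25
    }
}
```

Swapping the two symbols changes how the same text is read, which is why both settings need to agree with the source file.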
Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.
|Symbol||Location||Localized||Meaning|
|#||Number||Yes||Digit, zero shows as absent.|
|.||Number||Yes||Decimal separator or monetary decimal separator.|
|E||Number||Yes||Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.|
|;||Subpattern boundary||Yes||Separates positive and negative patterns.|
|%||Prefix or suffix||Yes||Multiply by 100 and show as percentage.|
|‰ (\u2030)||Prefix or suffix||Yes||Multiply by 1000 and show as per mille.|
|¤ (\u00A4)||Prefix or suffix||No||Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.|
|'||Prefix or suffix||No||Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.|
In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
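The symbols in this table are those of Java's `DecimalFormat`, so a few patterns can be tried directly with that class. The following sketch fixes the symbols to `Locale.US` so that the output does not depend on the system locale; the class `NumberPatternDemo` and the helper `format` are invented for this example.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class NumberPatternDemo {
    // Formats a value with the given pattern using US symbols.
    static String format(String pattern, double value) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.US);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format("0.###E0", 1234));   // 1.234E3  (scientific notation)
        System.out.println(format("'#'#", 123));       // #123     (quoted special character)
        System.out.println(format("#%", 0.25));        // 25%      (multiply by 100)
        // The semicolon separates the positive and negative subpatterns.
        System.out.println(format("#,##0.00;(#,##0.00)", -1234.5)); // (1,234.50)
    }
}
```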
Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.
|Letter||Date or Time Component||Presentation||Examples|
|y||Year||Year||1996 or 96|
|M||Month in year||Month||July, Jul, or 07|
|w||Week in year||Number||27|
|W||Week in Month||Number||2|
|D||Day in year||Number||189|
|d||Day in month||Number||10|
|F||Day of week in month||Number||2|
|E||Day in week||Text||Tuesday or Tue|
|H||Hour in day (0-23)||Number||0|
|k||Hour in day (1-24)||Number||24|
|K||Hour in am/pm (0-11)||Number||0|
|h||Hour in am/pm (1-12)||Number||12|
|m||Minute in hour||Number||30|
|s||Second in minute||Number||55|
|z||Time zone||General time zone||Pacific Standard Time, PST, or GMT-08:00|
|Z||Time zone||RFC 822 time zone||-0800|
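The letters in this table correspond to Java's `SimpleDateFormat` patterns. The sketch below renders one fixed sample instant (10 July 1996, 00:30:55 in a GMT-08:00 zone, chosen arbitrarily) with several letters, which makes the difference between the four hour letters visible. The class `DatePatternDemo` and its helpers are invented for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DatePatternDemo {
    // Fixed zone so the output does not depend on the system settings.
    static final TimeZone ZONE = TimeZone.getTimeZone("GMT-08:00");

    static SimpleDateFormat formatter(String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setTimeZone(ZONE);
        return format;
    }

    // Renders the sample instant 1996-07-10 00:30:55 with the given pattern.
    static String render(String pattern) {
        try {
            Date sample = formatter("yyyy-MM-dd HH:mm:ss").parse("1996-07-10 00:30:55");
            return formatter(pattern).format(sample);
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("yyyy / yy"));        // 1996 / 96
        System.out.println(render("MMMM / MMM / MM"));  // July / Jul / 07
        System.out.println(render("D"));                // 192 (day in year)
        // Thirty minutes past midnight under each hour letter.
        System.out.println(render("H k K h"));          // 0 24 0 12
        System.out.println(render("m s Z"));            // 30 55 -0800
    }
}
```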
Additional output fields tab
The Additional output fields tab contains the following options to specify additional information about the file to process.
|Short filename field||Specify the field that contains the filename without path information but with an extension.|
|Extension field||Specify the field that contains the extension of the filename.|
|Path field||Specify the field that contains the path in operating system format.|
|Size field||Specify the field that contains the size of the data.|
|Is hidden field||Specify the field indicating if the file is hidden or not (Boolean).|
|Last modification field||Specify the field indicating the date of the last time the file was modified.|
|Uri field||Specify the field that contains the URI.|
|Root uri field||Specify the URI output field name.|