Pentaho Documentation

Text File Input

The Text file input step reads data from a variety of text-file types, including formats generated by spreadsheets and fixed-width flat files. With this step, you can read from a list of files or directories, use wildcards in the form of regular expressions, and accept genericized filenames from previous steps.

AEL considerations

When using the Text file input step with the Adaptive Execution Layer, the following factors affect performance and results:

  • Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
  • If you are using this step to extract data from Amazon Simple Storage Service (S3), browse to the URI of the S3 system or specify the Uri field option in the Additional output fields tab. S3 and S3n are supported.

General

Enter the following information in the transformation step name field:

  • Step name: Specify the unique name of the Text file input step on the canvas. You can customize the name or leave it as the default.

You can use Preview rows to display the rows generated by this step. The Text file input step determines what rows to input based on the information you provide in the option tabs. This preview function helps you to decide if the information provided accurately models the rows you are trying to retrieve.

Options

The Text file input step features several tabs with fields. Each tab is described below.

File tab

Use the File tab to enter the following connection information for your source.

  • File or directory: Specify the source location if the source is not defined in a field. Click Browse to navigate to your source file or directory. Click Add to include the source in the Selected files table. If the source location is defined in a field, use the Accept filenames from previous steps options to specify your file name.
  • Regular expression: Specify a regular expression to match filenames within a specified directory.
  • Exclude regular expression: Specify a regular expression to exclude filenames within a specified directory.

Regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more expressive than the * and ? wildcards. The following table shows several examples of regular expressions.

File Name | Regular Expression | Files Selected
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
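
To experiment with include and exclude patterns before entering them in the step, a small Python sketch can help. The `select_files` helper and the sample file names are hypothetical, and the sketch assumes that a pattern must match the whole file name. Python's `re` module and the Java regular expressions used by PDI agree on simple patterns like these, but they are not identical in general.

```python
import re

def select_files(names, wildcard, exclude=None):
    """Return the names that fully match `wildcard` and do not match `exclude`.

    Assumption: the pattern is tested against the bare file name and must
    match all of it, not just a substring.
    """
    include_re = re.compile(wildcard)
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for name in names:
        if not include_re.fullmatch(name):
            continue
        if exclude_re and exclude_re.fullmatch(name):
            continue
        selected.append(name)
    return selected

# Hypothetical directory listing used only for illustration.
names = [
    "userdata_2024.txt", "my_userdata.txt", "userdata.csv",
    "AAA_report.log", "A1_notes.txt", "a1_notes.txt",
]

print(select_files(names, r".*userdata.*\.txt"))
print(select_files(names, r"AAA.*"))
print(select_files(names, r"[A-Z][0-9].*"))
print(select_files(names, r".*\.txt", exclude=r"my_.*"))
```

The last call mirrors the combination of a wildcard and an exclude wildcard on the same directory.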

Selected files table

The Selected files table shows files or directories to use as source locations for input. This table is populated by clicking Add after you specify a File or directory. The input step tries to connect to the specified file or directory when you click Add to include it in the table.

The table contains the following columns:

  • File/Directory: The source location indicated by clicking Add after specifying it in File or directory.
  • Wildcard (RegExp): Specify a regular expression to match filenames within a specified directory.
  • Exclude wildcard: Specify a regular expression to exclude filenames within a specified directory.
  • Required: Required source location for input.
  • Include subfolders: Whether subfolders are included within the source location.

Click Delete to remove a source from the table. Click Edit to remove a source from the table and return it to the File or directory option.

Accept file names

You can specify your file name and pass it to the input step, which allows the file name to come from any source, such as a text file or database table.

  • Accept filenames from previous step: Select to get file names from previous steps.
  • Pass through fields from previous step: Select to get field information from previous steps.
  • Step to read file names from: Enter the name of the step from which to read the file names.
  • Field in the input to use as filename: Enter the name of the field in the input step to determine which file name to use.

Show action buttons

When you have entered information in the File tab fields, select an action button if you want to look at the source file names or data content.

  • Show filename(s): Select to display the file names of the sources connected to the step.
  • Show file content: Select to display the raw content of the selected file.
  • Show content from first data line: Select to display the content from the first data line for the selected file.

Content tab

Use the following options in the Content tab to specify the format of the source files.

  • Filetype: Select either CSV or Fixed length. Depending on the file type you select, a corresponding interface appears when you click Get Fields in the Fields tab.
  • Separator: Specify the character used to separate the fields in a single line of text, typically a semicolon or tab. Click Insert Tab to place a tab in the Separator field. The default value is semicolon (;).
  • Enclosure: Specify an optional character used to enclose a field if that field contains a separator character. The default value is double quotation marks (").
  • Allow breaks in enclosed fields: Not implemented.
  • Escape: Specify one or more characters to indicate that another character is part of regular text. For example, if a backslash (\) is the escape character and a single quote (') is an enclosure or separator character, then the text Not the nine o\'clock news is parsed as Not the nine o'clock news.
  • Header: Select if your text file has a header row (first lines in the file). You can use Number of header lines to specify the number of times the header line appears.
  • Footer: Select if your text file has a footer row (last lines in the file). You can use Number of footer lines to specify the number of times the footer row appears.
  • Wrapped lines: Select if you work with data lines that have wrapped beyond a specific page limit. You can use Number of times wrapped to specify the number of times the line is wrapped. Headers and footers are never considered wrapped.
  • Paged layout (printout): Select when the other text handling options above fail on a text file designed to be output to a line printer. You can use Document header lines to skip introductory text and Number of lines per page to position the data lines.
  • Compression: Select if your text file is in a ZIP or GZip archive. Only the first file in the archive is read.
  • No empty rows: Select if you do not want to send empty rows to the next steps.
  • Include filename in output: Select if you want the file name to be part of the output, and use Filename fieldname to enter the name of the field that contains the file name.
  • Rownum in output: Select if you want the row number to be part of the output. You can use Rownum fieldname to enter the name of the field that contains the row number. Select Rownum by file if you want the row number to be reset for each file.
  • Format: Select the file format, which can be DOS, UNIX, or mixed. UNIX files have lines terminated by line feeds. DOS files have lines terminated by carriage returns and line feeds. If you specify mixed, no verification is done.
  • Encoding: Select the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, the PDI client searches your system for available encodings.
  • Length: Select the length of the field according to its type: Characters or Bytes.
  • Limit: Specify a limit on the number of records generated from this step. Specify zero (0) for an unlimited number of records.
  • Be lenient when parsing dates?: Clear the check box if you want strict parsing of date fields. If selected, dates like Jan 32nd become Feb 1st.
  • The date format Locale: Specify the locale to use to parse dates written in full, such as February 2nd, 2006. For example, parsing February 2nd, 2006, on a system set to French (fr_FR) would not work because February is called Février in that locale.
  • Add filenames to result: Select to add file names to a resulting list of file names.
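
The interaction between Separator, Enclosure, and Escape is easiest to see on a single line of text. The following is a minimal sketch, not the step's actual parser: `split_line` is a hypothetical helper that assumes single-character settings and does not model wrapped lines, headers, or encodings.

```python
def split_line(line, separator=";", enclosure='"', escape=None):
    """Split one line into fields.

    Assumptions: separator, enclosure, and escape are single characters;
    an escaped character is always taken literally; enclosure characters
    are dropped from the output.
    """
    fields, current, in_enclosure = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        if escape and ch == escape and i + 1 < len(line):
            # The character after the escape is regular text.
            current.append(line[i + 1])
            i += 2
            continue
        if enclosure and ch == enclosure:
            in_enclosure = not in_enclosure
        elif ch == separator and not in_enclosure:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields

# A separator inside an enclosed field does not split the field.
print(split_line('1;"a;b";c'))

# Backslash as escape and single quote as enclosure.
print(split_line("'Not the nine o\\'clock news';x", enclosure="'", escape="\\"))
```

The second call reproduces the Escape example from the option list: the escaped single quote stays in the field instead of closing the enclosure.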

Error Handling tab

In the Error Handling tab, you can specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, wrong number of fields, and premature line ends. The following table contains options for error handling:

  • Ignore errors?: Select if you want to ignore errors during parsing.
  • Skip error files?: Select if you want to skip those files that contain errors. You can generate a file that contains a listing of files where the errors occur. Otherwise, files with errors are not skipped, and the files that have parsing errors are empty (null).
  • Error file field name: Specify an error file name if you want to add field names where errors occurred.
  • File error message field name: Specify an error message field name if you want to add field names where errors occurred in the error file.
  • Skip error lines?: Select if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers where the errors occur. Otherwise, lines with errors are not skipped, and the fields that have parsing errors are empty (null).
  • Error count fieldname: Specify the field name if you want to add a field containing the number of errors on the line to the output rows.
  • Error fields fieldname: Specify the field name if you want to add a field containing the names of fields where errors occurred to the output rows.
  • Error text fieldname: Specify the field name if you want to add a field containing descriptions of the parsing error occurrences to the output rows.
  • Warning files directory: Specify the location of the directory where warnings are placed if they are generated. The name of the resulting file is <warning dir>/filename.<date_time>.<warning extension>.
  • Error files directory: Specify the location of the directory where errors are placed if they occur. The name of the resulting file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
  • Failing line numbers files directory: Specify the location of the directory where parsing errors on a line are placed if they occur. The name of the resulting file is <errorline dir>/filename.<date_time>.<errorline extension>.
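
The line-level options can be pictured with a short sketch. It is an illustration of the behavior described above, not the step's implementation: `parse_row`, the converter mapping, and the output field names `error_count`, `error_fields`, and `error_text` are all hypothetical stand-ins for the names you would enter in the Error count, Error fields, and Error text fieldname options.

```python
def parse_row(raw, converters, skip_error_lines=False):
    """Convert the string values in `raw` using `converters`.

    Fields that fail to parse become None. When skip_error_lines is True,
    a row with any error is dropped (None is returned) instead.
    """
    row, error_fields, error_texts = {}, [], []
    for name, convert in converters.items():
        try:
            row[name] = convert(raw[name])
        except (KeyError, ValueError) as exc:
            row[name] = None  # field with a parsing error is left empty
            error_fields.append(name)
            error_texts.append(f"{name}: {exc}")
    if error_fields and skip_error_lines:
        return None
    row["error_count"] = len(error_fields)
    row["error_fields"] = ",".join(error_fields)
    row["error_text"] = "; ".join(error_texts)
    return row

converters = {"id": int, "amount": float}
print(parse_row({"id": "7", "amount": "abc"}, converters))
print(parse_row({"id": "7", "amount": "abc"}, converters, skip_error_lines=True))
```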

Filters tab

The Filters tab contains a table in which you specify the lines you want to skip in the text file. The table has the following columns:

  • Filter string: The string that you want to search for.
  • Filter position: The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero (0), the filter string is searched for in the entire string.
  • Stop on filter: Enter Y if you want to stop processing the current text file when the filter string is encountered. Enter N to continue processing after encountering the string.
  • Positive match: Enter Y if you want to process lines that match the filter string. Enter N to ignore matching lines.
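
As a rough model of how the four columns combine for a single filter row, consider the sketch below. `apply_filter` is a hypothetical helper. It assumes that a line matching a Stop on filter row is itself not emitted, and it does not model how several filter rows interact.

```python
def apply_filter(lines, filter_string, position=-1,
                 stop_on_filter=False, positive_match=False):
    """Return the lines that would be processed for one filter row."""
    kept = []
    for line in lines:
        if position < 0:
            # Negative position: search the entire line.
            matched = filter_string in line
        else:
            # Zero-based position: the string must start exactly there.
            matched = line.startswith(filter_string, position)
        if matched and stop_on_filter:
            break  # stop processing the current file
        if matched == positive_match:
            kept.append(line)
    return kept

lines = ["# comment", "a;1", "END", "b;2"]

print(apply_filter(lines, "#", position=0))                         # drop comments
print(apply_filter(lines, "END", position=0, stop_on_filter=True))  # stop at END
print(apply_filter(lines, ";", positive_match=True))                # keep only matches
```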

Fields tab

The Fields tab contains a table in which you specify information about the fields being read from the text file. The table has the following columns:

  • Name: The name of the field.
  • Type: The data type of the field.
  • Format: See Number formats for a description of format symbols.
  • Position: The position is needed when processing the Fixed file type. It is zero-based, so the first character is at position 0.
  • Length: The value of this field depends on the field type. For Number, it is the total number of significant figures in the number. For String, it is the total length of the string. For Date, it is the total length of the printed output of the string. For example, 4 returns only the year.
  • Precision: The value of this field depends on the field type. For Number, it is the number of floating-point digits. For String, Date, and Boolean, it is unused.
  • Currency: Used to interpret numbers such as $10,000.00 or €5.000,00.
  • Decimal: A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.
  • Group: A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
  • Null if: Used to set the value to null (empty) if the string is equal to the specified value.
  • Default: Used to specify a default value to use in case the field in the text file was not specified (empty).
  • Trim type: The trimming method to apply to a string, which truncates the field before processing. Trimming only works when no field length is specified. You can specify one of the following options: None, Left, Right, or Both.
  • Repeat: If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).
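
The Position, Trim type, Null if, Default, and Repeat columns can be sketched for a single field as follows. Both helpers are hypothetical, and the order in which the step applies these settings is an assumption made for illustration (trim, then Null if, then Repeat, then Default).

```python
def slice_fixed(line, position, length):
    """Cut a fixed-width field; position is zero-based."""
    return line[position:position + length]

def clean_values(values, trim="both", null_if=None, default=None, repeat=False):
    """Apply trim, null-if, repeat, and default to successive values of one field."""
    trimmers = {
        "none": lambda s: s,
        "left": str.lstrip,
        "right": str.rstrip,
        "both": str.strip,
    }
    out, last = [], None
    for value in values:
        if value is not None:
            value = trimmers[trim](value)
        if null_if is not None and value == null_if:
            value = None
        if value is None or value == "":
            # Empty: repeat the last non-empty value, else fall back to default.
            value = last if (repeat and last is not None) else default
        else:
            last = value
        out.append(value)
    return out

print(slice_fixed("0001Smith     ", 4, 10))
print(clean_values(["  A ", "", "N/A", "B"], null_if="N/A", repeat=True))
print(clean_values(["", "x "], trim="right", default="?"))
```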

Get Fields (button)

Click to retrieve a list of fields from the input stream.

Minimal width (button)

Click to minimize the field length by removing unnecessary characters. If selected, string fields will no longer be padded to their specified length.

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol | Location | Localized | Meaning
0 | Number | Yes | Digit.
# | Number | Yes | Digit, zero shows as absent.
. | Number | Yes | Decimal separator or monetary decimal separator.
- | Number | Yes | Minus sign.
, | Number | Yes | Grouping separator.
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Subpattern boundary | Yes | Separates positive and negative patterns.
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage.
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille.
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.
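
Because the decimal and grouping separators are localized, the same digits can be written as 10,000.00 or 5.000,00. The sketch below shows the idea with a hypothetical `parse_number` helper. It is a simplification that strips the symbols by plain text replacement and does not validate the format pattern.

```python
from decimal import Decimal

def parse_number(text, decimal=".", group=",", currency=""):
    """Parse a localized number string into a Decimal.

    Assumption: `decimal`, `group`, and `currency` are the literal symbols
    configured for the field; no pattern validation is performed.
    """
    cleaned = text.strip()
    if currency:
        cleaned = cleaned.replace(currency, "")
    # Remove grouping first, then normalize the decimal symbol to a period.
    cleaned = cleaned.replace(group, "").replace(decimal, ".")
    return Decimal(cleaned.strip())

print(parse_number("$10,000.00", currency="$"))
print(parse_number("5.000,00", decimal=",", group="."))
```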

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
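
To get a feel for the output of a pattern such as 0.###E0, the following sketch imitates it in Python. `format_scientific` is a hypothetical helper that covers only this one pattern shape (one integer digit, optional fraction digits, minimal exponent). It is not a general implementation of the format symbols, and rounding of exact ties may differ.

```python
def format_scientific(value, max_fraction_digits=3):
    """Imitate a pattern like 0.###E0: one integer digit, optional fraction."""
    mantissa, exponent = f"{value:.{max_fraction_digits}e}".split("e")
    if "." in mantissa:
        # '#' digits are optional, so trailing zeros are dropped.
        mantissa = mantissa.rstrip("0").rstrip(".")
    return f"{mantissa}E{int(exponent)}"

print(format_scientific(1234))  # 1.234E3
print(format_scientific(1200))  # 1.2E3
print(format_scientific(5))     # 5E0
```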

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
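
If you want to check sample values outside the PDI client, a small subset of these letters can be translated to Python's `strptime` directives. The mapping below is a hypothetical convenience that covers only a handful of numeric tokens. It ignores quoted literals, text presentations, and time zones, and Python's parsing rules are not identical to the ones the step uses.

```python
import re
from datetime import datetime

# Partial, hand-picked mapping from pattern tokens to strptime directives.
_DIRECTIVES = {
    "yyyy": "%Y", "yy": "%y", "MM": "%m", "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
}

def to_strptime(pattern):
    """Translate a simple date pattern such as yyyy/MM/dd into a strptime format."""
    parts = []
    # A token is either a run of one repeated letter or a run of non-letters.
    for match in re.finditer(r"([A-Za-z])\1*|[^A-Za-z]+", pattern):
        token = match.group(0)
        if token[0].isalpha():
            if token not in _DIRECTIVES:
                raise ValueError(f"unsupported pattern token: {token}")
            parts.append(_DIRECTIVES[token])
        else:
            parts.append(token.replace("%", "%%"))
    return "".join(parts)

fmt = to_strptime("yyyy/MM/dd HH:mm:ss")
print(fmt)
print(datetime.strptime("2006/02/02 13:30:55", fmt))
```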

Additional output fields tab

The Additional output fields tab contains the following options to specify additional information about the file to process.

  • Short filename field: Specify the field that contains the filename without path information but with an extension.
  • Extension field: Specify the field that contains the extension of the filename.
  • Path field: Specify the field that contains the path in operating system format.
  • Size field: Specify the field that contains the size of the data.
  • Is hidden field: Specify the field indicating if the file is hidden or not (Boolean).
  • Last modification field: Specify the field indicating the date of the last time the file was modified.
  • Uri field: Specify the field that contains the URI.
  • Root uri field: Specify the URI output field name.
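
The kind of values these fields carry can be approximated for a local file with Python's standard library. `describe_file` and its dictionary keys are hypothetical. The sketch assumes that the path field holds the parent directory, that the extension is given without the leading dot, and that a hidden file is one whose name starts with a dot. The root URI is not modeled.

```python
from datetime import datetime
from pathlib import Path

def describe_file(path):
    """Collect file attributes similar to the additional output fields."""
    p = Path(path).resolve()
    stat = p.stat()
    return {
        "short_filename": p.name,                # name with extension, no path
        "extension": p.suffix.lstrip("."),       # assumption: no leading dot
        "path": str(p.parent),                   # assumption: parent directory
        "size": stat.st_size,                    # size in bytes
        "is_hidden": p.name.startswith("."),     # POSIX-style convention
        "last_modified": datetime.fromtimestamp(stat.st_mtime),
        "uri": p.as_uri(),
    }
```

Calling `describe_file` on an existing file returns a dictionary whose `uri` entry starts with `file://`.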

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.