Using the Hadoop File Input step on the Pentaho engine

If you are running your transformation on the Pentaho engine, use the following instructions to set up the Hadoop File Input step.

General

Enter the following information in the transformation step name field.

  • Step Name: Specifies the unique name of the transformation step on the canvas. The Step Name is set to Hadoop File Input by default.

Options

The Hadoop File Input step features several tabs with fields for setting environments and defining results. Each tab is described below.

File tab

In this tab, specify the environment and other details for the file you want to input.

Option | Description
Environment | Indicate the file system or specific cluster on which the item you want to input can be found. Options are:

  • Local

    Specifies that the item entered in the File/Folder field is in a file system that is local to the PDI client (Spoon).

  • <Static>

    Specifies that the item entered in the File/Folder field should use the path name in that field. Use this option if you already know a file path and you want to copy and paste it into the window.

  • S3

    Specifies that the item entered in the File/Folder field is on the S3 file system.

  • <Hadoop Cluster Name>

    Specifies that the item entered in the File/Folder field is in the cluster indicated.

File/Folder | Specify the location and/or name of the text file to read. Click the Ellipsis (...) button to navigate to the source file or folder in the VFS browser.
Wildcard (RegExp) | Specify the regular expression you want to use to select the files in the directory specified in the File/Folder field. For example, you may want to process all files that have a .txt extension. See Selecting a file using regular expressions for examples of regular expressions.
Required | Indicate if the file is required.
Include subfolders | Indicate if subdirectories (subfolders) are included.

Accepting file names from a previous step

The Accept filenames from previous steps section in the File tab provides flexibility in combination with other steps, such as Get File Names. You can specify your file name and pass it to this step. Using this method, the file name can come from any source, such as a text file or database table.

Option | Description
Accept file names from previous steps | Select the check box to get file names from previous steps.
Pass through fields from previous step | Select the check box to get field information from previous steps.
Step to read file names from | Enter the name of the step from which to read the file names.
Field in the input to use as file name | Enter the name of the incoming field that contains the file names. The Hadoop File Input step reads this field to determine which file names to use.

Show action buttons

When you have entered information in the File tab fields, select one of the following action buttons:

Button | Description
Show filename(s) | Select to display a list of all files that are loaded based on the currently selected file definitions.
Show file content | Select to display the raw content of the selected file.
Show content from first data line | Select to display the content from the first data line for the selected file.

Selecting a file using regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

File Name | Regular Expression | Files Selected
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
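The patterns in the table can be tried outside of PDI. The following is a minimal sketch that assumes the wildcard is a Java regular expression that must match the whole file name; the class name, the method name, and the sample file names are hypothetical and are not part of the step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WildcardDemo {
    // Keeps only the names that the regular expression matches in full
    // (assumption: the step matches the complete file name, not a substring).
    public static List<String> select(String regex, List<String> fileNames) {
        Pattern pattern = Pattern.compile(regex);
        return fileNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical directory listing.
        List<String> names = Arrays.asList(
                "userdata_2016.txt", "old_userdata.txt", "userdata.csv",
                "AAA_report.log", "B7_notes.txt");

        System.out.println(select(".*userdata.*\\.txt", names)); // [userdata_2016.txt, old_userdata.txt]
        System.out.println(select("AAA.*", names));              // [AAA_report.log]
        System.out.println(select("[A-Z][0-9].*", names));       // [B7_notes.txt]
    }
}
```

Note that the dot before txt is escaped so that it matches a literal period rather than any character.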

Open file

When you select S3 in the Environment field, and then select the Ellipsis (…) button in the File/Folder field, the Open File dialog box appears. Perform the following steps to open a file.

Procedure

  1. In the Connection section, fill in the following options.

    Option | Description
    Access Key | Enter the user name needed to access the S3 file system. This option only appears if you select S3 in the Environment field in the Hadoop File Input window.
    Secret Key | Enter the password needed to access the S3 file system. This option only appears if you select S3 in the Environment field in the Hadoop File Input window.
    Open from Folder | Indicates the path and name of the directory you want to browse. This directory becomes the active directory.
  2. In the Open from Folder field, navigate to the path and name of the directory you want to browse. This directory becomes the active directory.

  3. Use the following options to view and modify the active directory selected in the Open from Folder field:

    Option | Description
    Up One Level icon | Click this button to display the parent directory of the active directory shown in the Open from Folder field.
    Delete icon | Click this button to delete a folder from the active directory.
    Create Folder icon | Click this button to create a new folder in the active directory.
    Name/Type/Modified | Displays the contents of the active directory, which is the one listed in the Open from Folder field. The file type and last modified date display to the right of the folder or file in the Name list.
    Filter | Apply a filter to the results displayed in the active directory contents.
  4. Click OK to continue, or Cancel to return to the File tab without saving your selections.

Content tab

In the Content tab, you can specify the format of the text files that are being read.

Option | Description
Filetype | Select either CSV or Fixed length. Based on this selection, the PDI client launches a different helper GUI when you click Get Fields in the Fields tab.
Separator | One or more characters that separate the fields in a single line of text. Typically, this is a semicolon ( ; ) or a tab.
Enclosure | Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional.
Allow breaks in enclosed fields | Not implemented.
Escape | Specify an escape character (or characters) if your data contains enclosure or separator characters that must be read literally. For example, with a backslash ( \ ) as the escape character and a single quote ( ' ) as the enclosure, the text Not the nine o\'clock news is parsed as Not the nine o'clock news.
Header & Number of header lines | Select if your text file has a header row (first lines in the file). You can specify the number of header lines.
Footer & Number of footer lines | Select if your text file has a footer row (last lines in the file). You can specify the number of footer lines.
Wrapped lines & Number of times wrapped | Select if you work with data lines that have wrapped beyond a specific page limit. Headers and footers are never considered wrapped.
Paged layout (printout), Number of lines per page, & Document header lines | Use these options as a last resort when working with texts meant for printing on a line printer. Use the number of document header lines to skip introductory texts and the number of lines per page to position the data lines.
Compression | Use this field if your text file is in a ZIP or GZIP archive. Only the first file in the archive is read.
No empty rows | Select if you do not want to send empty rows to the next steps.
Include filename in output? | Select if you want the file name to be part of the output.
Filename fieldname | Enter the name of the field that contains the file name.
Rownum in output? | Select if you want the row number to be part of the output.
Rownum fieldname & Rownum by file? | Enter the name of the field that contains the row number.
Format | Can be either DOS, UNIX, or mixed. UNIX files have lines that are terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.
Encoding & Limit | Specify the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, the PDI client searches your system for available encodings.
Be lenient when parsing dates? | Clear the check box if you want strict parsing of date fields. If selected, dates like Jan 32nd become Feb 1st.
The date format Locale | This locale is used to parse dates that have been written in full, such as February 2nd, 2016. Parsing this date on a system running in the French (fr_FR) locale would not work because February is called Février in that locale.
Add filenames to result | Select to add the file names to the result file names list.
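The effect of the Be lenient when parsing dates? and The date format Locale options can be illustrated with java.text.SimpleDateFormat. This is a sketch under the assumption that the step's date parsing behaves like that class; the DateParsingDemo class and its parse helper are hypothetical names, and the sample dates are simplified versions of the ones in the table.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class DateParsingDemo {
    // Parses text with the given pattern, locale, and leniency, and returns the
    // result as yyyy-MM-dd, or "unparseable" when parsing fails.
    public static String parse(String text, String pattern, Locale locale, boolean lenient) {
        SimpleDateFormat input = new SimpleDateFormat(pattern, locale);
        input.setLenient(lenient);
        try {
            return new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT).format(input.parse(text));
        } catch (ParseException e) {
            return "unparseable";
        }
    }

    public static void main(String[] args) {
        // Lenient parsing rolls day 32 of January over to February 1.
        System.out.println(parse("Jan 32, 2016", "MMM d, yyyy", Locale.ENGLISH, true));  // 2016-02-01
        // Strict parsing rejects the same value.
        System.out.println(parse("Jan 32, 2016", "MMM d, yyyy", Locale.ENGLISH, false)); // unparseable
        // An English month name is not recognized under the French locale.
        System.out.println(parse("February 2, 2016", "MMMM d, yyyy", Locale.FRANCE, true));  // unparseable
        System.out.println(parse("February 2, 2016", "MMMM d, yyyy", Locale.ENGLISH, true)); // 2016-02-02
    }
}
```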

Error Handling tab

In the Error Handling tab, you can specify how the step reacts when errors occur, such as malformed records, bad enclosure strings, wrong number of fields, and premature line ends.

Option | Description
Ignore errors? | Select if you want to ignore errors during parsing.
Skip error lines? | Select if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers where the errors occur. If lines with errors are not skipped, the fields that have parsing errors are empty (null).
Error count field name | Add a field to the output stream rows. This field contains the number of errors on the line.
Error fields field name | Add a field to the output stream rows. This field contains the field names on which an error occurred.
Error fields text field name | Add a field to the output stream rows. This field contains the descriptions of the parsing errors that have occurred.
Warnings file directory | When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>.
Error files directory | When errors occur, they are placed in this directory. The name of the file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>.
Failing line numbers files directory | When a parsing error occurs on a line, the line number is written to a file in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>.

Filters tab

In the Filters tab, you can specify the lines you want to skip in the text file.

Option | Description
Filter string | The string for which to search.
Filter position | The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero, the filter string is searched for in the entire line.
Stop on filter | Enter Y here if you want to stop processing the current text file when the filter string is encountered.
Positive match | When turned on, switches the filter to positive mode, so only lines that match this filter are passed. Negative filters take precedence: lines that match a negative filter are immediately discarded.
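The interaction between the filter string and the filter position can be sketched as follows. This is an illustration of the described rules only, not the step's actual implementation; LineFilterDemo, matches, and skipMatching are hypothetical names, and the sketch covers only a negative (skip) filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineFilterDemo {
    // True when the filter string sits at the given zero-based position,
    // or anywhere in the line when the position is below zero.
    public static boolean matches(String line, String filter, int position) {
        if (position < 0) {
            return line.contains(filter);
        }
        return line.startsWith(filter, position);
    }

    // Returns the lines that do not match the filter, in their original order.
    public static List<String> skipMatching(List<String> lines, String filter, int position) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!matches(line, filter, position)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical file content.
        List<String> lines = Arrays.asList("# generated file", "id;name", "1;Ann # vip", "2;Bob");

        System.out.println(skipMatching(lines, "#", 0));  // [id;name, 1;Ann # vip, 2;Bob]
        System.out.println(skipMatching(lines, "#", -1)); // [id;name, 2;Bob]
    }
}
```

With position 0, only the comment line is skipped; with a negative position, any line containing the filter string is skipped.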

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

Option | Description
Name | Name of the field.
Type | Type of the field. Can be String, Date, or Number.
Format | See Number formats and Date formats for descriptions of the format symbols.
Position | Required when processing the Fixed filetype. The position is zero-based, so the first character is at position 0.
Length | The value of this field depends on the field type:

  • Number

    Total number of significant figures in a number.

  • String

    Total length of string.

  • Date

    Total length of printed output of the string. For example, 4 only returns the year.

Precision | The value of this field depends on the field type:

  • Number

    Number of floating point digits.

  • String, Date, Boolean

    Unused.

Currency | Used to interpret numbers such as $10,000.00 or €5.000,00.
Decimal | A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.
Group | A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
Null if | Treat this value as null.
Default | Default value in case the field in the text file was not specified (empty).
Trim type | Trim the field value before processing. You can specify one of the following options:

  • None
  • Left
  • Right
  • Both
Repeat | If the corresponding value in this row is empty, repeat the value from the last row in which it was not empty (Y or N).
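The Repeat option describes a fill-down of empty values. A minimal sketch of that idea is shown here, assuming that empty means null or a zero-length string; the RepeatDemo class and repeatLast method are hypothetical and do not reflect how the step is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RepeatDemo {
    // Replaces each empty value with the most recent non-empty value above it.
    // Leading empty values stay null because there is nothing to repeat yet.
    public static List<String> repeatLast(List<String> columnValues) {
        List<String> filled = new ArrayList<>();
        String last = null;
        for (String value : columnValues) {
            if (value == null || value.isEmpty()) {
                filled.add(last);
            } else {
                last = value;
                filled.add(value);
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        // Hypothetical column in which a region is written only once per group of rows.
        List<String> region = Arrays.asList("EU", "", "", "US", "");
        System.out.println(repeatLast(region)); // [EU, EU, EU, US, US]
    }
}
```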

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol | Location | Localized | Meaning
0 | Number | Yes | Digit.
# | Number | Yes | Digit, zero shows as absent.
. | Number | Yes | Decimal separator or monetary decimal separator.
- | Number | Yes | Minus sign.
, | Number | Yes | Grouping separator.
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Subpattern boundary | Yes | Separates positive and negative patterns.
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage.
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille.
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.
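These symbols correspond to the pattern syntax of java.text.DecimalFormat, so a pattern can be checked before it is entered in the Format column. The sketch below assumes the step formats numbers the same way that class does; NumberFormatDemo and its format helper are hypothetical names, and the locales stand in for the Decimal and Group settings.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class NumberFormatDemo {
    // Formats a value with the given pattern, taking the decimal and grouping
    // symbols from the given locale.
    public static String format(String pattern, Locale locale, double value) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(locale)).format(value);
    }

    public static void main(String[] args) {
        // Grouping with a comma and decimal with a period.
        System.out.println(format("#,##0.00", Locale.US, 10000));     // 10,000.00
        // The same pattern with a period for grouping and a comma for decimals.
        System.out.println(format("#,##0.00", Locale.GERMANY, 5000)); // 5.000,00
        // The percent sign multiplies by 100.
        System.out.println(format("0%", Locale.US, 0.25));            // 25%
        // A quoted # is printed literally.
        System.out.println(format("'#'#", Locale.US, 123));           // #123
    }
}
```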

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.
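The scientific notation example can be reproduced with java.text.DecimalFormat. This sketch assumes the step's number formatting follows that class; ScientificDemo and its scientific helper are hypothetical names, and the second pattern is an additional example taken from the DecimalFormat documentation rather than from this article.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class ScientificDemo {
    // Formats a value with a scientific-notation pattern using US symbols.
    public static String scientific(String pattern, double value) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US)).format(value);
    }

    public static void main(String[] args) {
        // One integer digit, up to three fraction digits, then the exponent.
        System.out.println(scientific("0.###E0", 1234));      // 1.234E3
        // Up to three integer digits forces the exponent to a multiple of three.
        System.out.println(scientific("##0.#####E0", 12345)); // 12.345E3
    }
}
```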

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
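The letters in this table match the pattern letters of java.text.SimpleDateFormat, so combinations of them can be tried against a known date. The following sketch assumes the step interprets date formats the way that class does; DatePatternDemo, sample, and format are hypothetical names, and the sample timestamp (4 July 2001, 12:08:56 UTC) is chosen only for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DatePatternDemo {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds a fixed sample timestamp: 2001-07-04 12:08:56 UTC.
    public static Date sample() {
        Calendar calendar = new GregorianCalendar(UTC, Locale.US);
        calendar.clear();
        calendar.set(2001, Calendar.JULY, 4, 12, 8, 56);
        return calendar.getTime();
    }

    // Formats the date with the given pattern, using US names and the UTC time zone.
    public static String format(String pattern, Date date) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        formatter.setTimeZone(UTC);
        return formatter.format(date);
    }

    public static void main(String[] args) {
        Date date = sample();
        System.out.println(format("yyyy-MM-dd HH:mm:ss", date)); // 2001-07-04 12:08:56
        System.out.println(format("EEE, MMM d, yyyy", date));    // Wed, Jul 4, 2001
        System.out.println(format("D", date));                   // 185
        System.out.println(format("h:mm a", date));
    }
}
```

Repeating a letter changes the presentation: MMM prints the abbreviated month name, while MM prints the two-digit month number.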

Metadata injection support

All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.