Pentaho Documentation

Using the Hadoop File Input step on the Spark engine

You can set up the Hadoop File Input step to run on the Spark engine. Spark processes null values differently than the Pentaho engine, so you may need to adjust your transformation to process null values following Spark's processing rules.

General

Enter the following information in the transformation step name field.

  • Step Name: Specifies the unique name of the transformation step on the canvas. The Step Name is set to Hadoop File Input by default.

Options

The Hadoop File Input step features several tabs with fields for setting environments and defining results. Each tab is described below.

File tab

In the File tab, you can specify options for the environment and other details for the file you want to input.

Option | Description
Environment | Select the Hadoop cluster where your file resides. See Connect to a Hadoop cluster with the PDI client for instructions on establishing a connection.
File/Folder | Specify the location and/or name of the text file to read. Click the Ellipsis (…) button to navigate to the source file or folder in the VFS browser. The Spark engine assumes HDFS.
Wildcard (RegExp) | Specify the regular expression you want to use to select the files in the directory specified in the File/Folder field. For example, you may want to process all files that have a .txt extension. See Selecting a file using regular expressions for examples of regular expressions.
Required | Indicate if the file is required.
Include subfolders | Indicate if subdirectories (subfolders) are included.

Accepting file names from a previous step

The Accept filenames from previous steps fields are not used by the Spark engine.

Action buttons

When you have entered information in the File tab fields, select one of the following action buttons:

Button | Description
Show filename(s) | Select to display a list of all files that are loaded based on the current selected file definitions.
Show file content | Select to display the raw content of the selected file.
Show content from first data line | Select to display the content from the first data line for the selected file.

Selecting a file using regular expressions

Use the Wildcard (RegExp) field in the File tab to search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using * and ? wildcards. This table describes several examples of regular expressions.

File Name | Regular Expression | Files Selected
/dirA/ | .*userdata.*\.txt | Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/ | AAA.* | Find all files in /dirB/ with names that start with AAA
/dirC/ | [A-Z][0-9].* | Find all files in /dirC/ with names that start with a capital letter followed by a digit (A0-Z9)
/dirA/ | part-.* | Find all the Spark part files under the directory /dirA/
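
To preview which files an expression would pick up before running the transformation, the patterns can be tried against a list of names. The following is a minimal Python sketch, not part of the product: the function select_files and the sample file names are hypothetical, and it assumes the expression must match the whole file name rather than a fragment of it.

```python
import re


def select_files(names, wildcard):
    """Return the names that the regular expression matches in full."""
    pattern = re.compile(wildcard)
    return [name for name in names if pattern.fullmatch(name)]


# Hypothetical directory listing used only for illustration.
names = [
    "my_userdata_2020.txt",
    "userdata.csv",
    "AAA_report.log",
    "B7_notes.txt",
    "part-00000",
    "_SUCCESS",
]

print(select_files(names, r".*userdata.*\.txt"))  # contains userdata, ends with .txt
print(select_files(names, r"AAA.*"))              # starts with AAA
print(select_files(names, r"[A-Z][0-9].*"))       # capital letter, then a digit
print(select_files(names, r"part-.*"))            # Spark part files
```

Note that the dot must be escaped (\.) to match a literal period; an unescaped dot matches any character.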

Open file

When you select S3 in the Environment field, and then select the Ellipsis (…) button in the File/Folder field, the Open File dialog box appears. The fields in the Open File dialog box are not used by the Spark engine.

Content tab

In the Content tab, you can specify the format of the text files that are being read.

Option | Description
Filetype | Select either CSV or Fixed length. Based on this selection, the PDI client launches a different helper GUI when you click Get Fields in the Fields tab.
Separator | One or more characters that separate the fields in a single line of text. Typically, this is a semicolon ( ; ) or tab.
Enclosure | Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional.
Allow breaks in enclosed fields | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Escape | Specify an escape character (or characters) if the enclosure character can appear inside your data. If you use a backslash ( \ ) as the escape character, the text Not the nine o\'clock news (with a single quote [ ' ] as the enclosure) is parsed as Not the nine o'clock news.
Header & Number of header lines | Select Header if your text file has a header row (the first lines in the file), and set Number of header lines to 1 (one).
Footer & Number of footer lines | These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Wrapped lines & Number of times wrapped | These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Paged layout (printout), Number of lines per page, & Document header lines | These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Compression | This field is either not used by the Spark engine or not implemented for Spark on AEL.
No empty rows | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Include filename in output? | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Filename fieldname | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Rownum in output? | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Rownum fieldname & Rownum by file? | These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Format | Select UNIX.
Encoding & Limit | These fields are either not used by the Spark engine or not implemented for Spark on AEL.
Be lenient when parsing dates? | This field is either not used by the Spark engine or not implemented for Spark on AEL.
The date format Locale | This field is either not used by the Spark engine or not implemented for Spark on AEL.
Add filenames to result | This field is either not used by the Spark engine or not implemented for Spark on AEL.
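
The interplay of Separator, Enclosure, Escape, and header lines can be tried outside the PDI client. The sketch below uses Python's standard csv module as a stand-in parser; it is only an approximation of how the step reads a file, and the function read_delimited and the sample text are hypothetical.

```python
import csv
import io


def read_delimited(text, separator=";", enclosure="'", escape="\\", header_lines=1):
    """Split text into (header rows, data rows) using the given delimiters."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=separator,   # Separator option
        quotechar=enclosure,   # Enclosure option
        escapechar=escape,     # Escape option
    )
    rows = list(reader)
    return rows[:header_lines], rows[header_lines:]


# One header line, then two data lines. The first data line escapes the
# enclosure character; the second has a separator inside an enclosed field.
sample = "id;title\n1;'Not the nine o\\'clock news'\n2;'semi;colon inside'\n"

header, data = read_delimited(sample)
print(header)
print(data)
```

Here the escaped quote is kept as part of the value, and the semicolon inside the enclosure does not split the field.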

Error Handling tab

These fields are not used by the Spark engine.

Filters tab

Pentaho Engine: In the Filters tab, you can specify the lines you want to skip in the text file.

Option | Description
Filter string | The string for which to search.
Filter position | The position where the filter string must be placed in the line. Zero (0) is the first position in the line. If you specify a value below zero, the filter string is searched for in the entire line.
Stop on filter | Enter Y here if you want to stop processing the current text file when the filter string is encountered.
Positive match | Turns the filter into positive mode when turned on: only lines that match this filter are passed. Negative filters take precedence, and lines that match them are discarded immediately.
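
The filter options above can be read as the following line-by-line logic. This is a minimal Python sketch of one plausible interpretation, not the step's actual implementation: the names LineFilter and filter_lines are hypothetical, and it assumes that a line triggering Stop on filter is itself not emitted.

```python
from dataclasses import dataclass


@dataclass
class LineFilter:
    text: str               # Filter string
    position: int = -1      # Filter position; below zero searches the whole line
    stop: bool = False      # Stop on filter
    positive: bool = False  # Positive match

    def matches(self, line):
        if self.position < 0:
            return self.text in line
        return line.startswith(self.text, self.position)


def filter_lines(lines, filters):
    """Return the lines that survive the filters, in order."""
    positives = [f for f in filters if f.positive]
    negatives = [f for f in filters if not f.positive]
    kept = []
    for line in lines:
        # Assumption: a matching stop filter ends processing before the line is emitted.
        if any(f.stop and f.matches(line) for f in filters):
            break
        # Negative filters take precedence: a match discards the line.
        if any(f.matches(line) for f in negatives):
            continue
        # With positive filters present, only matching lines pass.
        if positives and not any(f.matches(line) for f in positives):
            continue
        kept.append(line)
    return kept


lines = ["# comment", "keep A", "skip B", "END", "keep C"]

skip_filters = [
    LineFilter("#", position=0),
    LineFilter("skip"),
    LineFilter("END", position=0, stop=True),
]
print(filter_lines(lines, skip_filters))

positive_filters = [LineFilter("keep", positive=True), LineFilter("C")]
print(filter_lines(lines, positive_filters))
```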

Fields tab

In the Fields tab, you can specify the information about the name and format of the fields being read from the text file.

Option | Description
Name | Name of the field.
Type | Type of the field: String, Date, or Number.
Format | See Number formats for a complete description of format symbols.
Position | The position is needed when processing the Fixed filetype. It is zero-based, so the first character is at position 0.
Length

The value of this field depends on format:

  • Number

    Total number of significant figures in a number.

  • String

    Total length of string.

  • Date

    Total length of printed output of the string. For example, 4 only returns the year.

Precision

The value of this field depends on format:

  • Number

    Number of floating point digits.

  • String, Date, Boolean

    Unused.

Currency | Used to interpret numbers such as $10,000.00 or €5.000,00.
Decimal | A decimal point can be a period (.) as in 10,000.00 or it can be a comma (,) as in 5.000,00.
Group | A grouping can be a comma (,) as in 10,000.00 or a period (.) as in 5.000,00.
Null if | Treat this value as null.
Default | Default value in case the field in the text file was not specified (empty).
Trim type

Trim whitespace from the field value before processing. You can specify one of the following options:

  • None
  • Left
  • Right
  • Both
Repeat | If the corresponding value in this row is empty, repeat the one from the last time it was not empty (Y or N).
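
The Currency, Decimal, and Group settings work together when a numeric field is read. As a rough illustration only (the function parse_number is hypothetical and is not how the step is implemented), the same text can yield very different values depending on which symbols are declared:

```python
from decimal import Decimal


def parse_number(text, decimal=".", group=",", currency=""):
    """Parse a localized number string into a Decimal."""
    cleaned = text.strip()
    if currency:
        cleaned = cleaned.replace(currency, "")  # drop the currency symbol
    if group:
        cleaned = cleaned.replace(group, "")     # drop grouping separators
    cleaned = cleaned.replace(decimal, ".")      # normalize the decimal symbol
    return Decimal(cleaned.strip())


print(parse_number("$10,000.00", decimal=".", group=",", currency="$"))
print(parse_number("€5.000,00", decimal=",", group=".", currency="€"))
```

If the Decimal and Group symbols are swapped by mistake, a value such as 5.000,00 is no longer read as five thousand, so it is worth checking these settings against a sample of the real data.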

Number formats

Use the following table to specify number formats. For further information on valid numeric formats used in this step, view the Number Formatting Table.

Symbol | Location | Localized | Meaning
0 | Number | Yes | Digit.
# | Number | Yes | Digit, zero shows as absent.
. | Number | Yes | Decimal separator or monetary decimal separator.
- | Number | Yes | Minus sign.
, | Number | Yes | Grouping separator.
E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
; | Subpattern boundary | Yes | Separates positive and negative patterns.
% | Prefix or suffix | Yes | Multiply by 100 and show as percentage.
‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille.
¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix, for example, '#'# formats 123 to #123. To create a single quote itself, use two in a row: # o''clock.

Scientific notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation, for example, 0.###E0 formats the number 1234 as 1.234E3.

Date formats

Use the following table to specify date formats. For further information on valid date formats used in this step, view the Date Formatting Table.

Letter | Date or Time Component | Presentation | Examples
G | Era designator | Text | AD
y | Year | Year | 1996 or 96
M | Month in year | Month | July, Jul, or 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day in week | Text | Tuesday or Tue
a | am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time, PST, or GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
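
To see how the letters combine into a full pattern, the sketch below translates a small subset of them into Python strptime directives and parses a sample timestamp. It is an illustration only: the function to_strptime and its mapping table are hypothetical, they cover just the numeric year, month, day, hour, minute, and second letters, and quoted literals or text presentations such as month names are not handled.

```python
import re
from datetime import datetime

# Hypothetical mapping for a few common letter runs; anything else is rejected.
_LETTERS_TO_STRPTIME = {
    "yyyy": "%Y",
    "yy": "%y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}


def to_strptime(pattern):
    """Translate a pattern such as yyyy-MM-dd HH:mm:ss into a strptime format."""
    def replace_run(match):
        run = match.group(0)
        if run not in _LETTERS_TO_STRPTIME:
            raise ValueError(f"unsupported pattern letters: {run!r}")
        return _LETTERS_TO_STRPTIME[run]

    # Each run of one repeated letter (yyyy, MM, ...) is one date or time component.
    return re.sub(r"([A-Za-z])\1*", replace_run, pattern.replace("%", "%%"))


fmt = to_strptime("yyyy-MM-dd HH:mm:ss")
print(fmt)
print(datetime.strptime("1996-07-10 15:08:56", fmt))
```

Letter case matters: MM is the month in year while mm is the minute in hour, and HH is the 24-hour clock while hh is the 12-hour clock.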

Metadata injection support

All fields of this step support metadata injection except for the Hadoop Cluster field. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.