Hitachi Vantara Lumada and Pentaho Documentation

Hadoop File Output

The Hadoop File Output step exports data to text files stored on a Hadoop cluster. It is commonly used to generate comma-separated values (CSV) files that are easily read by spreadsheet applications. You can also generate fixed-width files by setting lengths on the fields in the Fields tab.

AEL considerations

When using the Hadoop File Output step with the Adaptive Execution Layer, the following factors affect performance and results:

  • Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
  • The Accept file name from field? option cannot be used with Spark on AEL.

General

Enter the following information in the transformation step name field.

  • Step Name: Specifies the unique name of the Hadoop File Output step on the canvas. You can customize the name or leave it as the default.

Options

The Hadoop File Output transformation step features several tabs with fields. Each tab is described below.

File tab

The File tab contains the following options that define the basic properties for the file being created:

  • Hadoop Cluster: Specifies which Hadoop cluster configuration to use. You can specify information such as host names and ports for HDFS, Job Tracker, and other big data cluster components through the Hadoop Cluster configuration dialog box. Click Edit to edit an existing cluster configuration, or click New to create a new configuration. Once created, Hadoop cluster configuration settings can be reused by other transformation steps and job entries. See Connect to a Hadoop cluster with the PDI client for more details on the configuration settings.
  • Folder/File: Specifies the location and/or name of the output text file written to the Hadoop cluster. Click Browse to display and enter the file details using the virtual file system browser in PDI.
  • Create Parent Folder: Indicates a parent folder should be created for the output text file.
  • Do not create file at start: Avoids empty files when no rows are processed.
  • Accept file name from field?: Indicates you want to specify the file name(s) in a field in the input stream.

    This setting can be fine-tuned with the kettle.properties file. See Improving performance when writing multiple files.

  • File name field: Specifies the field that contains the file name(s) in the input stream during runtime.
  • Extension: Adds an extension to the end of the file name. The default is .txt.
  • Include stepnr in filename: Includes the copy number in the file name (_0 for example) when you run the step in multiple copies (launching several copies of a step).
  • Include partition nr in file name?: Includes the data partition number in the file name.
  • Include date in file name: Includes the system date in the file name (_20181231 for example).
  • Include time in file name: Includes the system time in the file name (_235959 for example).
  • Specify Date time format: Indicates you want to specify the date time format from the list in the Date time format drop-down list.
  • Date time format: Specifies date time formats.
  • Show file name(s): Displays a list of the files generated. The list is a simulation and depends on the number of rows that go into each file.
  • Add filenames to result: Adds the file name to the internal file result set.
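
To make the naming options above more concrete, the following Python sketch composes a file name from a base name, an optional date, time, copy number, and partition number, and an extension. It is an illustration only, not PDI code: the function build_file_name, its parameters, and the order in which the suffixes are appended are assumptions. Only the suffix shapes (_0, _20181231, _235959) and the default .txt extension come from the option descriptions.

```python
from datetime import datetime


def build_file_name(base, extension="txt", copy_nr=None, partition_nr=None,
                    include_date=False, include_time=False, when=None):
    """Compose an output file name from naming options (hypothetical helper).

    The suffix order used here (date, time, copy number, partition number)
    is an assumption made for illustration.
    """
    if when is None:
        when = datetime.now()  # stands in for the "system date/time"
    parts = [base]
    if include_date:
        parts.append(when.strftime("%Y%m%d"))  # date suffix, 8 digits
    if include_time:
        parts.append(when.strftime("%H%M%S"))  # time suffix, 6 digits
    if copy_nr is not None:
        parts.append(str(copy_nr))             # step copy number
    if partition_nr is not None:
        parts.append(str(partition_nr))        # data partition number
    name = "_".join(parts)
    if extension:
        name += "." + extension
    return name


stamp = datetime(2018, 12, 31, 23, 59, 59)
print(build_file_name("sales", copy_nr=0, include_date=True,
                      include_time=True, when=stamp))
```

Use Show file name(s) in the step itself to see the names PDI will actually generate for a given configuration.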

Content tab

The Content tab contains the following options for describing the content written to the output text file:

  • Append: Appends lines to the end of the specified file.
  • Separator: Specifies the character that separates the fields in a single line of text. Typically, it is a semicolon (;) or a tab. Click Insert TAB to place a tab in the Separator field.
  • Enclosure: Encloses fields with a pair of specified strings, which allows separator characters to appear inside fields. This setting is optional and can be left blank.
  • Force the enclosure around fields?: Forces all field names to be enclosed with the character specified in the Enclosure property.
  • Header: Indicates the output text file has a header row (first line in the file).
  • Footer: Indicates the output text file has a footer row (last line in the file).
  • Format: Specifies the type of formatting to use. It can be either DOS or UNIX. UNIX files have lines separated by line feeds, while DOS files have lines separated by carriage returns and line feeds.
  • Compression: Specifies the type of compression (ZIP or GZIP) to use when compressing the output. Only one file is placed in a single archive.
  • Encoding: Specifies the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, PDI searches your system for available encodings.
  • Right pad fields: Adds spaces to the end of the fields (or removes characters at the end) until the length specified for the field in the Fields tab is reached.
  • Fast data dump (no formatting): Improves the performance when dumping large amounts of data to a text file by not including any formatting information.
  • Split every ... rows: If the number N is larger than zero, splits the output text file into multiple parts of N rows.
  • Add Ending line of file: Specifies an alternate ending row to the output file.
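
The effect of the Separator, Enclosure, Force the enclosure around fields?, Format, and Split every ... rows options can be sketched in a few lines of Python. This is a simplified model, not the step's implementation: the helper names are hypothetical, and it assumes that an unforced enclosure is applied only to fields that contain the separator. It does not handle enclosure characters that appear inside a field.

```python
# Line terminators for the two Format choices described above.
LINE_ENDINGS = {"DOS": "\r\n", "UNIX": "\n"}


def format_field(value, separator=";", enclosure='"', force_enclosure=False):
    """Render one field, enclosing it when forced or when it contains the separator."""
    text = "" if value is None else str(value)
    if enclosure and (force_enclosure or separator in text):
        return f"{enclosure}{text}{enclosure}"
    return text


def format_row(values, separator=";", enclosure='"',
               force_enclosure=False, fmt="UNIX"):
    """Join the rendered fields with the separator and add the line ending."""
    fields = [format_field(v, separator, enclosure, force_enclosure)
              for v in values]
    return separator.join(fields) + LINE_ENDINGS[fmt]


def split_rows(rows, n):
    """Group rows into parts of at most n rows; n <= 0 means a single part."""
    rows = list(rows)
    if n <= 0:
        return [rows]
    return [rows[i:i + n] for i in range(0, len(rows), n)]


# A field containing the separator is enclosed; the numeric field is not.
print(repr(format_row(["Smith; John", 42])))
# Forced enclosure with DOS line endings.
print(repr(format_row(["a", "b"], force_enclosure=True, fmt="DOS")))
# Five rows split into parts of two rows each.
print(split_rows(range(5), 2))
```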

Fields tab

The Fields tab is where you define properties for the fields being exported. The following list describes each field property:

  • Name: The name of the field.
  • Type: The type of the field, which can be String, Date, or Number.
  • Format: An optional mask for converting the format of the original field.
  • Length: The length of the field, which depends on the field type:
      • Number: Total number of significant figures in the number.
      • String: Total length of the string.
      • Date: Length of the printed output of the string (for example, 4 is the length of a year).
  • Precision: Number of floating point digits for number-type fields.
  • Currency: Symbol used to represent currencies ($5,000.00 or €5.000,00 for example).
  • Decimal: The decimal symbol, which can be a period (.) or a comma (,) (5,000.00 or 5.000,00 for example).
  • Group: The grouping symbol, which can be a comma (,) or a period (.) (5,000.00 or 5.000,00 for example).
  • Trim Type: The trimming method to apply to a string. Trimming only works when no field length is specified.
  • Null: If the value of the field is null, the specified string is inserted into the output text file.
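
As a rough illustration of how Length, Decimal, Group, Precision, and Null shape the written text, the Python sketch below pads or cuts a value to a fixed length, substitutes a string for null values, and swaps the decimal and grouping symbols. The function names are hypothetical and the logic is a simplified model rather than PDI's formatting engine, which works from the Format mask.

```python
def fit_to_length(text, length):
    """Pad with trailing spaces, or cut characters at the end, to reach `length`."""
    if length is None or length <= 0:
        return text  # no length set: leave the text unchanged
    return text[:length].ljust(length)


def format_number(value, precision=2, decimal=".", group=","):
    """Format a number with the chosen decimal and grouping symbols."""
    plain = f"{value:,.{precision}f}"  # Python default style, e.g. 5,000.00
    # Swap symbols through a placeholder so the two replacements do not collide.
    return (plain.replace(",", "\0")
                 .replace(".", decimal)
                 .replace("\0", group))


def render_field(value, null_string="", length=None):
    """Replace a null value with `null_string`, then apply the field length."""
    text = null_string if value is None else str(value)
    return fit_to_length(text, length)


print(format_number(5000))                          # period decimal, comma group
print(format_number(5000, decimal=",", group="."))  # comma decimal, period group
print(repr(render_field("Hadoop", length=10)))      # padded to 10 characters
print(repr(render_field(None, null_string="NULL", length=6)))
```

Padding every field to its configured length in this way is what produces a fixed-width file.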

Metadata injection support

All fields of this step support metadata injection. You can use this step with ETL metadata injection to pass metadata to your transformation at runtime.