
Bulk load into Amazon Redshift

The Bulk load into Amazon Redshift entry leverages Amazon Redshift's COPY command to automate populating your Amazon Redshift cluster with your PDI data, eliminating the need for repetitive SQL scripting. Because it uses the Redshift COPY command, this entry can take advantage of parallel loading and cloud storage for high-performance processing.
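
Behind the scenes, the entry builds and runs a Redshift COPY statement from the options described below. The following sketch only illustrates the general shape of such a statement; the table, bucket, and IAM role names are placeholders, not values the entry requires:

  COPY sales_staging                                             -- target table in Redshift
  FROM 's3://my-bucket/pdi-output/sales_'                        -- S3 file, folder, or prefix
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'   -- role-based authorization
  CSV                                                            -- delimited text source
  GZIP;                                                          -- compressed input files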

Before you begin

Before using the Bulk load into Amazon Redshift entry in PDI, you must perform the following steps to download the Redshift JDBC driver and copy it into the Pentaho distribution:

Procedure

  1. Download the latest JDBC 4.0 driver from the Download the Amazon Redshift JDBC Driver site.

  2. Copy the downloaded JAR file into the server/pentaho-server/tomcat/lib directory to update the Pentaho Server.

  3. Copy the downloaded JAR file into the design-tools/data-integration/lib directory to update PDI.

  4. Verify that your cluster is configured to support the COPY command.

    See Credentials and Access Permissions for more details.

Results

You can now use the Bulk load into Amazon Redshift entry in your PDI jobs.

General

The following field is general to this job entry:

  • Entry Name: Specify the unique name of the job entry on the canvas. You can customize the name or leave it as the default.

Options

The Bulk load into Amazon Redshift entry includes several tabs to define the data input source, the output database and location, and other Redshift COPY command parameters for loading PDI data into Amazon Redshift. Each tab is described below.

Input tab

Use the options in this tab to define your input source for the Redshift COPY command:

Source: Specify the file, folder, prefix, or variable of the S3 bucket to use as the input for the Redshift COPY command. See Syntax of the COPY from Amazon S3 for more details on how to specify this option.

What file type is your source: Select the file type of the input source. You can select one of the following types:
  • Delimited text

    The input source is character-delimited UTF-8 text.

  • Avro

    The input source is an Avro data serialization protocol.

  • JSON

    The input source is a JavaScript Object Notation (JSON) data file containing a set of either objects or arrays.

  • ORC

    The input source is an Optimized Row Columnar (ORC) file containing Hive data. See Hive for further configuration information when using Hive with Spark on AEL.

  • Parquet

    The input source is a Parquet file of nested data structures in a flat columnar format.

  • Fixed width

    The input source is a file where each column width is a fixed length, rather than columns separated by a delimiter.

Compression: Select the type of compression applied to your input source:
  • None

    No compression is applied.

  • BZIP2

    The Burrows–Wheeler algorithm is used to compress the input source.

  • GZIP

    The DEFLATE algorithm is used as a basis for compressing the input source.

  • LZOP

    The Lempel–Ziv–Oberhumer lossless algorithm is used as a basis for compressing the input source.

Depending on the file type you selected for What file type is your source, the following file settings appear at the bottom of this tab (a sample COPY statement after these settings shows how several of them map to COPY parameters):

Delimited text: Specify the following settings for a delimited text file:
  • Leading rows to skip

    Specify the number of rows to use as an offset from the beginning of the file.

  • Delimiter

    Specify the character used to separate a data field.

  • Quote character

    Specify the character used to enclose a data field.

  • Remove quotes

    Select one of the following values to indicate whether quotation characters should be removed from a data field during the bulk load:

    • Yes: removes the quotation characters.
    • No: retains the quotation characters.
  • Ignore blank lines

    Select one of the following values to indicate whether blank lines should be removed from the data during the bulk load:

    • Yes: removes the blank lines.
    • No: retains the blank lines.
  • Empty as null

    Select one of the following values to indicate whether empty data values should be set to null during the bulk load:

    • Yes: sets empty data values to null.
    • No: leaves data values as empty.

Avro: Select one of the following values for the Empty as null option to indicate whether empty data values should be set to null during the bulk load:
  • Yes: sets empty data values to null.
  • No: leaves data values as empty.

JSON: Select one of the following values for the Empty as null option to indicate whether empty data values should be set to null during the bulk load:
  • Yes: sets empty data values to null.
  • No: leaves data values as empty.

ORC: Additional file settings are handled by the ORC file itself.

Parquet: Additional file settings are handled by the Parquet file itself.

Fixed width: Specify the following settings for a fixed width file:
  • Columns:Width

    Specify the columns and their fixed widths.

  • Leading rows to skip

    Specify the number of rows to use as an offset from the beginning of the file.

  • Ignore blank lines

    Select one of the following values to indicate whether blank lines should be removed from the data during the bulk load:

    • Yes: removes the blank lines.
    • No: retains the blank lines.
  • Empty as null

    Select one of the following values to indicate whether empty data values should be set to null during the bulk load:

    • Yes: sets empty data values to null.
    • No: leaves data values as empty.
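
As a rough reference, the delimited text and fixed width settings above correspond approximately to the following COPY parameters. The table, bucket, and role names are placeholders:

  COPY sales_staging
  FROM 's3://my-bucket/pdi-output/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'
  DELIMITER ','        -- Delimiter
  REMOVEQUOTES         -- Remove quotes: Yes
  IGNOREHEADER 1       -- Leading rows to skip
  IGNOREBLANKLINES     -- Ignore blank lines: Yes
  EMPTYASNULL          -- Empty as null: Yes
  GZIP;                -- Compression: GZIP

A fixed width source would instead use a clause such as FIXEDWIDTH 'col1:10,col2:4' (the Columns:Width setting) in place of DELIMITER.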

Output tab

Use the options in this tab to define how the data is loaded using the Redshift COPY command:

Database connection: Select your database connection from a list of existing Redshift connections.

If you do not have an existing connection, click New. If you need to modify an existing connection, click Edit. The Database Connection dialog box appears.

An exclamation mark appears in the title of the Output tab if a connection cannot be established to the selected database. In this case, click Edit to modify your connection settings.

You must specify your IAM role-based access or credentials in your database connection for Redshift. Within the Database Connection dialog box, select either Role or Credentials. For your S3 role, specify the IAM role in Role Name. For your S3 credentials, specify the Access Key ID and Secret Access Key. A sketch of the corresponding COPY authorization clauses follows these options.

Schema: Select the schema to use for the bulk load.
Table name: Select the name of the table to bulk load. You can only load one table at a time. To load multiple tables, use multiple Bulk load into Amazon Redshift entries or loop over a single entry.
Columns: Preview the columns within your selected Table name.
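
For reference, the Role and Credentials choices correspond roughly to the following COPY authorization clauses. The role ARN and key values shown are placeholders:

  -- Role-based authorization (placeholder ARN):
  COPY my_schema.my_table
  FROM 's3://my-bucket/pdi-output/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole';

  -- Key-based authorization (placeholder keys):
  COPY my_schema.my_table
  FROM 's3://my-bucket/pdi-output/'
  ACCESS_KEY_ID '<access-key-id>'
  SECRET_ACCESS_KEY '<secret-access-key>';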

Options tab

Use the option in this tab to define how tabular data is loaded using the Redshift COPY command:

Truncate table: Select to remove all the data in the table before bulk loading the current data. When the Truncate table option is cleared, the data is appended during a new data load. The default is cleared.
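
In effect, selecting Truncate table is comparable to issuing a TRUNCATE before the COPY, while leaving it cleared appends the new rows. The table, bucket, and role names below are placeholders:

  -- Truncate table selected: remove existing rows, then load.
  TRUNCATE my_schema.my_table;
  COPY my_schema.my_table
  FROM 's3://my-bucket/pdi-output/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole';

  -- Truncate table cleared: the COPY simply appends to the existing rows.
  COPY my_schema.my_table
  FROM 's3://my-bucket/pdi-output/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole';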

Parameters tab

Use this tab to configure additional parameters for the Redshift COPY command. See Redshift's COPY Parameter Reference for further details on these parameters.
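
For example, parameters such as MAXERROR, DATEFORMAT, or TIMEFORMAT from that reference could be supplied here. The following sketch shows how such parameters might appear in the resulting statement; the table, bucket, and role names are again placeholders:

  COPY my_schema.my_table
  FROM 's3://my-bucket/pdi-output/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftCopyRole'
  MAXERROR 10               -- tolerate up to 10 rejected rows before failing
  DATEFORMAT 'YYYY-MM-DD'   -- parse DATE columns with this format
  TIMEFORMAT 'auto';        -- let Redshift infer timestamp formats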