Bulk load into Snowflake

Last updated
Save as PDF

The Bulk load into Snowflake job entry in PDI loads vast amounts of data into a Snowflake virtual warehouse in a single session. For more information about working with Snowflake in PDI, see PDI and Snowflake.

This entry automates Snowflake's COPY INTO command to populate your Snowflake data warehouse with your PDI data, eliminating the need for repetitive SQL scripting. To use the Bulk load into Snowflake job entry, you must size your virtual warehouse, define the source and type of data to load, specify the target data warehouse, then provide any needed parameters.

For more information about using Snowflake, including best practices for bulk loading data, see the Snowflake documentation.

Before you begin

Before using the Bulk load into Snowflake job entry in PDI, you must have the following items:

A Snowflake account.
If you are using Pentaho 8.3, you must have installed the Snowflake plugin. See Install the Snowflake plugin in Pentaho 8.3 for download and installation instructions.
A connection to the database where you want to load your data.
An S3 connection or a VFS connection to a Snowflake staging area for your data source.
A table and schema set up in the database where you want to place your data. On the first use, you will need to create the table.

General

The following field is general to this job entry:

Entry Name: Specify the unique name of the job entry on the canvas. You can customize the name or leave it as the default.

Options

The Bulk load into Snowflake entry includes four tabs to define the data input source, the output database and location, options, and advanced options for loading PDI data into Snowflake. Each tab is described below.

Input tab

Use the options in this tab to define your input source for the Snowflake COPY INTO command:

Option	Description
Source	Choose from one of the following input source types S3 The input source is an S3 bucket. Snowflake Staging Area The input source is files on a Snowflake staging area. Click Select to specify the file, folder, prefix, or variable of the S3 bucket or staging location to use as the input for the Snowflake COPY INTO command. See "Syntax" in the Snowflake documentation for more details specifying this option.
What file type is your source	Select the file type of the input source. You can select one of the following types: Delimited text The input source is character-delimited UTF-8 text. Avro The input source is an Avro data serialization protocol. JSON The input source is a JavaScript Object Notation (JSON) data file containing a set of either objects or arrays. ORC The input source is an Optimized Row Columnar (ORC) file containing Hive data. Parquet The input source is a Parquet file of nested data structures in a flat columnar format. XML The input source is a file in XML format.
Compression	Select the type of compression applied to your input source: None Auto BZIP2 GZIP Deflate Raw deflate Brotli Zstd For Parquet files, the Compression options are: None Auto Snappy

Depending on what file type you selected for What file type is your source?, the following file settings appear at the bottom of this tab:

File Type	File Settings
Delimited text	Specify the following settings for a delimited text file: Leading rows to skip Specify the number of rows to use as an offset from the beginning of the file. This option is useful to skip header lines. Delimiter Specify the character used to separate a data field. Default value is semicolon (;). Quote character Specify the character used to enclose a data field. Default value is double-quotation mark (″). Remove quotes Select one of the following values to indicate whether quotation characters should be removed from a data field during the bulk load: Yes: Remove the quotation characters. No: Retain the quotation characters. Empty as null Select one of the following values to indicate whether empty data values should be set to null during the bulk load: Yes: sets empty data values to null. No: leaves data values as empty. Trim whitespace Select one of the following values to remove trailing and leading whitespace from the data during the bulk load: Yes: Remove the whitespace. No: Retain the whitespace. NoteFor delimited text files, you must have a table in your database with all the columns you need defined.
Avro	No additional settings.
JSON	Ignore UTF8 errors Select one of the following values to ignore UTF8 errors in the data during the bulk load: Yes: Ignore UTF8 errors. No: Do not ignore UTF8 errors. Allow duplicate elements Select one of the following values to allow duplicate elements in the data during the bulk load: Yes: Allow duplicate elements. No: Do not allow duplicate elements. NoteSnowflake only uses the last duplicate value and discards the others. Strip null values NULL values are stored as null in JSON files. Select one of the following values to indicate whether to delete NULL values from the data during the bulk load: Yes: Strip the NULL values. No: Store the NULL values in a variant column. Parse octal numbers Select one of the following values to indicate whether to parse octal numbers during the bulk load: Yes: Parse octal numbers. No: Do not parse octal numbers.
ORC	Additional file settings for ORC files.
Parquet	Additional file settings for Parquet files.
XML	Ignore UTF8 errors Select one of the following values to indicate whether to replace UTF-8 encoding errors during the bulk load: Yes: Replace invalid UTF-8 sequences with Unicode character U+FFFD. No: Invalid UTF-8 sequences produce an encoding error (default). Preserve space Select one of the following values to indicate whether to preserve leading and trailing spaces in element content during the bulk load: Yes: Preserve spaces. No: Do not preserve spaces (default). Strip outer element Select one of the following values to indicate whether to remove the outer XML element, and expose the second level elements as separate documents during the bulk load: Yes: Remove the outer XML element. No: Do not remove the outer XML element (default). Enable Snowflake data Select one of the following values to indicate whether to enable recognition of Snowflake semi-structured data tags from the data during the bulk load: Yes: Enable recognition of Snowflake tags (default). No: Disable recognition of Snowflake tags. Auto convert Select one of the following values to indicate whether to convert numeric and Boolean values from text to native representation during the bulk load: Yes: Convert numeric and Boolean values (default). No: Do not convert numeric and Boolean values.

NoteIf you have unstructured data, you must have a variant column in your database table to store the data for the following file types:

JSON
ORC
Parquet
XML

Output tab

Option	Description
Database connection	Select your database connection from a list of existing Snowflake connections. If you do not have an existing connection, click New. If you need to modify an existing connection, click Edit. See Define Data Connection for instructions. Note: If timeout errors occur, see Snowflake timeout errors to troubleshoot.
Schema	Select the schema to use for the bulk load. The Bulk load into Snowflake entry reads the schemas that exist in the database to populate this list.
Table name	Select the name of the table to bulk load. The Bulk load into Snowflake entry reads the table names from your selected schema to populate this list. NoteYou can only load one table at a time. For multiple tables, use multiple Bulk load into Snowflake entries or parameterized schema and table names to change the table that is loaded.

NoteYou can use variables in the Schema and Table name fields.

Options tab

Use the options in this tab to define how the data is loaded using the Snowflake COPY INTO command:

Option	Description
Truncate table	Select to remove all the data in the table before bulk loading the current data. When the Truncate table option is cleared, the data is appended during a new data load (default).
On error	Select one of the following values to indicate what to do if an error occurs during during the bulk load: Abort: Stops loading data into the warehouse. Skip file: Skips loading the file that caused the error and continues processing. Skip file number: Number of errors that can occur before the file is skipped in the loading process. Skip file percent: Percentage of errors in the record count that can occur before processing stops. Continue: Ignore the errors and continues processing.
Skip file after (x) errors	Enter the number of errors than can occur before the file is skipped in the loading process. This field displays when the Skip file number field is selected.
Skip file after (%) errors	Enter the percentage of errors than can occur before the file is skipped in the loading process. This field displays when the Skip file percent field is selected.

Advanced options tab

Advanced Options tab

Use this tab to configure parameters for the Snowflake COPY INTO command. Any Name/Value pair added as a parameter is passed to Snowflake as a parameter, but the validation of the parameter is the user’s responsibility. See Snowflake’s documentation for further details on these parameters. The Force paramenter here is provided as an example.

Parameter	Description
Force	Specify to force loading of files into a database. True: Loads data to the table even if the data has already been loaded from that file before. This option can potentially duplicate data in a table. False: Ignores staged data files already loaded into the table (default).

Pentaho+ documentation has moved!

The new product documentation portal is here. Check it out now at docs.hitachivantara.com.