Pentaho Documentation

ElasticSearch Bulk Insert

Overview

 

Explains how to use the ElasticSearch Bulk Insert step.

Elastic is a platform of products that search, analyze, and visualize data.  The Elastic platform includes ElasticSearch, a distributed, multi-tenant-capable search and analytics engine built on Lucene.

Description

The ElasticSearch Bulk Insert step sends one or more batches of records to an ElasticSearch server for indexing.  Because you can specify the size of a batch, you can use this step to send one, a few, or many records to ElasticSearch for indexing. 
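
To illustrate how the batch size determines how many records travel together, here is a minimal Python sketch. It is not PDI code and does not describe how the step is implemented; the function name `batches` is hypothetical and exists only for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        # Take the next batch_size records; the final batch may be shorter.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

With five records and a batch size of 2, this yields three batches, the last of which holds a single record. A batch size of 1 yields one record per batch.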

Context

Use this step if you have records that you want to submit to an ElasticSearch server to be indexed. When record data flows out of the ElasticSearch Bulk Insert step, PDI sends it to ElasticSearch along with metadata that you specify, such as the index and type. This step is commonly used when you want to send a batch of data to an ElasticSearch server and create a new index whose documents have a certain type (category). It is also used when you want to add a batch of data to an existing index or category.
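
To make the pairing of record data with index and type metadata concrete, the following Python sketch builds a request body in the newline-delimited format that the ElasticSearch REST bulk API accepts. It is an illustration only: the function name `build_bulk_body` and its parameters are hypothetical, and the sketch does not claim to show how the PDI step talks to the server internally.

```python
import json
from typing import Any, Dict, Iterable, Optional


def build_bulk_body(
    rows: Iterable[Dict[str, Any]],
    index: str,
    doc_type: str,
    id_field: Optional[str] = None,
) -> str:
    """Build a newline-delimited bulk body: one action line, then one document line."""
    lines = []
    for row in rows:
        # Metadata line: tells ElasticSearch where the document goes.
        meta = {"_index": index, "_type": doc_type}
        if id_field is not None and id_field in row:
            meta["_id"] = str(row[id_field])
        lines.append(json.dumps({"index": meta}))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return ("\n".join(lines) + "\n") if lines else ""
```

For example, `build_bulk_body([{"id": 1, "text": "hi"}], "twitter", "tweet", id_field="id")` produces two lines: an action line naming the index `twitter`, the type `tweet`, and the ID `1`, followed by the document itself.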

Because this is an output step, it is often placed at the end of the transformation.

Because ElasticSearch has a REST web interface, you can also use the REST Client step to send data to an ElasticSearch server and to perform other REST functions.
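
As a rough picture of what a REST-based bulk submission looks like outside of PDI, the Python sketch below prepares (but does not send) an HTTP POST to the `_bulk` endpoint. The function name `build_bulk_request` is hypothetical, and the host and port shown in the usage note are assumptions for a local server listening on the default HTTP port. The `application/x-ndjson` content type is what newer ElasticSearch releases expect for bulk bodies; whether your server version requires it is something to confirm in its documentation.

```python
import urllib.request


def build_bulk_request(host: str, port: int, body: str) -> urllib.request.Request:
    """Prepare a POST request for the ElasticSearch _bulk endpoint without sending it."""
    url = f"http://{host}:{port}/_bulk"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),  # newline-delimited action and document lines
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
```

To actually submit the request, pass the returned object to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_bulk_request("localhost", 9200, body))`, once the server is reachable.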

Prerequisites

You need:

  • A working server that has ElasticSearch version 2.2.0 already installed.  You should be able to connect to ElasticSearch from the computer that you are running PDI on.
  • Insert, Update, and Create privileges for the directories on the ElasticSearch server that you need to access.
  • Files or data you want ElasticSearch to index.

Options

This step consists of four tabs: General, Servers, Fields, and Settings.

General Tab

elasticsearch_bulk_insert_general.png

Option Description
Step name Indicates the name given to this step.
Help Displays help documentation.
OK Saves the information you entered, then closes the window.
Cancel Discards the changes you entered, then closes the window.
Index Specifies the name of the index you want to add data to.   If an index with that name doesn't yet exist in ElasticSearch, one is created.
Type Indicates the category the data should be placed in.  You define the category.  In general practice, the type sometimes describes the data. For example, if the index is "twitter" the type might be "tweet."
Test Index Checks whether the index exists in ElasticSearch.  
Batch Size Indicates the number of items in the batch.  (A batch size of one sends each record individually rather than as a bulk insert; any higher number produces a bulk insert.)
Stop on Error Stops processing if there is an error, such as a problem adding a document, a failed bulk push to the index, or JSON that is not well-formed.  If this option is not selected and an error occurs, the affected row is not processed, but the transformation keeps running so that other rows are processed.
Batch Timeout Indicates how long a batch can be processed before it times out and processing ends.
ID Field Indicates the name of the ID field in the file.
Overwrite if exists If the output file exists because this transformation was run before, allows the output to be overwritten.
Output Rows Sends the rows that ElasticSearch processes successfully to the next step (or the output).  If Stop on Error is selected, only the rows that succeeded before the error occurred are sent to the next step (or the output).  Otherwise, all rows that ElasticSearch processed successfully are sent.
ID Output Field Indicates the name of the ID field in the output.  If this is left blank, the value in the ID Field is used instead.
JSON Input Indicates whether the input is a JSON file.
JSON Field Indicates the JSON node from which processing should begin.
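
The interaction between Stop on Error and Output Rows described above can be sketched in Python against the shape of a REST bulk API response, in which `items` lists one result per submitted document in request order. This is an illustration of the documented behavior under that assumption, not the step's actual implementation; the function name `successful_rows` is hypothetical.

```python
from typing import Any, Dict, List


def successful_rows(
    rows: List[Dict[str, Any]],
    bulk_response: Dict[str, Any],
    stop_on_error: bool,
) -> List[Dict[str, Any]]:
    """Return the rows to pass downstream, given per-item bulk results."""
    passed = []
    for row, item in zip(rows, bulk_response["items"]):
        # Each item has a single key naming the action, for example "index".
        result = next(iter(item.values()))
        failed = "error" in result or result.get("status", 200) >= 300
        if failed:
            if stop_on_error:
                break  # stop at the first failure; earlier successes are kept
            continue  # skip the failed row and keep going
        passed.append(row)
    return passed
```

If the second of three rows fails, the function returns only the first row when `stop_on_error` is true, and the first and third rows when it is false.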

Servers Tab

elasticsearch_bulk_insert_servers.png

Option Description
# Number of the server entry.
Address IP address of the server you want to connect to. 
Port Port number for the server you want to connect to.
Test Connection Verifies that the connection can be made to the servers listed in this tab.
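
If you want to check from outside PDI that an address and port from this tab are reachable, a plain TCP connection attempt is a reasonable first test. The Python sketch below does only that: it confirms that something accepts connections on the port, not that the listener is ElasticSearch, and it is not a description of what Test Connection does internally. The function name `can_connect` is hypothetical.

```python
import socket


def can_connect(address: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Call it with the same values you entered in the Address and Port columns, for example `can_connect("127.0.0.1", 9300)`; the address and port here are placeholders.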

Fields Tab

elasticsearch_bulk_insert_fields.png

Option Description
# Number of the field entry.
Name Name from the input.
Target Name Output field name.
Get Fields Retrieves the fields from the input.
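
The Name and Target Name columns amount to a rename applied to each incoming row before it is sent. The Python sketch below shows that idea; it is not PDI code, the function name `apply_field_mapping` is hypothetical, and the choice to drop input fields that are not listed in the mapping is an assumption made for this example.

```python
from typing import Any, Dict


def apply_field_mapping(row: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename input fields to their target names.

    mapping is {input name: target name}. Assumption for this sketch:
    fields missing from the mapping are left out of the result.
    """
    return {target: row[name] for name, target in mapping.items() if name in row}
```

For example, mapping `user_name` to `user` and `msg` to `message` turns the row `{"user_name": "kim", "msg": "hi", "tmp": 1}` into `{"user": "kim", "message": "hi"}`.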

Settings Tab

elasticsearch_bulk_insert_settings.png

Option Description
# Number of the settings entry.
Setting Name of the batch.
Value Value for the batch.

Reference Information

Elastic, which is the company that makes ElasticSearch, has an API as well as user documentation that can give you more background on the fields in this step.