This step allows you to write data to an HBase table according to user-defined column metadata.
When using the HBase Output step with the Adaptive Execution Layer (AEL), the following factors affect performance and results:
- Spark processes null values differently than the Pentaho engine. You will need to adjust your transformation to successfully process null values according to Spark's processing rules.
- Metadata injection is not supported for steps running on AEL.
Enter the following information in the transformation step name field.
- Step Name: Specifies the unique name of the transformation step on the canvas. The Step Name is set to 'HBase Output' by default.
The HBase Output step features two tabs with fields. Each tab is described below.
Configure Connection Tab
This tab contains HBase connection information. You can configure a connection in one of two ways:
- Using the Hadoop cluster properties, or
- Using an hbase-site.xml file and, optionally, an hbase-default.xml configuration file.
Below the connection details are fields to specify which target HBase table to write to, along with a mapping by which to encode incoming field values.
This tab includes the following fields:
|Step name||The name of this step as it appears in the transformation workspace.|
|Hadoop cluster||Click the Hadoop Cluster drop-down menu to select an existing Hadoop cluster configuration.|
|URL to hbase-site.xml||Address of the hbase-site.xml file.|
|URL to hbase-default.xml||Address of the hbase-default.xml file.|
|HBase table name||The target HBase table you want to write data into. Click Get table names to populate the drop-down list of possible table names.|
|Mapping name||A mapping to decode and interpret column values. Click Get mappings for the specified table to populate the drop-down list of available mappings.|
|Store mapping info in step meta||Specifies whether to store mapping information in the step's metadata instead of loading it from HBase when the step runs.|
|Delete rows by mapping key||Select to instruct HBase to delete rows using the row key on the mapped input field.|
|Disable write to WAL||Disables writing to the Write Ahead Log (WAL). The WAL is used as a failsafe to restore the status quo if the server goes down while data is being inserted. Disabling the WAL increases performance at the cost of that protection. Not available when Delete rows by mapping key is selected.|
|Size of write buffer (bytes)||The size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. If you leave this field empty, the default value specified in the hbase-default.xml file is used, which is 2 MB (2097152 bytes).|
Create/Edit Mappings Tab
This tab creates or edits a mapping for a given HBase table. A mapping defines metadata about the values that are stored in the table. Since most information is stored as raw bytes in HBase, mapping allows PDI to decode values and execute meaningful comparisons for column-based result set filtering.
Before a value can be written to HBase, you must tell the step which column family the value belongs to and what its type is. You must also specify type information about the key of the table.
The names of fields entering the step must match the aliases of fields defined in the mapping. All incoming fields must have a matching counterpart in the mapping. There may be fewer incoming fields than defined in the mapping, but if there are more incoming fields, then an error will occur. One of the incoming fields must match the key defined in the mapping.
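The field-matching rules above can be sketched as a small check. This is only an illustration of the stated rules, not the step's actual validation code; the function name `check_incoming_fields` and the sample field names are hypothetical.

```python
def check_incoming_fields(incoming, mapping_aliases, key_alias):
    """Return a list of problems; an empty list means the fields are acceptable."""
    problems = []
    aliases = set(mapping_aliases)
    # Every incoming field needs a matching alias in the mapping.
    unmapped = [name for name in incoming if name not in aliases]
    if unmapped:
        problems.append("fields not in mapping: " + ", ".join(unmapped))
    # One of the incoming fields must supply the table key.
    if key_alias not in incoming:
        problems.append("missing key field: " + key_alias)
    return problems


mapping = ["id", "name", "city"]   # aliases defined in a hypothetical mapping
print(check_incoming_fields(["id", "name"], mapping, "id"))  # fewer fields than the mapping: []
print(check_incoming_fields(["id", "zip"], mapping, "id"))   # ['fields not in mapping: zip']
print(check_incoming_fields(["name"], mapping, "id"))        # ['missing key field: id']
```

Note that a subset of the mapped fields is accepted as long as it includes the key, while any field outside the mapping is reported.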
This tab works in much the same way as the corresponding tab in the HBase Input step, except that the HBase Output step allows the target HBase table to be created if it doesn't already exist. Furthermore, the fields coming into the step can be used to define a mapping.
Select a table to populate the Mapping name drop-down box with the names of any mappings that exist for the table. If there are no mappings defined for the selected table, enter the name of a new mapping.
Enter information about the columns in the HBase table that you want to map. Selecting the name of an existing mapping will load the fields defined in that mapping into the fields area of the display.
Alternatively, you can create a new HBase table and a mapping for it at the same time by configuring the fields of the mapping and entering the name of a table that doesn't exist in the HBase table name drop-down box.
This tab includes the following fields:
|HBase table name||Displays a list of table names. Connection information in the previous tab must be valid and complete for this drop-down list to populate. See the Note in Performance Considerations for more options.|
|Mapping name||Names of any mappings that exist for the table. This box is empty when there are no mappings defined for the selected table. You can define multiple mappings on the same HBase table using different subsets of columns.|
|#||The order of the mapping operation.|
|Alias||The name you want to assign to the HBase table key. This is required for the table key column, but optional for non-key columns.|
|Key||Indicates whether or not the field is the table's key.|
|Column family||The column family in the HBase source table that the field belongs to. Non-key columns must specify a column family and column name.|
|Column name||The name of the column in the HBase table.|
|Type||Data type of the column. When Key is set to 'Y', the drop-down list displays the key column types. When Key is set to 'N', the drop-down list displays the non-key column types.|
|Indexed values||Enter comma-separated data in this field to define values for string columns.|
|Get incoming fields (button)||Populates the mapping table with information from the fields entering the step.|
|Create a tuple template (button)||Select to create a mapping template to write tuples to HBase.|
|Save mapping (button)||Saves the mapping. If there is any missing information in the mapping definition, you will be prompted to correct the mapping definition before the mapping is saved.|
|Delete mapping (button)||Deletes the current named mapping in the current named table from the mapping table. Note that this does not delete the actual HBase table.|
A valid mapping must define metadata for the key of the HBase table. The key must have an Alias specified because there is no name given to the key of an HBase table. Non-key columns must specify the Column family that they belong to and the Column name. An Alias is optional for non-key columns. If it is not supplied, the column name is used. All fields must have type information supplied.
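As a rough sketch of the rules for a valid mapping, the following function checks a list of mapping rows and fills in missing aliases. The row layout (dictionaries with `alias`, `key`, `family`, `column`, and `type` entries) and the function name are assumptions made for illustration; they do not reflect how PDI stores mappings internally.

```python
def normalize_mapping(fields):
    """Validate mapping rows and fill in default aliases for non-key columns."""
    if not any(f.get("key") for f in fields):
        raise ValueError("mapping must define the table key")
    result = []
    for f in fields:
        row = dict(f)
        # All fields must have type information.
        if not row.get("type"):
            raise ValueError("type is required for every field")
        if row.get("key"):
            # The HBase key has no name of its own, so an alias is required.
            if not row.get("alias"):
                raise ValueError("the key needs an alias")
        else:
            if not row.get("family") or not row.get("column"):
                raise ValueError("non-key fields need a column family and column name")
            # The alias is optional for non-key columns: fall back to the column name.
            row["alias"] = row.get("alias") or row["column"]
        result.append(row)
    return result


rows = normalize_mapping([
    {"alias": "id", "key": True, "type": "Long"},
    {"key": False, "family": "Family1", "column": "name", "type": "String"},
])
print(rows[1]["alias"])  # name
```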
For keys to sort properly in HBase, you must note the distinction between signed and unsigned numbers. Because of the way that HBase stores integer and long data internally, the sign bit must be flipped before storing the signed number so that positive numbers will sort after negative numbers. Unsigned integer and unsigned long data can be stored directly without inverting the sign.
- String columns may optionally have a set of legal values defined for them by entering comma-separated data into the Indexed values column in the fields table.
- Date keys can be stored as either signed or unsigned long data types, with epoch-based timestamps. If you have a date key mapped as a 'String' type, PDI can change the type to 'Date' for manipulation in the transformation. No distinction is made between signed and unsigned numbers for the Date type because HBase only sorts on the key.
- Boolean values may be stored in HBase as 0/1 integer/long or as strings (Y/N, yes/no, true/false, T/F).
- BigNumber may be stored as either a serialized BigDecimal object or in string form (that is, a string that can be parsed by BigDecimal's constructor).
- Serializable is any serialized Java object.
- Binary is a raw array of bytes.
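The sign-bit handling described above for signed integer and long keys can be illustrated with a short sketch. HBase orders row keys by comparing their bytes, so a plain two's-complement encoding places negative numbers after positive ones. Flipping the sign bit before storing restores numeric order. The function names are hypothetical, and the 8-byte big-endian layout is an assumption chosen for the example, not a statement of the exact bytes PDI writes.

```python
import struct

SIGN_BIT = 0x8000000000000000
MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_signed_long(value):
    """Encode a signed 64-bit integer as 8 big-endian bytes with the sign bit flipped."""
    flipped = (value & MASK_64) ^ SIGN_BIT
    return struct.pack(">Q", flipped)


def decode_signed_long(data):
    """Reverse encode_signed_long."""
    (flipped,) = struct.unpack(">Q", data)
    raw = flipped ^ SIGN_BIT
    # Reinterpret the unsigned 64-bit pattern as a signed value.
    return raw - (1 << 64) if raw >= (1 << 63) else raw


values = [5, -3, 0, -100, 42]

# Byte-wise order of plain two's-complement encodings: negatives sort last.
print(sorted(values, key=lambda v: struct.pack(">q", v)))  # [0, 5, 42, -100, -3]

# Byte-wise order after flipping the sign bit: matches numeric order.
print(sorted(values, key=encode_signed_long))              # [-100, -3, 0, 5, 42]
```

Unsigned values need no such adjustment because their big-endian bytes already sort in numeric order.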
To speed up the creation of a mapping, you can use the fields entering the step as the basis for the mapping. Click Get incoming fields to populate the mapping table with information from the incoming fields. The Alias and Column name of each mapping field are set to the name of an incoming field, and the type information is filled in automatically. The Column family is set to the name of the first column family defined in the table if the table already exists, or to a default value ("Family1") otherwise. You can change this default to define your own column families when the target table is created.
The step does not support adding new column families to an existing table.
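The way Get incoming fields fills in the mapping table can be approximated as follows. This is a sketch of the described behavior only; the function name, the row layout, and the sample field and family names are hypothetical, and it does not cover how the key field is chosen.

```python
def mapping_from_incoming(fields, existing_families=None, default_family="Family1"):
    """Build mapping rows from (name, type) pairs of incoming fields."""
    # Use the first column family of an existing table, otherwise the default.
    family = existing_families[0] if existing_families else default_family
    return [
        {"alias": name, "column": name, "family": family, "type": type_name}
        for name, type_name in fields
    ]


incoming = [("name", "String"), ("age", "Integer")]

# Table does not exist yet: the default family is proposed.
print(mapping_from_incoming(incoming)[0]["family"])                   # Family1

# Table already exists with families "info" and "stats": the first one is used.
print(mapping_from_incoming(incoming, ["info", "stats"])[0]["family"])  # info
```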
Important: The names of fields entering the step are expected to match the aliases of fields defined in the mapping. All incoming fields must have a matching counterpart in the mapping. There may be fewer incoming fields than defined in the mapping but if there are more incoming fields then an error will be raised. Furthermore, one of the incoming fields must match the key defined in the mapping.
The HBase Output step's Configure connection tab provides a field for setting the size of the write buffer used to transfer data to HBase. A larger buffer consumes more memory (on both the client and server), but results in fewer remote procedure calls. The default (defined in the hbase-default.xml file) is 2MB. When left blank, the buffer is 2MB, auto flush is enabled, and Put operations are executed immediately. This means that each row will be transmitted to HBase as soon as it arrives at the step. Entering a number (even if it is the same as the default) for the size of the write buffer will disable auto flush and will result in incoming rows only being transferred once the buffer is full.
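To make the tradeoff concrete, here is a toy model of the buffering behavior described above. It is not PDI or HBase client code; the class name and the row sizes are invented for the example, and each flush is counted as one remote procedure call for simplicity.

```python
class WriteBufferModel:
    """Toy model: rows accumulate until the buffer is full, then are sent together."""

    def __init__(self, buffer_size=None):
        # None models an empty "Size of write buffer" field: auto flush stays on.
        self.buffer_size = buffer_size
        self.pending = []
        self.pending_bytes = 0
        self.rpc_calls = 0

    def put(self, row_bytes):
        self.pending.append(row_bytes)
        self.pending_bytes += len(row_bytes)
        if self.buffer_size is None or self.pending_bytes >= self.buffer_size:
            self.flush()

    def flush(self):
        # Send everything that is waiting, if anything.
        if self.pending:
            self.rpc_calls += 1
            self.pending = []
            self.pending_bytes = 0


rows = [b"x" * 100] * 50            # 50 rows of 100 bytes each

auto = WriteBufferModel()           # field left blank: each row is sent on arrival
for r in rows:
    auto.put(r)

buffered = WriteBufferModel(buffer_size=1000)
for r in rows:
    buffered.put(r)
buffered.flush()                    # send whatever remains at the end

print(auto.rpc_calls, buffered.rpc_calls)  # 50 5
```

In this model, an explicit buffer size cuts the number of calls by a factor of ten, at the cost of holding up to one buffer of rows in memory before they are sent.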
There is also a checkbox for disabling writing to the Write Ahead Log (WAL). The WAL is used as a lifeline to restore the status quo if the server goes down while data is being inserted. The tradeoff is speed: writing to the WAL provides error recovery but slows down inserts, so disabling it makes writes faster at the cost of that protection.
The Create/edit mappings tab has options for creating new tables. In the HBase table name field, you can suffix the name of the new table with parameters for specifying what kind of compression to use, and whether or not to use Bloom filters to speed up lookups. The options for compression are: NONE, GZ and LZO; the options for Bloom filters are: NONE, ROW, ROWCOL. If nothing is selected (or only the name of the new table is defined), then the default of NONE is used for both compression and Bloom filters. For example, the following string entered in the HBase table name field specifies that a new table called "NewTable" should be created with GZ compression and ROWCOL Bloom filters:
Important: Due to licensing constraints, HBase does not ship with LZO compression libraries; these must be manually installed on each node if you want to use LZO compression.