Skip to main content
Pentaho Documentation

Cassandra Input

The Cassandra Input step reads in data from a column family (table) of an Apache Cassandra database using CQL (Cassandra Query Language).

Options

Cassandra Input Properties Dialog Box.png

The following options are available for the Cassandra Input transformation step:

Option Description

Step name

Specifies the unique name of the Cassandra Input step on the canvas. You can customize the name or leave it as the default.

Cassandra host

Specifies the host name for the connection to the Cassandra server.

Cassandra port

Specifies the port number for the connection to the Cassandra server.

Socket timeout

Sets an optional connection timeout period, specified in milliseconds.

Transport max length

Sets an optional maximum object size that can be sent.

Username

Specifies the username of the target keyspace and/or family (table) authentication details.

Password

Specifies the password of the target keyspace and/or family (table) authentication details.

Keyspace

Specifies the keyspace (database) name.

Output <key, column, timestamp> tuples

Includes the tuple outputs for key, column and time stamp the data retrieved from the column family.

Use CQL version 3

Queries with CQL version 3.

Use Thrift I/O

Uses Thrift I/O.

Use query compression

Compresses the text of the CQL query before sending it to the server.

Execute query for each incoming row

Executes a query for each input row.

Show schema

Opens a dialog box that shows metadata for the column family named in the CQL SELECT query.

Preview

Displays the rows generated by this step.

CQL SELECT Query

Because Cassandra is a sparse column oriented database similar to HBase, it is possible for rows to contain varying numbers of columns which might or might not be defined in the metadata for the column family. The Cassandra Input step can emit columns that are not defined in the metadata for the column family in question if they are explicitly named in the SELECT clause. Cassandra Input uses type information present in the metadata for a column family. This, at a minimum, includes a default type (column validator) for the column family. If there is explicit metadata for individual columns available, then this is used for type information, otherwise the default validator is used.

Important: Cassandra Input does not support the CQL range notation, for instance name1..nameN, for specifying columns in a SELECT query.

You can enter your CQL SELECT statement for querying the column family in the large text box at the bottom of the dialog box. Only a single SELECT query is accepted by the step. The following example query shows the possible format of the statement:

SELECT [FIRST N] [REVERSED] <SELECT EXPR> FROM <COLUMN FAMILY> [USING <CONSISTENCY>] [WHERE <CLAUSE>] [LIMIT N];

SELECT queries may name columns explicitly (in a comma separated list) or use the * wildcard. If you use the * wildcard, then only those columns defined in the metadata for the column family in question are returned. If columns are selected explicitly, then the name of each column must be enclosed in single quotation marks.

The following table describes the elements of the CQL SELECT statement:

Element Description
FIRST N Returns the first N (where N is determined by the column sorting strategy used for the column family in question) column values from each row, if the column family in question is sparse then it may result in a different N (or less) column values appearing from one row to the next. Because PDI deals with a constant number of fields between steps in a transformation, Cassandra rows that do not contain particular columns are output as rows with null field values for non-existent columns. Cassandra's default for FIRST (if omitted from the query) is 10,000 columns. If a query is expected to return more than 10,000 columns, then an explicit FIRST must be added to the query.
REVERSED Reverses the sort order of the columns returned by Cassandra for each row. It may affect which values result from a FIRST N option, but does not affect the order of the columns output by Cassandra Input.
WHERE clause

Filters the rows that appear in results. The clause can filter on any of the following factors:

  • A key name
  • Range of keys
  • Column values in the case of indexed columns
Key filters are specified using the KEY keyword, a relational operator (one of =, >, >=, <, and <=), and a term value.
LIMIT Limits the number of rows returned. If the query is expected to return more than 10,000 rows, an explicit LIMIT clause must be added to the query. If omitted, Cassandra assumes a default limit of 10,000 rows to be returned by the query.

WHERE Clause

When terms appear on both sides of a relational operator, the filter is applied to an indexed column. With column index filters, the term on the left of the operator is the name, the term on the right is the value to filter on. When filtering on indexed columns, at least one equality operator must be present. Using inequality operators result in ranges that are inclusive of the terms (i.e. > is the same as >=, and < is the same as <=).

A WHERE clause may be used to filter rows that appear in the results.

SELECT ... WHERE KEY = keyname AND name1 = value1
SELECT ... WHERE KEY >= startkey and KEY =< endkey AND name1 = value1
SELECT ... WHERE KEY IN ('<key>', '<key>', '<key>', ...)

Metadata Injection Support

All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.