Hitachi Vantara Lumada and Pentaho Documentation

Regex Evaluation

The Regex evaluation step matches the strings of an input field against a text pattern you define with a regular expression (regex). This step uses the java.util.regex package. The syntax for creating the regular expressions used by this step is defined in the java.util.regex.Pattern javadoc.

You can use this step to parse a complex string of text and create new fields out of the input field with capture groups (defined by parentheses). For example, if you have an input field containing an author's name in quotes and the number of posts made by them, you can create two new fields in your transformation - one for the name, and one for the number of posts as shown below:

Text to parse:

"Author, Ann" - 53 posts

Regex to create two capture groups:

^"([^"]*)" - (\d*) posts$

The resulting field values are Author, Ann (the name) and 53 (the number of posts).
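
The same pattern can be tried outside PDI with plain java.util.regex. The following is a minimal sketch, not the step's implementation; the class name AuthorPostsExample and its parse helper are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorPostsExample {
    // Same pattern as in the example: group 1 is the quoted name, group 2 the post count.
    static final Pattern PATTERN = Pattern.compile("^\"([^\"]*)\" - (\\d*) posts$");

    // Hypothetical helper: returns the two captured substrings, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = parse("\"Author, Ann\" - 53 posts");
        System.out.println(fields[0]); // Author, Ann
        System.out.println(fields[1]); // 53
    }
}
```

Note that the quotation marks are outside the first pair of parentheses, so they are not part of the captured name.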

General

Enter the following information in the transformation step field:

  • Step name: Specifies the unique name of the Regex evaluation step on the canvas. You can customize the name or leave it as the default.

Capture Group Fields table

Use the Capture Group Fields table to specify the new fields for the substrings captured by the regular expression from the input string.

  • New field: Name of the new field generated from the regular expression.
  • Type: Type of data.
  • Length: Length of the field.
  • Precision: Number of floating-point digits for number-type fields.
  • Format: An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this step.
  • Group: The grouping symbol, which can be "," (10,000.00, for example) or "." (5.000,00, for example).
  • Decimal: The character used as a decimal point.
  • Currency: Currency symbol ($ or €, for example).
  • Null If: Treat this value as null.
  • Default: Default value when the field in the incoming file is not specified (empty).
  • Trim: The trim method to apply to a string.
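
To illustrate how a format mask, a grouping symbol, and a decimal symbol work together when a captured string is converted to a number, here is a rough sketch using java.text.DecimalFormat. It assumes the conversion behaves like DecimalFormat parsing; how the step performs the conversion internally is not described here, and CapturedNumberExample and toNumber are hypothetical names.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

class CapturedNumberExample {
    // Hypothetical helper: parses a captured string using a mask plus explicit
    // grouping and decimal symbols (the Format, Group, and Decimal columns).
    static double toNumber(String captured, String mask, char group, char decimal) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setGroupingSeparator(group);
        symbols.setDecimalSeparator(decimal);
        DecimalFormat format = new DecimalFormat(mask, symbols);
        try {
            return format.parse(captured).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + captured, e);
        }
    }

    public static void main(String[] args) {
        // Group "," and decimal "."
        System.out.println(toNumber("10,000.00", "#,##0.00", ',', '.')); // 10000.0
        // Group "." and decimal ","
        System.out.println(toNumber("5.000,00", "#,##0.00", '.', ',')); // 5000.0
    }
}
```

The mask itself always uses "," and "." as placeholders; the symbols decide which characters appear in the data.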

Options

This step features several tabs with fields. Each tab is described below.

Settings tab

The Settings tab contains the following options:

  • Field to evaluate: Specify the name of the field from the incoming PDI stream to be matched against the regular expression.
  • Result field name: Specify the name of the output field. This field is added to the outgoing PDI stream and has a value of Y if the value of the input field matched the regular expression or N if it did not.
  • Create fields for capture groups: Select to create new fields based on the capture groups in the regular expression. When this option is selected, the substrings captured by the groups are extracted and stored in new output fields that you specify in the Capture Group Fields table. Each capture group must have a field defined in the Capture Group Fields table. The order of the fields in the table must be the same as the order of the capturing groups in the regular expression. You can change the data type using the columns in the table.
  • Replace previous fields: Select to replace fields from the incoming PDI stream with the capture group fields when they have the same name. If this option is cleared, new fields are added to the outgoing PDI stream for each capturing group field. This option is available when you select the Create fields for capture groups option.
  • Regular expression: Specify your regular expression. Click Test regEx to open the Regular expression evaluation window.
  • Use variable substitution: Select to expand variable references to their values before evaluating the regular expression pattern.
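
The relationship between the result field, the capture group fields, and the order of the groups can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the step's code: RegexEvalSketch and evaluate are hypothetical names, the sketch assumes the whole input value must match the pattern, and it assumes capture group fields are left null when there is no match.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEvalSketch {
    // Hypothetical helper: builds an output row from one input value.
    static Map<String, String> evaluate(String regex, String value,
                                        String resultField, List<String> captureFields) {
        Matcher m = Pattern.compile(regex).matcher(value);
        // One field definition is required per capture group.
        if (m.groupCount() != captureFields.size()) {
            throw new IllegalArgumentException("Pattern has " + m.groupCount()
                    + " capture groups but " + captureFields.size() + " fields were defined");
        }
        boolean matched = m.matches(); // assumption: the entire value must match
        Map<String, String> row = new LinkedHashMap<>();
        row.put(resultField, matched ? "Y" : "N");
        for (int i = 0; i < captureFields.size(); i++) {
            // The field at position i receives capture group i + 1.
            row.put(captureFields.get(i), matched ? m.group(i + 1) : null);
        }
        return row;
    }

    public static void main(String[] args) {
        String regex = "^\"([^\"]*)\" - (\\d*) posts$";
        List<String> fields = Arrays.asList("name", "posts");
        System.out.println(evaluate(regex, "\"Author, Ann\" - 53 posts", "matched", fields));
        System.out.println(evaluate(regex, "unrelated text", "matched", fields));
    }
}
```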

Regular expression evaluation window

You can test your regular expression against three different input strings in the Regular expression evaluation window.

If your expression contains capture groups, type a string in the Compare section. The field below the string shows the string split according to your groups.

The window contains the following options:

  • Please enter a new regular expression or modify: Specify your regular expression.
  • Values to test: Specify the values (Value1, Value2, or Value3) to test against your expression. The background turns green if the value matches your expression or red if it does not.
  • Capture from value: Displays the value of the captured string.
  • Captured fields: Displays the value of the captured groups.

Content tab

The Content tab contains the following options:

  • Ignore differences in Unicode encodings: Select to ignore different Unicode character encodings. This action may improve performance, but your data can only contain US-ASCII characters.
  • Enables case-insensitive matching: Select to use case-insensitive matching. Only characters in the US-ASCII charset are matched. Unicode-aware case-insensitive matching can be enabled by specifying the 'Unicode-aware case...' flag in conjunction with this flag. The execution flag is (?i).
  • Permit whitespace and comments in pattern: Select to ignore whitespace in the pattern and embedded comments starting with # through the end of the line. In this mode, you must use the \s token to match whitespace. If this option is not enabled, whitespace characters appearing in the regular expression are matched as-is. The execution flag is (?x).
  • Enable dotall mode: Select to have the dot character also match line terminators. The execution flag is (?s).
  • Enable multiline mode: Select to have '^' and '$' match at the start and end of each line of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence. The execution flag is (?m).
  • Enable Unicode-aware case folding: Select this option in conjunction with the Enables case-insensitive matching option to perform case-insensitive matching consistent with the Unicode standard. The execution flag is (?u).
  • Enables Unix lines mode: Select to recognize only the '\n' line terminator in the behavior of '.', '^', and '$'. The execution flag is (?d).
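
The execution flags listed above are the embedded flag expressions of java.util.regex.Pattern, so their effect can be checked in plain Java. The following sketch writes each flag inline in the pattern; whether the step applies the options as embedded flags or as compile-time flags is not stated here, and FlagExamples with its two helpers is a hypothetical name.

```java
import java.util.regex.Pattern;

class FlagExamples {
    // True when the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True when the pattern matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching for US-ASCII characters
        System.out.println(matches("(?i)posts", "POSTS"));           // true
        // (?x): whitespace and # comments in the pattern are ignored
        System.out.println(matches("(?x) \\d+  # digits", "53"));    // true
        // (?s): the dot also matches line terminators
        System.out.println(matches("a.b", "a\nb"));                  // false
        System.out.println(matches("(?s)a.b", "a\nb"));              // true
        // (?m): ^ and $ match at the start and end of each line
        System.out.println(found("^b$", "a\nb\nc"));                 // false
        System.out.println(found("(?m)^b$", "a\nb\nc"));             // true
        // (?u) with (?i): Unicode-aware case folding (e-acute vs. E-acute)
        System.out.println(matches("(?i)\u00e9", "\u00c9"));         // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9"));        // true
        // (?d): only \n counts as a line terminator, so the dot matches \r
        System.out.println(matches("a.b", "a\rb"));                  // false
        System.out.println(matches("(?d)a.b", "a\rb"));              // true
    }
}
```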

Examples

Suppose your input field contains the following text value, including the trailing period:

"Author, Ann" - 53 posts.

The following regular expression creates four capturing groups and can be used to parse out the different parts:

^"(([^"]+), ([^"]+))" - (\d+) posts\.$

This expression creates the following four capturing groups, which become output fields:

  • Fullname: (([^"]+), ([^"]+))
  • Lastname: ([^"]+)
  • Firstname: ([^"]+)
  • Number of posts: (\d+)

In this example, a field definition must be present for each of these capturing groups.

If the number of capture groups in the regular expression does not match the number of fields specified, the step fails and writes an error to the log. Capturing groups can be nested. In the example above, the Lastname and Firstname fields correspond to capturing groups that are themselves contained inside the Fullname capturing group.
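
Nested groups are numbered by the position of their opening parenthesis, from left to right, which is why Fullname comes before Lastname and Firstname. The sketch below checks this with plain java.util.regex, assuming the pattern ^"(([^"]+), ([^"]+))" - (\d+) posts\.$ and an input value that ends with a period; NestedGroupsExample and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsExample {
    // Group 1 (Fullname) encloses group 2 (Lastname) and group 3 (Firstname);
    // group 4 is the number of posts.
    static final Pattern PATTERN =
            Pattern.compile("^\"(([^\"]+), ([^\"]+))\" - (\\d+) posts\\.$");

    // Hypothetical helper: returns all captured groups in order, or null when there is no match.
    static String[] parse(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] g = parse("\"Author, Ann\" - 53 posts.");
        System.out.println(g[0]); // Author, Ann
        System.out.println(g[1]); // Author
        System.out.println(g[2]); // Ann
        System.out.println(g[3]); // 53
    }
}
```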

The design-tools/data-integration/samples/transformations directory contains the Regex Eval - parse NCSA access log records.ktr sample transformation, which is another example of how to use this step.