Search and Replace Step

From Okapi Framework

Overview

This step performs search and replace actions on either the text units or the full content of input documents.

Takes: Raw document or Filter events. Sends: same as the input.

The step can take as input either a raw document or filter events.

  • If the step receives filter events, the search and replace is done on the content of the text units, and the step sends updated filter events to the next step.
  • If the step receives a raw document, the search and replace is done on the whole file, and the step sends an updated raw document to the next step. Note that in this case the raw document must be in a text-based file format for the search and replace to work: the document is seen exactly as it would be in a text editor (for example, no conversion of escaped characters is done).

Patterns are processed in the order they are declared in the list.

Parameters

The items list contains three columns:

  1. The Use column where a check box indicates if the given pattern should be used (checked) or not (un-checked).
  2. The Search for column where the text or regular expression to search for is displayed.
  3. The Replace by column where the replacement text or regular expression is displayed.

To edit an item: Double-click the item, or click the Edit button. To check and un-check an item: Click the checkbox, or press Space when the item is selected.

Items are processed in the order they appear in the list. If there is a chance of overlapping matches, make sure that the longer patterns are placed first in the list.
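The effect of item order can be illustrated with a minimal Java sketch. It is not the step's actual code: the class and method names are hypothetical, and it assumes plain literal matching where each item works on the output of the previous one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OrderedReplaceSketch {

    // Applies each search/replace pair in declaration order (a LinkedHashMap
    // keeps insertion order). Each pair sees the output of the previous one.
    static String applyInOrder(String text, Map<String, String> items) {
        String result = text;
        for (Map.Entry<String, String> item : items.entrySet()) {
            result = result.replace(item.getKey(), item.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Longer pattern first: "colour scheme" is handled before "colour".
        Map<String, String> longerFirst = new LinkedHashMap<>();
        longerFirst.put("colour scheme", "palette");
        longerFirst.put("colour", "color");
        System.out.println(applyInOrder("a colour scheme", longerFirst));

        // Shorter pattern first: "colour" is rewritten before the longer
        // pattern is tried, so the second item no longer finds a match.
        Map<String, String> shorterFirst = new LinkedHashMap<>();
        shorterFirst.put("colour", "color");
        shorterFirst.put("colour scheme", "palette");
        System.out.println(applyInOrder("a colour scheme", shorterFirst));
    }
}
```

With the longer pattern first the result is "a palette"; with the shorter pattern first it is "a color scheme".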

The patterns use the following special expressions:

Search for, with Regular Expressions OFF:

  • \n = line-feed
  • \r = carriage-return
  • \t = tab
  • \\ = backslash
  • \uHHHH = Unicode character HHHH (in hexadecimal)

Search for, with Regular Expressions ON:

  • \n = line-feed
  • \r = carriage-return
  • \t = tab
  • \\ = backslash
  • \uHHHH = Unicode character HHHH (in hexadecimal)
  • Any Java regular expression pattern (see Java Class Patterns)

Replace by, with Regular Expressions OFF:

  • \n = line-feed
  • \r = carriage-return
  • \t = tab
  • \\ = backslash
  • \uHHHH = Unicode character HHHH (in hexadecimal)

Replace by, with Regular Expressions ON:

  • \n = line-feed
  • \r = carriage-return
  • \t = tab
  • \\ = backslash
  • \uHHHH = Unicode character HHHH (in hexadecimal)
  • \$ = dollar sign
  • $N = match of search group N

Note that in all cases: You must use \\ to represent a literal '\'. For any sequence \C where C is not a special case, the result is the literal character itself. For example \* is a '*'.
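As an illustration of these escaping rules when regular expressions are off, here is a rough Java sketch of how such a string could be converted to its literal form. The class and method names are hypothetical, and the handling of edge cases (a lone trailing backslash, a \u not followed by four hexadecimal digits) is an assumption, not documented behavior of the step.

```java
class UnescapeSketch {

    // Converts \n, \r, \t, \\, \uHHHH and any other \C into literal characters.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                // Ordinary character, or a lone trailing backslash (kept as is: assumption).
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            switch (next) {
                case 'n': out.append('\n'); i += 2; break;
                case 'r': out.append('\r'); i += 2; break;
                case 't': out.append('\t'); i += 2; break;
                case 'u':
                    if (i + 6 <= s.length() && isHex(s, i + 2, i + 6)) {
                        out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                        i += 6;
                    } else {
                        // Not followed by four hex digits: treated like any other \C (assumption).
                        out.append('u');
                        i += 2;
                    }
                    break;
                default:
                    // Covers \\ and any other \C: the result is the character itself.
                    out.append(next);
                    i += 2;
                    break;
            }
        }
        return out.toString();
    }

    // True if every character of s in [from, to) is a hexadecimal digit.
    static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }
}
```

Note that the backslashes are doubled again when such strings are written as Java source literals: the pattern typed as \t in the step is "\\t" in Java code.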

Add — Click this button to add a new item in the list.

Edit — Click this button to edit the item currently selected.

Remove — Click this button to remove the item currently selected from the list.

Move Up — Click this button to move the item currently selected upward.

Move Down — Click this button to move the item currently selected downward.

Import — Click this button to import an existing file that contains saved options for this step.

Export — Click this button to export to a file the current options for this step.

Use regular expressions — Set this option to enable the Regular Expressions mode. Note that the placeholders to use in the replacement string for the search group are "$1", "$2", etc. ("$1" corresponds to the part matching the first group of the search expression, etc.) See the documentation on Matcher.replaceAll() for more details.
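The group placeholders behave as in the standard java.util.regex API. The following sketch uses Matcher.replaceAll() directly; the class, the method names, and the sample patterns are made up for illustration.

```java
import java.util.regex.Pattern;

class GroupReplaceSketch {

    // Reorders a MM/DD/YYYY date into YYYY-MM-DD using the three search groups.
    // In the step itself you would type (\d{2})/(\d{2})/(\d{4}) and $3-$1-$2;
    // the doubled backslashes below are only Java string-literal escaping.
    static String reorderDate(String text) {
        Pattern p = Pattern.compile("(\\d{2})/(\\d{2})/(\\d{4})");
        return p.matcher(text).replaceAll("$3-$1-$2");
    }

    // Uses \$ for a literal dollar sign, followed by the first group.
    static String prefixDollar(String text) {
        Pattern p = Pattern.compile("(\\d+) USD");
        return p.matcher(text).replaceAll("\\$$1");
    }

    public static void main(String[] args) {
        System.out.println(reorderDate("Due 12/31/2024"));
        System.out.println(prefixDollar("Price: 5 USD"));
    }
}
```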

Path of file with replacements — Provide the path to a tab-delimited, UTF-8 encoded, two-column text file. The first column contains the search strings and the second contains the replacement strings. This file is not loaded into the table below but is processed separately, AFTER the table expressions have been processed. The replacements in the tab-delimited file are currently limited to literal searches (no regular expressions; sub-strings are matched as well). You can use the ${rootDir} and ${inputRootDir} variables in the specified path.
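A replacement file of this kind could be handled along the following lines. This is only a sketch under assumptions: the class and method names are hypothetical, lines without a tab or with an empty search string are skipped, and the pairs are applied in file order.

```java
import java.util.ArrayList;
import java.util.List;

class TabFileReplaceSketch {

    // Splits each "search<TAB>replacement" line into a pair.
    // Assumption: lines with no tab, or with an empty search string, are skipped.
    static List<String[]> parse(List<String> lines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) continue;
            pairs.add(new String[] { line.substring(0, tab), line.substring(tab + 1) });
        }
        return pairs;
    }

    // Applies the pairs as literal (non-regex) replacements; String.replace
    // also matches inside longer words, i.e. on sub-strings.
    static String apply(String text, List<String[]> pairs) {
        String result = text;
        for (String[] pair : pairs) {
            result = result.replace(pair[0], pair[1]);
        }
        return result;
    }
}
```

In practice the lines could come from Files.readAllLines(path, StandardCharsets.UTF_8). For example, a file line "cat<TAB>dog" turns "concatenate" into "condogenate", because the match is literal and not limited to whole words.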

Save in the following file a log of the replacements performed — Set this option and enter a path to save a log of which replacements have been made. You can use the ${rootDir} and ${inputRootDir} variables in the path of the log file.

Regular expression options

When the Use regular expressions option is checked, several regular expression options become accessible. They apply to all expressions, not just the selected one:

Dot also matches line-feed — Set this option to change the meaning of the period character "." so that it matches every character instead of every character except \n.

Multiline — Set this option to change the meaning of "^" and "$" so that they match at the beginning and end of any line, not just the beginning and end of the whole string.

Ignore case differences — Set this option to ignore case when matching. For example, "bear" matches "Bearcat", "BEARCAT", and "bearcat".
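These three options resemble the standard java.util.regex.Pattern flags DOTALL, MULTILINE, and CASE_INSENSITIVE. The sketch below assumes that mapping; the class and method names are hypothetical, and the step's own implementation may combine the flags differently.

```java
import java.util.regex.Pattern;

class RegexOptionsSketch {

    // Builds a flag set from the three check boxes (assumed mapping).
    static int flags(boolean dotAll, boolean multiline, boolean ignoreCase) {
        int f = 0;
        if (dotAll) f |= Pattern.DOTALL;              // "." also matches line terminators
        if (multiline) f |= Pattern.MULTILINE;        // "^" and "$" match at every line
        if (ignoreCase) f |= Pattern.CASE_INSENSITIVE; // ASCII-only unless UNICODE_CASE is added
        return f;
    }

    // Replaces every match of regex in text, compiled with the given flags.
    static String replace(String regex, String replacement, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";
        System.out.println(replace("a.b", "X", twoLines, flags(true, false, false)));
        System.out.println(replace("^b", "X", twoLines, flags(false, true, false)));
        System.out.println(replace("bear", "X", "BEARCAT", flags(false, false, true)));
    }
}
```

Because the flags are passed once at compile time, they naturally apply to the whole expression, which matches the statement that the options are not set per item.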

Replace all instances of the pattern — Set this option to replace all matches found. If this option is not set only the first match is replaced. When processing text units, this option replaces all or only one occurrence for each text unit. When processing the whole file, this option replaces all or only one occurrence for the whole file.
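The difference in scope between the two input modes can be sketched as follows. This is an illustration only: text units are modeled as plain strings and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ReplaceScopeSketch {

    // Replaces all matches, or only the first one, within a single piece of content.
    static String replaceIn(String content, Pattern p, String replacement, boolean all) {
        return all ? p.matcher(content).replaceAll(replacement)
                   : p.matcher(content).replaceFirst(replacement);
    }

    // Text-unit mode: each unit is processed on its own, so "first match only"
    // means the first match in every unit.
    static List<String> replaceInUnits(List<String> units, Pattern p, String replacement, boolean all) {
        List<String> result = new ArrayList<>();
        for (String unit : units) {
            result.add(replaceIn(unit, p, replacement, all));
        }
        return result;
    }
}
```

With the pattern "a", the replacement "b", and the option unset, two text units "a a" and "a a" both become "b a", while the same text processed as one raw document "a a\na a" only has its very first "a" replaced.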

When processing text units (i.e. using a filter)

Search and replace the source content — Set this option to perform the search and replace in the source content. This is not enabled by default.

Search and replace the target content — Set this option to perform the search and replace in the target content. This is enabled by default.

Note: When working on the target content, you may have to use the Create Target Step before this step to make sure there is content in the target.

Both of these options are ignored if the step is processing a raw document.

Limitations

  • When working with a raw document as input, the whole content of the document is loaded in memory. This may result in problems with very large documents.