Cleanup Step

This step cleans strings by normalizing quotes, punctuation, and similar characters so that the text is ready for further processing.

Takes: Filter events. Sends: Filter events.

By default, all whitespace is normalized before any further processing is performed: each run of multiple space, tab, or similar whitespace characters is replaced with a single instance.
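The default whitespace normalization can be pictured with the minimal Python sketch below. It is not the step's actual implementation: the function name normalize_whitespace is hypothetical, and the sketch assumes that each run of whitespace collapses to one plain space, which the description above does not spell out.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space.

    Assumption: the replacement character is a plain space; the real step
    may keep a different "single instance" of the run.
    """
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    # A run of spaces, a tab, and blank lines each become one space.
    print(normalize_whitespace("Hello  \t world\n\nagain"))
```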


Normalize quotation marks — Set this option to replace all quotation marks with straight double quotes (") and all apostrophes with single straight quotes (').
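As a rough illustration of quote normalization, the Python sketch below maps a few typographic quotation marks to their straight equivalents. The character sets and the name normalize_quotes are assumptions chosen for the example; the step itself may recognize a different or larger set of characters.

```python
# Hypothetical character sets, chosen for illustration only.
DOUBLE_QUOTES = "\u201c\u201d\u201e\u00ab\u00bb"  # curly, low-9, and angle double quotes
SINGLE_QUOTES = "\u2018\u2019\u201a"              # curly and low-9 single quotes

# Build one translation table: double variants -> ", single variants -> '
_QUOTE_TABLE = str.maketrans(
    {**{ch: '"' for ch in DOUBLE_QUOTES}, **{ch: "'" for ch in SINGLE_QUOTES}}
)


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes and apostrophes with straight ones."""
    return text.translate(_QUOTE_TABLE)


if __name__ == "__main__":
    print(normalize_quotes("\u201cIt\u2019s done,\u201d she said."))
```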

Mark segments matching default regular expressions for removal — This option is not currently used.

Mark segments matching user defined regular expressions for removal — Set this option to remove text units containing text that matches the user-defined regular expression.

Check for corrupt or unexpected characters — Set this option to detect and remove text units that contain common corrupt character strings.

Remove unnecessary segments from text unit — Set this option to remove text units that have been marked for removal or have no target text.
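The way the marking and removal options relate can be sketched in Python as follows. This is an illustrative model, not the Okapi API: the TextUnit class, its fields, and both function names are hypothetical, and the sketch assumes that the pattern is checked against source and target text and that a whitespace-only target counts as "no target text".

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextUnit:
    """Hypothetical stand-in for a text unit; not the Okapi class."""
    source: str
    target: Optional[str] = None
    marked_for_removal: bool = False


def mark_matching(units: List[TextUnit], pattern: str) -> None:
    """Mark every unit whose source or target matches the user-defined pattern."""
    regex = re.compile(pattern)
    for unit in units:
        in_source = regex.search(unit.source) is not None
        in_target = unit.target is not None and regex.search(unit.target) is not None
        if in_source or in_target:
            unit.marked_for_removal = True


def remove_unnecessary(units: List[TextUnit]) -> List[TextUnit]:
    """Keep only units that are unmarked and have non-empty target text."""
    return [
        u for u in units
        if not u.marked_for_removal and u.target is not None and u.target.strip()
    ]


if __name__ == "__main__":
    units = [
        TextUnit("Hello", "Bonjour"),
        TextUnit("### 1", "### 1"),   # matches the pattern below
        TextUnit("Bye"),              # no target text
    ]
    mark_matching(units, r"^#+")
    for unit in remove_unnecessary(units):
        print(unit.source, "->", unit.target)
```

Separating marking from removal mirrors the options above: a unit can be flagged by one option and only dropped when the removal option is also set.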


This step does not work with Asian or bidirectional languages.