Whitespace Correction Step

From Okapi Framework

Overview

This step simplifies adding or removing inter-segment whitespace when translating to or from Chinese or Japanese, whose scripts do not typically use it. The step performs one of two tasks, depending on the source and target locales:

  • When translating from a space-delimited language to a non-space-delimited language, whitespace following segment-ending punctuation will be removed.
  • When translating from a non-space-delimited language to a space-delimited language, whitespace will be added following segment-ending punctuation.

This step will perform no action when translating from one space-delimited language to another space-delimited language (for example, from English to French), or when translating between Chinese and Japanese.
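
The decision described above can be sketched as a small function. This is only an illustration, not the Okapi implementation: the names is_space_delimited and choose_action are hypothetical, and the sketch assumes that Chinese ("zh") and Japanese ("ja") are the only non-space-delimited languages involved.

```python
# Hypothetical helpers for illustration; these are not part of the Okapi API.
# Assumption: only Chinese and Japanese count as non-space-delimited.
NON_SPACE_DELIMITED = {"zh", "ja"}


def is_space_delimited(locale: str) -> bool:
    """Return True unless the locale's language code is Chinese or Japanese."""
    language = locale.replace("_", "-").split("-")[0].lower()
    return language not in NON_SPACE_DELIMITED


def choose_action(source_locale: str, target_locale: str) -> str:
    """Return "remove", "add", or "none" for a source/target locale pair."""
    source_spaced = is_space_delimited(source_locale)
    target_spaced = is_space_delimited(target_locale)
    if source_spaced and not target_spaced:
        return "remove"  # e.g. English to Japanese
    if not source_spaced and target_spaced:
        return "add"     # e.g. Chinese to French
    return "none"        # both space-delimited, or Chinese/Japanese on both sides
```

For example, choose_action("en-US", "ja-JP") selects removal, while choose_action("en", "fr") and choose_action("zh", "ja") select no action.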

Takes: Filter events. Sends: Filter events.

Parameters

The step can be configured to apply its space adjustment to each of the following classes of punctuation:

  • Full Stop - Converts Ideographic Full Stop (U+3002) and Full-width Full Stop (U+FF0E) to/from a period.
  • Comma - Converts Ideographic Comma (U+3001) and Full-width Comma (U+FF0C) to/from a comma.
  • Exclamation Point - Converts Full-width Exclamation Mark (U+FF01) to/from an exclamation point.
  • Question Mark - Converts Full-width Question Mark (U+FF1F) to/from a question mark.
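
As a rough illustration of how these classes could drive the whitespace adjustment, the sketch below treats each segment as a plain string and looks only at its final punctuation character. The class keys, the function names, and the choice to accept both the ASCII and the ideographic/full-width form of each enabled class are assumptions made for the example; the sketch covers the whitespace adjustment only and does not reflect the step's actual parameter names or internals.

```python
# Hypothetical class names mapped to the characters listed above,
# plus their ASCII counterparts.
PUNCTUATION_CLASSES = {
    "full_stop": ".\u3002\uFF0E",
    "comma": ",\u3001\uFF0C",
    "exclamation_point": "!\uFF01",
    "question_mark": "?\uFF1F",
}


def _enabled_chars(enabled):
    """Collect every character belonging to the enabled classes."""
    return "".join(PUNCTUATION_CLASSES[name] for name in enabled)


def remove_trailing_whitespace(segment, enabled):
    """Drop whitespace that follows enabled segment-ending punctuation."""
    stripped = segment.rstrip()
    if stripped and stripped[-1] in _enabled_chars(enabled):
        return stripped
    return segment  # ending punctuation not enabled: leave untouched


def add_trailing_whitespace(segment, enabled):
    """Append one space after enabled segment-ending punctuation."""
    if segment and segment[-1] in _enabled_chars(enabled):
        return segment + " "
    return segment  # no enabled punctuation at the very end
```

With only the full stop class enabled, remove_trailing_whitespace("こんにちは。 ", ["full_stop"]) drops the trailing space, while a segment ending in an exclamation point is left unchanged.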

Limitations

This process is not foolproof: it assumes that each source segment contains a single sentence and that the segment has been translated as a single sentence in the target language.