Whitespace Correction Step
Overview
This step simplifies adding or removing inter-segment whitespace when translating to or from Chinese or Japanese, whose scripts do not typically use it. The step performs one of two tasks, depending on the source and target locales:
- When translating from a space-delimited language to a non-space-delimited language, whitespace following segment-ending punctuation will be removed.
- When translating from a non-space-delimited language to a space-delimited language, whitespace will be added following segment-ending punctuation.
This step will perform no action when translating from one space-delimited language to another space-delimited language (for example, from English to French), or when translating between Chinese and Japanese.
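The decision described above can be sketched as a small function. This is an illustrative sketch, not the step's actual implementation: the names `NON_SPACE_DELIMITED`, `primary_language`, and `whitespace_action` are hypothetical, and it assumes that locales are given as language tags such as `en-US` or `ja` and that only the `zh` and `ja` language codes count as non-space-delimited.

```python
# Assumption: only Chinese and Japanese are treated as non-space-delimited.
NON_SPACE_DELIMITED = {"zh", "ja"}


def primary_language(locale: str) -> str:
    """Return the lowercase language code of a tag such as 'zh-CN' or 'ja_JP'."""
    return locale.replace("_", "-").split("-")[0].lower()


def whitespace_action(source_locale: str, target_locale: str) -> str:
    """Return 'remove', 'add', or 'none' for a source/target locale pair."""
    source_unspaced = primary_language(source_locale) in NON_SPACE_DELIMITED
    target_unspaced = primary_language(target_locale) in NON_SPACE_DELIMITED
    if source_unspaced == target_unspaced:
        # Both space-delimited, or both Chinese/Japanese: nothing to do.
        return "none"
    # Non-space-delimited source means whitespace must be added to the target;
    # otherwise the target is non-space-delimited and whitespace is removed.
    return "add" if source_unspaced else "remove"


if __name__ == "__main__":
    for pair in [("en-US", "ja-JP"), ("zh-CN", "fr"), ("en", "fr"), ("zh", "ja")]:
        print(pair, whitespace_action(*pair))
```

Comparing the two flags for equality covers both no-action cases (two space-delimited languages, or Chinese and Japanese together) in a single branch.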
Takes: Filter events. Sends: Filter events.
Parameters
The step can be configured to apply its space adjustment to each of the following classes of punctuation:
- Full Stop - Converts Ideographic Full Stop (U+3002) and Full-width Full Stop (U+FF0E) to/from a period.
- Comma - Converts Ideographic Comma (U+3001) and Full-width Comma (U+FF0C) to/from a comma.
- Exclamation Point - Converts Full-width Exclamation Mark (U+FF01) to/from an exclamation point.
- Question Mark - Converts Full-width Question Mark (U+FF1F) to/from a question mark.
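The punctuation classes above can be modeled as a lookup table that gates the whitespace adjustment. The sketch below is a simplified illustration under stated assumptions: the class keys and the functions `selected_chars`, `remove_trailing_space`, and `add_trailing_space` are hypothetical names, and the whitespace is treated as trailing text on a single string, which may not match where the real step finds it.

```python
# Code points are taken from the parameter list; the ASCII counterparts are
# included so the same table works in both translation directions.
PUNCTUATION = {
    "full_stop": {".", "\u3002", "\uFF0E"},
    "comma": {",", "\u3001", "\uFF0C"},
    "exclamation_point": {"!", "\uFF01"},
    "question_mark": {"?", "\uFF1F"},
}


def selected_chars(classes):
    """Collect the characters of every enabled punctuation class."""
    chars = set()
    for name in classes:
        chars |= PUNCTUATION[name]
    return chars


def remove_trailing_space(text, classes):
    """Drop whitespace that follows enabled segment-ending punctuation."""
    stripped = text.rstrip()
    if stripped and stripped[-1] in selected_chars(classes):
        return stripped
    return text  # Leave the text alone if it does not end in enabled punctuation.


def add_trailing_space(text, classes):
    """Append one space after enabled segment-ending punctuation."""
    if text and text[-1] in selected_chars(classes):
        return text + " "
    return text


if __name__ == "__main__":
    print(repr(remove_trailing_space("これはテストです。 ", ["full_stop"])))
    print(repr(add_trailing_space("This is a test.", ["full_stop"])))
```

Disabling a class simply leaves its characters out of the set, so text ending in that punctuation passes through unchanged.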
Limitations
This process is not foolproof: it assumes that each source segment contains a single sentence and that the segment has been translated as a single sentence in the target language.