Full-Width Conversion Step

From Okapi Framework

Overview

This step converts characters in text units from or to full-width form.

Takes: Filter events. Sends: Filter events.

For historical reasons, some Asian character sets have two display forms for some characters: half-width and full-width. This step allows you to convert from one form to the other. The modification is done in the text of the text units for the specified target locale. If there is no text for the specified target, the source text is copied to the target and processed.

Parameters

Convert full-width characters to half-width or ASCII equivalents — Select this option to convert all full-width characters to their half-width or ASCII equivalents. For example, the character 'Ｑ' (U+FF31) is converted to 'Q' (U+0051) and the character 'サ' (U+30B5) is converted to 'ｻ' (U+FF7B).
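The Latin part of this conversion can be illustrated with a minimal Python sketch. It is not the step's actual code: it relies only on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from U+0021 to U+007E. The function name `to_ascii` is hypothetical, and mapping the ideographic space U+3000 to a plain space is an assumption of the sketch, not documented behavior of the step.

```python
FULLWIDTH_OFFSET = 0xFEE0  # distance between U+FF01..U+FF5E and U+0021..U+007E


def to_ascii(text):
    """Map full-width ASCII variants to their ASCII equivalents (illustrative only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # Full-width form of a printable ASCII character.
            out.append(chr(cp - FULLWIDTH_OFFSET))
        elif cp == 0x3000:
            # Assumption of this sketch: ideographic space becomes a plain space.
            out.append(" ")
        else:
            # Everything else, including Katakana, is left untouched here.
            out.append(ch)
    return "".join(out)


print(to_ascii("\uff31"))  # full-width Q (U+FF31) becomes Q (U+0051)
```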

Additional characters that are not full-width forms can also be converted:

Include Squared Latin Abbreviations of the CJK Compatibility block — Set this option to also convert the Squared Latin Abbreviations of the CJK Compatibility block into sequences of non-CJK characters. For example, '㏀' (U+33C0) is converted to "kΩ" (U+006B, U+03A9).
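The example above matches the Unicode compatibility decomposition of U+33C0, so the idea can be sketched with Python's standard `unicodedata` module. This is an illustration, not the step's implementation: the function name `expand_squared` and the code point range U+3380 to U+33DF used to select the squared abbreviations are assumptions of the sketch, and the step may cover a different set of characters.

```python
import unicodedata


def expand_squared(text, first=0x3380, last=0x33DF):
    """Expand squared abbreviations in the given (assumed) range via NFKC."""
    out = []
    for ch in text:
        if first <= ord(ch) <= last:
            # Compatibility normalization replaces the square with its letters.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


expanded = expand_squared("10\u33c0")
print([f"U+{ord(c):04X}" for c in expanded])
```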

Include special characters of the Letter-Like Symbols block — Set this option to also convert several characters of the Letter-Like Symbols block to character sequences. The conversions are shown in the following table:

Letter-Like Symbol    Character sequence
U+2100                a/c
U+2101                a/s
U+2105                c/o
U+2103                °C
U+2109                °F
U+2116                No
U+212A                K
U+212B                Å
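The table above can be expressed as a simple lookup. The following Python sketch is illustrative only; the name `LETTERLIKE` is hypothetical, and the entries are copied from the table rather than from the step's source code.

```python
# Translation table built from the conversions listed above.
LETTERLIKE = str.maketrans({
    "\u2100": "a/c",
    "\u2101": "a/s",
    "\u2105": "c/o",
    "\u2103": "\u00b0C",  # DEGREE CELSIUS -> degree sign + C
    "\u2109": "\u00b0F",  # DEGREE FAHRENHEIT -> degree sign + F
    "\u2116": "No",
    "\u212a": "K",        # KELVIN SIGN -> Latin capital K
    "\u212b": "\u00c5",   # ANGSTROM SIGN -> Latin capital A with ring above
})

print("25\u2103".translate(LETTERLIKE))
```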

Include Japanese Katakana and associated punctuation — Set this option to convert Japanese Katakana and associated punctuation (。、「」, etc.) into their half-width forms. This is a separate option (and off by default) in order to facilitate normalizing modern Japanese text: Japanese text may contain full-width alphanumeric characters that should be normalized to half-width, while Katakana should remain full-width. (Available in 0.29-SNAPSHOT and later)
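One way to picture the Katakana narrowing is to invert the compatibility decompositions of the Halfwidth Katakana range (U+FF61 to U+FF9F). The Python sketch below is an assumption-laden illustration, not the step's code: the names `NARROW` and `narrow_katakana` are hypothetical, and the step's exact character coverage may differ.

```python
import unicodedata

# Reverse table: full-width code point -> half-width character, derived from
# the compatibility decomposition of each character in U+FF61..U+FF9F.
NARROW = {}
for cp in range(0xFF61, 0xFFA0):
    wide = unicodedata.normalize("NFKD", chr(cp))
    NARROW[ord(wide)] = chr(cp)


def narrow_katakana(text):
    """Convert Katakana and related punctuation to half-width forms."""
    out = []
    for ch in text:
        # Split voiced kana (e.g. U+30D7 -> U+30D5 + U+309A) so that each
        # piece has a half-width counterpart.
        pieces = unicodedata.normalize("NFD", ch)
        if all(ord(p) in NARROW for p in pieces):
            out.append("".join(NARROW[ord(p)] for p in pieces))
        else:
            # No half-width form for this character: keep it unchanged.
            out.append(ch)
    return "".join(out)


print([f"U+{ord(c):04X}" for c in narrow_katakana("\u30d7")])
```

Full-width alphanumeric characters pass through this function unchanged, which mirrors the reason the option is kept separate.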

Convert half-width and ASCII characters to full-width equivalents — Select this option to convert all half-width and ASCII characters to their full-width equivalents. For example, the character 'Q' (U+0051) is converted to 'Ｑ' (U+FF31) and the character 'ｻ' (U+FF7B) is converted to 'サ' (U+30B5).

Convert only the ASCII characters — Set this option to convert only the ASCII characters to full-width. When this option is set, only ASCII characters are affected; half-width characters are left unchanged.
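As a rough illustration of the ASCII-only case, the Python sketch below shifts the printable ASCII characters U+0021 to U+007E up by 0xFEE0 to reach their full-width variants. It is not the step's implementation: the names `WIDEN_ASCII` and `widen_ascii` are hypothetical, and leaving the plain space untouched is an assumption of the sketch.

```python
# Printable ASCII (U+0021..U+007E) -> full-width variants (U+FF01..U+FF5E).
WIDEN_ASCII = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}


def widen_ascii(text):
    """Convert only ASCII characters to full-width; leave all others as-is."""
    return text.translate(WIDEN_ASCII)


print([f"U+{ord(c):04X}" for c in widen_ascii("Q\uff7b")])
```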

Convert only Japanese Katakana and associated punctuation — Set this option to convert Japanese Katakana and associated punctuation (。、「」, etc.) into their full-width forms. This is a separate option in order to facilitate normalizing modern Japanese text: Japanese text may contain half-width Katakana that should be converted to full-width, while alphanumeric characters should remain half-width. (Available in 0.29-SNAPSHOT and later)
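A Katakana-only widening can be sketched in Python by applying a compatibility decomposition to characters in the Halfwidth Katakana range (U+FF61 to U+FF9F) and nothing else. The function name `widen_katakana` is hypothetical and the range is an assumption of the sketch, not a statement about the step's code. The result is deliberately left unnormalized, so a half-width kana followed by a sound mark comes out in decomposed form.

```python
import unicodedata


def widen_katakana(text):
    """Widen half-width Katakana and punctuation; leave ASCII untouched."""
    out = []
    for ch in text:
        if 0xFF61 <= ord(ch) <= 0xFF9F:
            # Each half-width form decomposes to one full-width character.
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print([f"U+{ord(c):04X}" for c in widen_katakana("ABC \uff7b")])
```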

Normalize output — Apply Unicode NFC normalization to the output text (if any conversions are made). Converting half-width forms to full-width can result in decomposed forms, for instance 'ﾌﾟ' (U+FF8C U+FF9F) → 'プ' (U+30D5 U+309A). Normalization ensures that the standard composed representation is used: 'プ' (U+30D7). (Available in 0.29-SNAPSHOT and later)
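The effect of normalization can be reproduced with Python's standard `unicodedata` module. This is only a sketch of the Unicode behavior described above, not the step's code; the helper name `code_points` is hypothetical, and NFKD is used here as a stand-in for the half-width to full-width conversion.

```python
import unicodedata


def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


half = "\uff8c\uff9f"                            # half-width FU + semi-voiced mark
wide = unicodedata.normalize("NFKD", half)       # widened, still decomposed
composed = unicodedata.normalize("NFC", wide)    # single precomposed character

print(code_points(half))      # U+FF8C U+FF9F
print(code_points(wide))      # U+30D5 U+309A
print(code_points(composed))  # U+30D7
```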

Limitations

None known.