XML Characters Fixing Step

From Okapi Framework

Overview

This step replaces characters that are invalid in XML with a marker.

Takes: raw document. Sends: raw document.

You must specify the input encoding of the document. The only exception is a file that starts with a Byte-Order-Mark (BOM). The output is written in the same encoding as the input document, and the Byte-Order-Mark is preserved if present.
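To illustrate why a Byte-Order-Mark makes an explicit encoding unnecessary, here is a minimal sketch of BOM detection. The class BomSniffer and its method detect are hypothetical names used for illustration; this is not the Okapi implementation, and the step may handle this differently.

```java
// Hypothetical helper, not part of the Okapi API: guesses the encoding from a leading BOM.
final class BomSniffer {

    private BomSniffer() {}

    /** Returns the encoding name implied by a leading BOM, or null if no BOM is present. */
    static String detect(byte[] head) {
        // UTF-32 is tested before UTF-16 because FF FE 00 00 begins with the UTF-16LE BOM.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the caller has to supply the encoding
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the first four bytes of a file to BomSniffer.detect is enough. A null result corresponds to the case where the input encoding has to be specified by the user.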

The step recognizes raw characters as well as numeric character references (NCRs), such as &#x000B; or &#11;.
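The following sketch shows one way such a detection could work, assuming validity is defined by the Char production of XML 1.0. The class XmlCharScan and its methods are hypothetical and are not taken from the Okapi source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Okapi implementation.
final class XmlCharScan {

    // Matches hexadecimal (&#xB;) and decimal (&#11;) numeric character references.
    private static final Pattern NCR = Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    private XmlCharScan() {}

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Collects invalid code points: first those present as raw characters,
     * then those written as NCRs (so the result is not in document order).
     */
    static List<Integer> findInvalid(String text) {
        List<Integer> found = new ArrayList<>();
        // Walk by code point so that a surrogate pair is read as one character.
        text.codePoints().filter(cp -> !isValidXmlChar(cp)).forEach(found::add);
        Matcher m = NCR.matcher(text);
        while (m.find()) {
            try {
                int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
                if (!isValidXmlChar(cp)) found.add(cp);
            } catch (NumberFormatException e) {
                // The value does not fit in an int; this sketch simply skips it.
            }
        }
        return found;
    }
}
```

For example, XmlCharScan.findInvalid("a\u000Bb &#xB; &#11; &#65;") returns three entries with the value 11: one for the raw character and one for each NCR that refers to U+000B. The reference &#65; is a valid character and is not reported.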

Parameters

Replacement string: Enter the string used to replace each invalid character. This string must be a valid Java Formatter format string. The argument passed to it is always the integer value of the Unicode code point. For example, to replace the invalid character with its hexadecimal representation placed between braces, use: "{%X}".

To remove the invalid characters, leave the replacement string empty.

Here are more examples. Note that some characters, such as %, have a special function in this syntax and may need to be escaped (a literal % is written %%).

Invalid Character    Replacement String    Result
U+000B               x%03X                 x00B
U+000B               _#x%X;                _#xB;
U+000B               ?                     ?
U+000B               �                     �
U+000B               %%%d%%                %11%
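The expansion of a replacement string can be tried directly with String.format, which uses java.util.Formatter. The sketch below is illustrative only: the class ReplacementStringDemo and the method expand are hypothetical names, and the use of Locale.ROOT is an assumption made to keep the output independent of the default locale.

```java
import java.util.Locale;

// Illustrative only: shows how java.util.Formatter expands a replacement string.
final class ReplacementStringDemo {

    private ReplacementStringDemo() {}

    /** Expands a replacement string, passing the code point as the single integer argument. */
    static String expand(String replacement, int codePoint) {
        return String.format(Locale.ROOT, replacement, codePoint);
    }

    public static void main(String[] args) {
        int vt = 0x0B; // U+000B, a control character that is invalid in XML 1.0
        System.out.println(expand("{%X}", vt));    // prints {B}
        System.out.println(expand("x%03X", vt));   // prints x00B (zero-padded to three digits)
        System.out.println(expand("_#x%X;", vt));  // prints _#xB;
        System.out.println(expand("?", vt));       // prints ? (the argument is ignored)
        System.out.println(expand("%%%d%%", vt));  // prints %11% (%% is a literal percent sign)
        System.out.println(expand("", vt).isEmpty()); // prints true (empty string removes the character)
    }
}
```

A replacement string without any format specifier, such as "?", is returned unchanged because Formatter ignores unused arguments.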

Limitations

None known.