Encoding Conversion Step

Overview

This step converts the character set encoding of an input document and, in some cases, updates its encoding declaration.

Takes: Raw document. Sends: Raw document.

Updating Encoding Declarations

The step updates the encoding declarations of XML and HTML documents when the conditions described below are met.

The following algorithms are run on the first 1024 characters of each document. Note that the routine is not XML- or HTML-aware and cannot distinguish between normal XML/HTML code and XML/HTML comments, so a declaration found inside a comment is treated like any other.

Detection of XML Documents

  • If an XML encoding declaration is detected:
    • The encoding value is updated.
  • Otherwise, if an XML declaration is detected:
    • An encoding declaration is inserted inside the XML declaration just after the version.
  • Otherwise, if the name of the file ends with a .xml extension:
    • An XML declaration (with an encoding declaration of the output encoding) is added at the top of the document.
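The three XML rules can be sketched as a small function. This is only an illustration in Python, not the step's actual code: the function name, the regular expressions, and the newline added after an inserted declaration are all assumptions.

```python
import re

# Hypothetical patterns; the step's real matching rules may differ.
XML_ENCODING = re.compile(r'(<\?xml[^>]*?\sencoding\s*=\s*["\'])([^"\']*)(["\'])')
XML_VERSION = re.compile(r'(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)')


def update_xml_declaration(text, filename, encoding):
    """Apply the three XML rules to the first 1024 characters only."""
    head, tail = text[:1024], text[1024:]
    if XML_ENCODING.search(head):
        # Rule 1: an encoding declaration exists, so update its value.
        head = XML_ENCODING.sub(
            lambda m: m.group(1) + encoding + m.group(3), head, count=1)
    elif XML_VERSION.search(head):
        # Rule 2: XML declaration without encoding; insert it after the version.
        head = XML_VERSION.sub(
            lambda m: m.group(1) + ' encoding="' + encoding + '"', head, count=1)
    elif filename.lower().endswith(".xml"):
        # Rule 3: no declaration at all, but the file name ends with .xml.
        # The trailing newline is an assumption of this sketch.
        head = '<?xml version="1.0" encoding="' + encoding + '"?>\n' + head
    return head + tail
```

Because the patterns are plain text matches, a declaration that sits inside a comment would be rewritten as well, which mirrors the limitation noted above.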

Detection of HTML Documents

  • If an HTML charset declaration is detected:
    • The charset value is updated.
  • Otherwise, if the name of the file ends with an extension that starts with .htm:
    • If a <head> element is found:
      • A charset declaration is added just after.
    • Otherwise, if an <html string is detected:
      • A <head> and a charset declaration are added after the first '>' after the string found.
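The HTML rules follow the same shape. The sketch below is an assumption-laden illustration: the function name, the patterns, the exact form of the inserted meta element, and the closing </head> added in the last branch are choices made for this example and are not taken from the step itself.

```python
import os
import re

# Hypothetical patterns; the step's real matching rules may differ.
HTML_CHARSET = re.compile(r'(<meta[^>]*?charset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE)
HEAD_TAG = re.compile(r'<head(?:\s[^>]*)?>', re.IGNORECASE)
HTML_TAG = re.compile(r'<html[^>]*>', re.IGNORECASE)


def update_html_charset(text, filename, encoding):
    """Apply the HTML rules to the first 1024 characters only."""
    head, tail = text[:1024], text[1024:]
    # Assumed form of the charset declaration that gets inserted.
    meta = ('<meta http-equiv="Content-Type" content="text/html; charset='
            + encoding + '">')
    extension = os.path.splitext(filename)[1].lower()
    if HTML_CHARSET.search(head):
        # A charset declaration exists, so update its value.
        head = HTML_CHARSET.sub(lambda m: m.group(1) + encoding, head, count=1)
    elif extension.startswith(".htm"):
        match = HEAD_TAG.search(head)
        if match:
            # Add the declaration just after the <head> start tag.
            head = head[:match.end()] + meta + head[match.end():]
        else:
            match = HTML_TAG.search(head)
            if match:
                # Add a <head> after the first '>' that follows "<html".
                # Closing the element with </head> is an assumption.
                head = (head[:match.end()] + "<head>" + meta + "</head>"
                        + head[match.end():])
    return head + tail
```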

Note that a document can be both an XML and an HTML document and have both types of encoding/charset declarations.

Parameters

Input Tab

Un-escape the following notations

Numeric character references — Set this option to un-escape all types of numeric character references (NCRs) when reading the input document. For example: &#225;, &#xE1; and &#xe1; will be un-escaped to 'á'. If this option is not set, any NCR in the input document will remain in the exact same form in the output document. For more on NCRs, see http://en.wikipedia.org/wiki/Numeric_character_reference.
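As a rough illustration of what un-escaping NCRs involves, the Python sketch below handles decimal and hexadecimal forms (with either letter case). The function name and pattern are hypothetical and do not come from the step's code.

```python
import re

# Matches &#NNN; (decimal) and &#xHHH; (hexadecimal, any letter case).
NCR = re.compile(r'&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));')


def unescape_ncr(text):
    def replace(match):
        if match.group(1) is not None:
            code = int(match.group(1), 16)
        else:
            code = int(match.group(2))
        # Leave references beyond the Unicode range untouched.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return NCR.sub(replace, text)
```

For example, `unescape_ncr("&#225; &#xE1; &#xe1;")` yields three copies of 'á', since 225 in decimal and E1 in hexadecimal are the same code point.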

Character entity references — Set this option to un-escape the standard HTML character entity references (CERs) when reading the input document. For example: &aacute; will be un-escaped to 'á'. If this option is not set, any CER in the input document will remain in the exact same form in the output document. For more on CER, see http://en.wikipedia.org/wiki/Character_entity_reference.

Note that the character entity references &amp;, &lt;, &gt;, &apos; and &quot; are not un-escaped, as they may need to be preserved regardless of the encoding in HTML and XML documents.
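A minimal Python sketch of this behavior is shown below. It assumes the HTML 4 entity table from Python's standard `html.entities` module stands in for "the standard HTML character entity references"; the function name and the `PRESERVED` set are hypothetical.

```python
import re
from html.entities import name2codepoint

# The five references that are kept as-is, per the note above.
PRESERVED = {"amp", "lt", "gt", "apos", "quot"}
CER = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def unescape_cer(text):
    def replace(match):
        name = match.group(1)
        if name in PRESERVED or name not in name2codepoint:
            # Preserved or unknown entity: keep the original text.
            return match.group(0)
        return chr(name2codepoint[name])
    return CER.sub(replace, text)
```

With this sketch, `caf&eacute; &amp; &lt;b&gt;` becomes `café &amp; &lt;b&gt;`: the accented letter is restored while the markup-significant references stay escaped.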

Java-style escape notation — Set this option to un-escape the Java-style escaped characters when reading the input document. For example: \u00e1 and \u00E1 will be un-escaped to 'á'. If this option is not set, all Java-style escaped characters in the input document will remain in the exact same form in the output document.
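The Java-style case can be sketched the same way. The function name below is hypothetical, and the sketch converts each \uXXXX sequence on its own, so a surrogate pair written as two escapes is not recombined into a single character here.

```python
import re

# A backslash, the letter u, then exactly four hexadecimal digits.
JAVA_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')


def unescape_java(text):
    return JAVA_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```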

Output Tab

What characters should be escaped

Only the characters un-supported by the output encoding — Select this option to escape only characters not supported by the output encoding.

All extended characters — Select this option to escape all extended characters.
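The difference between the two choices can be illustrated with a short Python sketch. It assumes that "extended" means any character above U+007F, which this page does not state explicitly; the function and parameter names are hypothetical.

```python
def needs_escape(ch, encoding, escape_all_extended=False):
    """Decide whether a single character should be escaped on output."""
    if escape_all_extended:
        # Assumption: "extended" means outside the 7-bit ASCII range.
        return ord(ch) > 0x7F
    try:
        ch.encode(encoding)
        return False
    except UnicodeEncodeError:
        # The output encoding cannot represent this character.
        return True
```

Under these assumptions, 'á' written to ISO-8859-1 is left alone with the first option but escaped with the second, while '日' is escaped in both cases.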

Escape notation to use

Uppercase hexadecimal numeric character reference — Select this option to use an uppercase hexadecimal NCR when escaping a character. For example 'á' will become &#xE1;.

Lowercase hexadecimal numeric character reference — Select this option to use a lowercase hexadecimal NCR when escaping a character. For example 'á' will become &#xe1;.

Decimal numeric character reference — Select this option to use a decimal NCR when escaping a character. For example 'á' will become &#225;.

Character entity references — Select this option to use a character entity reference when escaping a character. For example 'á' will become &aacute;. If there is no corresponding entity defined for the character to escape, the uppercase hexadecimal NCR form is used instead.

Uppercase Java-style notation — Select this option to use an uppercase Java-style notation when escaping a character. For example 'á' will become \u00E1.

Lowercase Java-style notation — Select this option to use a lowercase Java-style notation when escaping a character. For example 'á' will become \u00e1.
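The six built-in notations above can be compared side by side with a small Python sketch. The notation keys are hypothetical names invented for this example, the entity table is the HTML 4 one from Python's `html.entities`, and emitting a surrogate pair for characters above U+FFFF in the Java-style forms is an assumption about how such characters would be handled.

```python
from html.entities import codepoint2name


def escape_char(ch, notation):
    """Escape one character using one of the notations listed above."""
    cp = ord(ch)
    if notation == "ncr_hex_upper":
        return "&#x%X;" % cp
    if notation == "ncr_hex_lower":
        return "&#x%x;" % cp
    if notation == "ncr_decimal":
        return "&#%d;" % cp
    if notation == "cer":
        name = codepoint2name.get(cp)
        # Fall back to the uppercase hexadecimal NCR when no entity exists.
        return "&%s;" % name if name else "&#x%X;" % cp
    if notation in ("java_upper", "java_lower"):
        fmt = "\\u%04X" if notation == "java_upper" else "\\u%04x"
        # Work on UTF-16 code units so characters above U+FFFF
        # come out as a surrogate pair (an assumption of this sketch).
        data = ch.encode("utf-16-be")
        units = [int.from_bytes(data[i:i + 2], "big")
                 for i in range(0, len(data), 2)]
        return "".join(fmt % unit for unit in units)
    raise ValueError("unknown notation: " + notation)
```

For 'á' (U+00E1, decimal 225) the results are &#xE1;, &#xe1;, &#225;, &aacute;, \u00E1 and \u00e1 respectively.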

User-defined notation — Select this option to use a customized format when escaping a character. The user-defined format must be a Java format expression that takes an integer as the value to display. For example, the expression [[%d]] for 'á' will give [[225]], and the expression \'%x for 日本語 will give \'65e5\'672c\'8a9e. The value displayed is the Unicode code point of the character to escape.

Use the bytes values — Set this option when you want the values applied to the user-defined expression to be the byte values of the character in the output encoding rather than its Unicode code point. For example, the expression \'%x for 日本語 will give \'93\'fa\'96\'7b\'8c\'ea if this option is set and the output encoding is Shift-JIS.
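The two modes of the user-defined notation can be reproduced with the Python sketch below. It relies on Python's `%` operator, which gives the same result as a Java format expression for the simple `%d` and `%x` conversions used here; other conversions may behave differently. The function name and the `byte_encoding` parameter are hypothetical.

```python
def escape_user_defined(text, fmt, byte_encoding=None):
    """Format each code point, or each output byte, with fmt."""
    if byte_encoding is None:
        # Default: one value per character, its Unicode code point.
        values = [ord(ch) for ch in text]
    else:
        # Byte mode: one value per byte in the output encoding.
        values = list(text.encode(byte_encoding))
    return "".join(fmt % value for value in values)


print(escape_user_defined("á", "[[%d]]"))
print(escape_user_defined("日本語", "\\'%x"))
print(escape_user_defined("日本語", "\\'%x", byte_encoding="shift_jis"))
```

The three calls print [[225]], \'65e5\'672c\'8a9e and \'93\'fa\'96\'7b\'8c\'ea, matching the examples in the two options above.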

Miscellaneous

Use Byte-Order-Mark for UTF-8 output — Set this option to add a Byte-Order-Mark (BOM) for UTF-8 output. For more information on the BOM see http://www.unicode.org/faq/utf_bom.html.
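In terms of bytes, the option only decides whether the three-byte sequence EF BB BF precedes the encoded text. A minimal Python sketch, with a hypothetical function name:

```python
import codecs


def encode_utf8(text, use_bom):
    """Encode text as UTF-8, optionally prefixed with the BOM."""
    data = text.encode("utf-8")
    # codecs.BOM_UTF8 is the byte string b'\xef\xbb\xbf'.
    return codecs.BOM_UTF8 + data if use_bom else data
```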

List characters not supported by the output encoding — Set this option to list in the Log all characters not supported by the output encoding.
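One way such a list could be produced is sketched below in Python. The function name and the U+XXXX report format are assumptions for this example; the actual Log output of the step may look different.

```python
def list_unsupported(text, encoding):
    """Return one report line per distinct unsupported character."""
    seen = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in seen:
                seen.append(ch)
    # Hypothetical report format: code point followed by the character.
    return ["U+%04X '%s'" % (ord(ch), ch) for ch in seen]
```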