XML Stream Filter

From Okapi Framework

Overview

The XML Stream Filter is an Okapi component that implements the IFilter interface for XML documents. It uses a stream parser, which allows it to process much larger documents than a filter built on a DOM parser, such as the XML Filter. If you need ITS support, use the XML Filter instead.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, the declared encoding is used.
  • Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
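
The two rules above can be sketched in Python. This is only an illustration of the decision logic, not the filter's actual code: the function name choose_input_encoding and the caller_default parameter are hypothetical, and the sketch assumes the start of the document is ASCII-compatible (it does not handle UTF-16 input).

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data (assumes an ASCII-compatible byte stream).
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the declared encoding, or UTF-8 when nothing is declared.

    `caller_default` (hypothetical) stands for the default encoding given
    when opening the document; it is deliberately ignored, as described above.
    """
    # Skip a UTF-8 byte-order mark so the declaration is found at offset 0.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _DECLARATION.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"
```

For example, a document starting with an ISO-8859-1 declaration is read as ISO-8859-1, while a document with no declaration is read as UTF-8 even if the caller asked for windows-1252.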

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
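
As a minimal sketch, the Byte-Order-Mark rule for UTF-8 output reduces to a single boolean decision. The function name and parameters below are hypothetical, and the way encoding names are normalized is an assumption made for the example.

```python
def utf8_output_has_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document starts with a Byte-Order-Mark."""
    # Treat "UTF-8", "utf8" and "utf_8" as the same encoding (assumption).
    normalized = input_encoding.replace("-", "").replace("_", "").lower()
    if normalized == "utf8":
        # UTF-8 in, UTF-8 out: keep a BOM only if the input had one.
        return input_had_bom
    # Any other input encoding: never write a BOM.
    return False
```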

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.

Line-Breaks

The output uses the same type of line-break as the original input.
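
One way to picture this behavior is the sketch below. It assumes the type is taken from the first line-break found in the input, which is an assumption of this example rather than something documented for the filter; both function names are hypothetical.

```python
import re

_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def detect_line_break(text: str, fallback: str = "\n") -> str:
    """Return the first line-break found in the text, or the fallback."""
    match = _LINE_BREAK.search(text)
    return match.group(0) if match else fallback


def apply_line_break(text: str, line_break: str) -> str:
    """Rewrite every line-break in the text to the given type."""
    return _LINE_BREAK.sub(lambda _match: line_break, text)
```

When trying this on a real file, open it with newline="" so that Python does not translate the line-breaks while reading.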

Quote Mode

Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:

quoteModeDefined: true
quoteMode: 3

Current quote modes:

  • Do not escape single or double quotes: UNESCAPED = 0
  • Escape single and double quotes to a named entity: ALL = 1
  • Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
  • Escape double quotes only: DOUBLE_QUOTES_ONLY = 3
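
The effect of the four modes can be illustrated with a short Python sketch. The constant names and values come from the list above; the function escape_quotes is hypothetical, and the exact entity strings (&quot;, &apos; and &#39;) are assumptions, since the list does not spell them out.

```python
UNESCAPED = 0
ALL = 1
NUMERIC_SINGLE_QUOTES = 2
DOUBLE_QUOTES_ONLY = 3


def escape_quotes(text: str, quote_mode: int) -> str:
    """Escape quote characters according to the given quote mode."""
    if quote_mode == UNESCAPED:
        return text
    if quote_mode not in (ALL, NUMERIC_SINGLE_QUOTES, DOUBLE_QUOTES_ONLY):
        raise ValueError(f"unknown quote mode: {quote_mode}")
    # Modes 1, 2 and 3 all escape double quotes (assumed entity: &quot;).
    result = text.replace('"', "&quot;")
    if quote_mode == ALL:
        # Named entity for the apostrophe (assumed: &apos;).
        result = result.replace("'", "&apos;")
    elif quote_mode == NUMERIC_SINGLE_QUOTES:
        # Numeric entity for the apostrophe (assumed: &#39;).
        result = result.replace("'", "&#39;")
    return result
```

With quoteMode: 3, as in the configuration lines above, apostrophes are left as they are and only double quotes are escaped.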

Parameters

This filter uses the same type of YAML-based configuration as the HTML Filter. An editor for creating and modifying its configuration is under construction. All parameters described in the HTML Filter's documentation are also available in the XML Stream Filter, with the exception of escapeCharacters.

Additional Filtering of CDATA Content

Some XML contains additional content within CDATA sections. The default behavior of the XML Stream Filter is to expose CDATA content directly for translation. However, in many cases the CDATA content is in another format (often HTML) that requires additional markup processing prior to translation. The global_cdata_subfilter parameter specifies an additional filter that will be applied to all CDATA content. The value of this option should be the name of another filter configuration. For example, to process CDATA content as HTML, use the option:

 global_cdata_subfilter: okf_html
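
To see why a subfilter helps, the Python sketch below mimics the two stages with standard-library parsers: it first collects the CDATA sections of an XML document, then parses each one as HTML so that only the text, not the markup, is exposed. It is an illustration only; it uses a DOM parser for brevity (the filter itself is stream-based) and all names in it are hypothetical.

```python
from html.parser import HTMLParser
from xml.dom import Node, minidom


class TextCollector(HTMLParser):
    """Collect the non-blank text chunks of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data)


def cdata_sections(xml_text: str) -> list:
    """Return the raw content of every CDATA section in the document."""
    found = []

    def walk(node):
        if node.nodeType == Node.CDATA_SECTION_NODE:
            found.append(node.data)
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found


def translatable_text(html_fragment: str) -> list:
    """Second stage: parse the fragment as HTML and keep only its text."""
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.chunks
```

Without the second stage, the tags inside the CDATA section would be presented to the translator as plain text.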

Additional Filtering of PCDATA Content

Some XML contains additional content in another format directly as PCDATA, that is, within its element content. Most frequently, this is HTML content that has been escaped an additional time. An example of this type of content might look like:

<test>
 &lt;p&gt;This is embedded HTML content within a p tag.&lt;/p&gt;
</test>

This content must first be extracted as XML content and then processed with an additional filter before being exposed for translation. The global_pcdata_subfilter parameter specifies an additional filter that will be applied to such content. The value of this option should be the name of another filter configuration. For example, to process PCDATA content as HTML, use the option:

global_pcdata_subfilter: okf_html
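
The two-stage processing of escaped content can be illustrated with Python's standard-library parsers. This is a sketch of the idea only, not of the filter's implementation, and the helper names are hypothetical: the XML parser first resolves the escaped entities in the element content, and the resulting string is then parsed as HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-blank text chunks of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data)


def extract_pcdata(xml_text: str, element_name: str) -> list:
    """First stage: read element content as XML, which unescapes it once."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.iter(element_name)]


def translatable_text(html_fragment: str) -> list:
    """Second stage: parse the unescaped content as HTML, keep the text."""
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.chunks


source = """<test>
 &lt;p&gt;This is embedded HTML content within a p tag.&lt;/p&gt;
</test>"""

for fragment in extract_pcdata(source, "test"):
    print(translatable_text(fragment))
```

After the first stage the content of the test element contains a real p tag; after the second stage only the sentence inside it remains.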

Note: The global_pcdata_subfilter option doesn't pass all PCDATA to the subfilter. Only content that has been matched as part of a TEXTUNIT tag rule will be passed to the subfilter. Content that is matched with an INCLUDE rule will not be passed to the subfilter. So to process the example above, you would need to make sure that the test element matches a TEXTUNIT rule:

global_pcdata_subfilter: okf_html
elements:
 test:
   ruleTypes: [TEXTUNIT]

See the HTML Filter documentation for more information about tag rules.

Limitations

  • There is no transparent support for namespace prefixes: you have to declare the element names with their prefixes in the configuration.
  • The filter is not case-sensitive (e.g. the elements <elem> and <Elem> are seen as identical), which does not conform to the XML specification.