XML Stream Filter
The XML Stream Filter is an Okapi component that implements the IFilter interface for XML documents. It uses a stream parser, which makes it possible to process much larger documents than a filter built on a DOM parser, such as the XML Filter. If you need ITS (Internationalization Tag Set) support, use the XML Filter instead.
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration, it is used.
- Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
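The two rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the filter's actual implementation: the function name `choose_input_encoding` and its `caller_default` parameter are hypothetical, and the sketch assumes the first bytes of the document are ASCII-compatible so the declaration can be read before the encoding is known.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="ISO-8859-1"?>
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

_UTF8_BOM = b"\xef\xbb\xbf"


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the encoding to use for a document whose first bytes are `head`.

    `caller_default` (hypothetical) stands for the default encoding given when
    the document was opened. It is deliberately ignored, mirroring the rule
    that UTF-8 is the fallback regardless of that default.
    """
    # Skip a UTF-8 byte-order mark so the declaration is at offset 0.
    if head.startswith(_UTF8_BOM):
        head = head[len(_UTF8_BOM):]
    match = _ENCODING_DECL.match(head)
    if match:
        # Rule 1: an encoding declaration wins.
        return match.group(1).decode("ascii")
    # Rule 2: no declaration, fall back to UTF-8.
    return "UTF-8"
```

For instance, a document that starts with `<?xml version="1.0"?>` and no encoding pseudo-attribute resolves to UTF-8 here even if the caller passed another default.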
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
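As a minimal sketch, the Byte-Order-Mark rule for UTF-8 output reduces to a single boolean condition. The function names below are hypothetical, and the sketch covers only the case described here (output encoding is UTF-8); it makes no claim about other output encodings.

```python
def _is_utf8(name: str) -> bool:
    """Loose, case-insensitive check for the common spellings of UTF-8."""
    return name.strip().lower().replace("_", "-") in ("utf-8", "utf8")


def utf8_output_needs_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document should start with a BOM.

    A BOM is written only when the input was also UTF-8 and carried one.
    """
    return _is_utf8(input_encoding) and input_had_bom
```

A UTF-8 input with a BOM therefore keeps it, while a UTF-16 input converted to UTF-8 is written without one, whether or not the input had a BOM.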
If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.
The output uses the same type of line break as the original input.
Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:
quoteModeDefined: true
quoteMode: 3
Current quote modes:
- Do not escape single or double quotes: UNESCAPED = 0
- Escape single and double quotes to a named entity: ALL = 1
- Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
- Escape double quotes only: DOUBLE_QUOTES_ONLY = 3
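To make the four modes concrete, here is a small Python sketch of what each one does to a string. It is illustrative only: the enum mirrors the names and values listed above, but `escape_quotes` is a hypothetical helper, and the exact entities are assumptions (`&quot;` and `&apos;` are the XML predefined named entities; `&#39;` is assumed as the numeric form for the apostrophe).

```python
from enum import IntEnum


class QuoteMode(IntEnum):
    UNESCAPED = 0
    ALL = 1
    NUMERIC_SINGLE_QUOTES = 2
    DOUBLE_QUOTES_ONLY = 3


def escape_quotes(text: str, mode: QuoteMode) -> str:
    """Escape quote characters in `text` according to `mode`."""
    if mode == QuoteMode.UNESCAPED:
        return text
    # Every remaining mode escapes double quotes to the named entity.
    text = text.replace('"', "&quot;")
    if mode == QuoteMode.ALL:
        return text.replace("'", "&apos;")   # named entity
    if mode == QuoteMode.NUMERIC_SINGLE_QUOTES:
        return text.replace("'", "&#39;")    # numeric entity (assumed form)
    return text  # DOUBLE_QUOTES_ONLY: apostrophes are left as they are
```

With `quoteMode: 3`, for example, only the double quotes in a string such as `say "it's"` would be rewritten, and the apostrophe would pass through unchanged.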
This filter uses the same type of YAML-based configuration as the HTML Filter. An editor for creating and modifying its configuration is under construction. All parameters described in the HTML Filter's documentation are also available in the XML Stream Filter, with the exception of
Additional Filtering of CDATA Content
Some XML documents carry additional content within CDATA sections. By default, the XML Stream Filter exposes CDATA content directly for translation. However, in many cases the CDATA content is in another format (often HTML) that requires additional markup processing prior to translation. The global_cdata_subfilter parameter specifies an additional filter that is applied to all CDATA content. The value of this option should be the name of another filter configuration. For example, to process CDATA content as HTML, use the option:
global_cdata_subfilter: okf_html
Additional Filtering of PCDATA Content
Some XML contains additional content in another format directly as PCDATA, that is, within its element content. Most frequently, this is HTML content that has been escaped an additional time. Such content might look like this:
<test> &lt;p&gt;This is embedded HTML content within a p tag.&lt;/p&gt; </test>
This content must first be extracted as XML content and then processed with an additional filter before being exposed for translation. The global_pcdata_subfilter parameter specifies an additional filter that is applied to such content. The value of this option should be the name of another filter configuration. For example, to process PCDATA content as HTML, use the option:
global_pcdata_subfilter: okf_html
Note that the global_pcdata_subfilter option doesn't pass all PCDATA to the subfilter. Only content that has been matched as part of a TEXTUNIT tag rule is passed to the subfilter. Content that is matched with an INCLUDE rule is not passed to the subfilter. So to process the example above, you would need to make sure that the test tag matches a TEXTUNIT rule:
global_pcdata_subfilter: okf_html
elements:
  test:
    ruleTypes: [TEXTUNIT]
See the HTML Filter documentation for more information about tag rules.
- There is no transparent support for namespace prefixes: you have to declare element names with their prefixes in the configuration.
- The filter is not case-sensitive (e.g. <Elem> and any other capitalization of that name are seen as identical elements), which does not match the XML specification, where names are case-sensitive.