XML Stream Filter
Overview
The XML Stream Filter is an Okapi component that implements the IFilter interface for XML documents. It uses a stream parser, which lets it process much larger documents than a filter built on a DOM parser, such as the XML Filter. If you need to use ITS (Internationalization Tag Set), use the XML Filter instead.
Processing Details
Input Encoding
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration, that encoding is used.
- Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
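The two rules above can be sketched as a small standalone function. This is only an illustration, not the filter's actual code: the name choose_input_encoding and its parameters are hypothetical, and the sketch assumes the start of the document is ASCII-compatible so that the declaration can be read directly from the raw bytes.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, for example:
# <?xml version="1.0" encoding="ISO-8859-1"?>
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the encoding to use for a document that starts with `head`.

    `caller_default` stands for the default encoding given when the document
    was opened. It is deliberately ignored, mirroring the second rule.
    """
    # Skip a UTF-8 Byte-Order-Mark so the declaration is at position 0.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _DECLARATION.match(head)
    if match:
        # Rule 1: a declared encoding wins.
        return match.group(1).decode("ascii")
    # Rule 2: no declaration, so fall back to UTF-8.
    return "UTF-8"
```

Called with b'<?xml version="1.0"?><a/>' and any caller default, the function still returns "UTF-8", which is the behavior the second rule describes.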
Output Encoding
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.
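The Byte-Order-Mark rules for UTF-8 output reduce to a single condition. The sketch below is a hypothetical helper (utf8_output_uses_bom is not an Okapi function) and covers only the case where the output encoding is UTF-8, since that is the only case described here.

```python
import codecs


def utf8_output_uses_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document starts with a Byte-Order-Mark."""
    # codecs.lookup normalizes aliases such as "utf8" and "UTF-8" to "utf-8".
    input_is_utf8 = codecs.lookup(input_encoding).name == "utf-8"
    # A BOM is written only when the input was UTF-8 and carried one itself.
    return input_is_utf8 and input_had_bom
```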
Line-Breaks
The output uses the same type of line-break as the original input.
Quote Mode
Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:
quoteModeDefined: true
quoteMode: 3
Current quote modes:
- Do not escape single or double quotes: UNESCAPED = 0
- Escape single and double quotes to a named entity: ALL = 1
- Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
- Escape double quotes only: DOUBLE_QUOTES_ONLY = 3
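To make the four modes concrete, here is a minimal sketch of what each one does to a string. The mode names and numbers come from the list above; the function escape_quotes is hypothetical, and the specific entities (&quot;, &apos;, and &#39; as the numeric form) are assumptions about what "named" and "numeric" mean here, not confirmed filter output.

```python
from enum import IntEnum


class QuoteMode(IntEnum):
    UNESCAPED = 0
    ALL = 1
    NUMERIC_SINGLE_QUOTES = 2
    DOUBLE_QUOTES_ONLY = 3


def escape_quotes(text: str, mode: QuoteMode) -> str:
    """Escape quote characters in `text` according to `mode`."""
    if mode == QuoteMode.UNESCAPED:
        return text
    # Every other mode escapes double quotes to the named entity.
    out = text.replace('"', "&quot;")
    if mode == QuoteMode.ALL:
        out = out.replace("'", "&apos;")  # assumed named entity
    elif mode == QuoteMode.NUMERIC_SINGLE_QUOTES:
        out = out.replace("'", "&#39;")  # assumed numeric entity
    # DOUBLE_QUOTES_ONLY leaves apostrophes untouched.
    return out
```

With quoteMode: 3, for example, only the double quotes in a string such as say "it's" would be rewritten and the apostrophe would stay as-is.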
Parameters
This filter uses the same type of YAML-based configuration as the HTML Filter. An editor for creating and modifying its configuration is under construction. All parameters described in the HTML Filter's documentation are also available in the XML Stream Filter, with the exception of escapeCharacters.
Additional Filtering of CDATA Content
Some XML contains additional content within CDATA sections. The default behavior of the XML Stream Filter is to expose CDATA content directly for translation. However, in many cases the CDATA content is in another format (often HTML) that requires additional markup processing prior to translation. The global_cdata_subfilter parameter specifies an additional filter that will be applied to all CDATA content. The value of this option should be the name of another filter configuration. For example, to process CDATA content as HTML, use the option:
global_cdata_subfilter: okf_html
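The following sketch shows why a second pass is useful. It does not use Okapi at all: it relies on Python's standard xml.etree.ElementTree and html.parser modules to imitate the two stages, and the names MarkupSplitter and split_cdata_html, as well as the sample document, are made up for the illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Separates tags from text in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def split_cdata_html(xml_doc: str):
    """Return (raw CDATA content, HTML tag names, text without markup)."""
    # Stage 1: the XML parser hands back the CDATA content verbatim.
    cdata = ET.fromstring(xml_doc).text or ""
    # Stage 2: an HTML-aware pass separates markup from translatable text.
    splitter = MarkupSplitter()
    splitter.feed(cdata)
    splitter.close()
    return cdata, splitter.tags, "".join(splitter.text)


raw, tags, text = split_cdata_html(
    "<item><![CDATA[Click <b>Save</b> to continue.]]></item>"
)
```

After the first stage, raw still contains the <b> markup mixed with the text, which is roughly what a translator would see without a subfilter. Only the second stage separates the tag from the sentence.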
Additional Filtering of PCDATA Content
Some XML contains additional content in another format directly as PCDATA, that is, within its element content. Most frequently, this is HTML content that has been escaped an additional time. An example of this type of content might look like:
<test>
&lt;p&gt;This is embedded HTML content within a p tag.&lt;/p&gt;
</test>
This content must first be extracted as XML content and then processed with an additional filter before being exposed for translation. The global_pcdata_subfilter parameter specifies an additional filter that will be applied to such content. The value of this option should be the name of another filter configuration. For example, to process PCDATA content as HTML, use the option:
global_pcdata_subfilter: okf_html
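A quick way to see what "escaped an additional time" means is to parse such a document with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree rather than Okapi, and the helper name element_text is hypothetical. It shows that escaped PCDATA and a CDATA section carrying the same HTML yield identical character data once the XML layer has been parsed.

```python
import xml.etree.ElementTree as ET


def element_text(xml_doc: str) -> str:
    """Return the character data of the root element after XML parsing."""
    return ET.fromstring(xml_doc).text or ""


escaped_pcdata = (
    "<test>&lt;p&gt;This is embedded HTML content within a p tag.&lt;/p&gt;</test>"
)
cdata_form = (
    "<test><![CDATA[<p>This is embedded HTML content within a p tag.</p>]]></test>"
)

# XML parsing removes one level of escaping, leaving plain HTML markup.
html_fragment = element_text(escaped_pcdata)
```

The value of html_fragment is an HTML string with real <p> tags, so it still needs HTML-aware processing before translation. That second step is the job the subfilter takes on.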
Note: The global_pcdata_subfilter option doesn't pass all PCDATA to the subfilter. Only content that has been matched as part of a TEXTUNIT tag rule will be passed to the subfilter. Content that is matched with an INCLUDE rule will not be passed to the subfilter. So to process the example above, you would need to make sure that the test tag matches a TEXTUNIT rule:
global_pcdata_subfilter: okf_html
elements:
  test:
    ruleTypes: [TEXTUNIT]
See the HTML Filter documentation for more information about tag rules.
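The selection rule described in the note can be modeled as a simple predicate. This is a sketch of the documented behavior only, not of the filter's internals: the function goes_to_pcdata_subfilter is hypothetical, the configuration is represented as a plain dictionary mirroring the YAML keys, and the note element is an invented example of an INCLUDE rule.

```python
def goes_to_pcdata_subfilter(config: dict, element: str) -> bool:
    """Return True if the element's PCDATA would be sent to the subfilter."""
    # Without a configured subfilter, nothing is passed on.
    if not config.get("global_pcdata_subfilter"):
        return False
    rule = config.get("elements", {}).get(element, {})
    # Only elements matched by a TEXTUNIT rule qualify.
    return "TEXTUNIT" in rule.get("ruleTypes", [])


config = {
    "global_pcdata_subfilter": "okf_html",
    "elements": {
        "test": {"ruleTypes": ["TEXTUNIT"]},
        "note": {"ruleTypes": ["INCLUDE"]},  # hypothetical element
    },
}
```

With this configuration, content of test is passed to the subfilter, while content of note and of any element without a rule is not.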
Limitations
- There is no transparent support for namespace prefixes: you have to declare element names with their prefixes in the configuration.
- The filter is not case-sensitive: for example, the elements <elem> and <Elem> are seen as identical, which is not the case according to the XML specification.
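Taken together, the two limitations mean that rules are matched on the literal prefixed name, ignoring case. The sketch below models that matching with a hypothetical find_rule function and an invented x:para element. It illustrates the described behavior and is not taken from the filter's code.

```python
def find_rule(elements: dict, qname: str):
    """Look up a rule by literal prefixed name, ignoring case."""
    wanted = qname.lower()
    for name, rule in elements.items():
        if name.lower() == wanted:
            return rule
    return None


# Hypothetical configuration: the rule is declared with its prefix.
elements = {"x:para": {"ruleTypes": ["TEXTUNIT"]}}
```

Under this model, X:Para matches the rule because case is ignored, while para and y:para do not, even if y happens to be bound to the same namespace URI as x in the document.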