XML Filter
Overview
This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement ITS. If you need to process very large XML documents and have no need for ITS, you may want to look at using the XML Stream Filter.
The following is an example of a simple XML document. The translatable text is highlighted. Because each XML-based format is different, you need to know which parts are translatable, which elements are inline, and so on. The XML Filter implements the ITS W3C Recommendation to address this issue.
<?xml version="1.0" encoding="utf-8"?>
<myDoc>
 <prolog>
  <author>Zebulon Fairfield</author>
  <version>version 12, revision 2 - 2006-08-14</version>
  <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
  <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
 </prolog>
 <body>
  <title>Appaloosa</title>
  <p>The Appaloosas are rugged horses originally breed by the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
  <p>They are often characterized by their spotted coats.</p>
 </body>
</myDoc>
This filter is implemented in the class net.sf.okapi.filters.xml.XMLFilter of the Okapi library.
Processing Details
Input Encoding
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration, it is used.
- Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
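The two rules above can be sketched in a few lines of Python. This is only an illustration, not the filter's actual code: the function name `pick_input_encoding` and the `caller_default` parameter are hypothetical, and the sketch assumes the first bytes of the document are ASCII-compatible (it does not handle UTF-16 input).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very start
# of the document, with either double or single quotes around the value.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def pick_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the declared encoding if there is one, otherwise UTF-8.

    caller_default (hypothetical) stands for the default encoding given when
    opening the document; it is accepted only to show that it is ignored.
    """
    # Skip a UTF-8 Byte-Order-Mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _DECLARATION.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"
```

For instance, a document starting with `<?xml version="1.0" encoding="iso-8859-1"?>` yields `iso-8859-1`, while a document with no declaration yields `UTF-8` whatever default the caller passed.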
Output Encoding
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
If the original document had an XML encoding declaration, it is updated. If it did not, one is automatically added.
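As a rough sketch, the Byte-Order-Mark decision for UTF-8 output reduces to a single boolean expression. The function name and parameters below are hypothetical, and the sketch covers only the case where the output encoding is UTF-8, since that is the only case described here.

```python
def utf8_output_uses_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document should start with a BOM.

    Assumes the output encoding is UTF-8; other output encodings are out of scope.
    """
    # Normalize common spellings such as "utf-8", "UTF8" or "utf_8".
    normalized = input_encoding.lower().replace("-", "").replace("_", "")
    # A BOM is written only when the input was UTF-8 and had one itself.
    return normalized == "utf8" and input_had_bom
```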
Line-Breaks
The output uses the same type of line-break as the original input.
Parameters
This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "How to Create a Custom Configuration for the XML Filter".
ITS Support
By default, the filter processes XML documents based on the ITS defaults. That is:
- the content of all elements is translatable,
- and none of the attribute values is translatable.
Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is an ITS document.
The Internationalization Tag Set (ITS) is a W3C Recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as nested sub-flows of text, and much more.
The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0):
- The ITS 1.0 specification is available at http://www.w3.org/TR/its/.
- The ITS 2.0 specification is available at http://www.w3.org/TR/its20/.
See the "ITS" page for more details on the format.
The filter supports global and local rules and most data categories. See the ITS Components page for a detailed list of how the data categories are supported and other information on the implementation.
ITS Extensions
The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.
idValue and xml:id
When the attribute xml:id is found on a translatable element, it is used as the name of the text unit generated for that element.
In the example below, the resource name associated with the text unit for the <p> element is "id1".
<p xml:id="id1">Text</p>
The attribute idValue, used in the ITS translateRule element, allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of idValue must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.
In the example below, the resource name associated with the text unit for the <p> element is "id1":
<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p name="id1">text 1</p>
</doc>
Note that xml:id has precedence over the idValue declaration. In the example below, the resource name associated with the text unit for the <p> element is "xid1", not "id1".
<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p xml:id="xid1" name="id1">text 1</p>
</doc>
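The precedence rule can be illustrated with Python's standard ElementTree module. This is a simplified sketch rather than the filter's implementation: the helper `resource_name` is hypothetical, and instead of evaluating a real XPath expression it handles only the simple case where idValue points at an attribute of the element (here, `name`).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resource_name(element, id_attribute="name"):
    """Return xml:id when present, otherwise the attribute named by the rule."""
    xml_id = element.get(XML_ID)
    if xml_id is not None:
        return xml_id  # xml:id wins over the idValue declaration
    return element.get(id_attribute)

doc = ET.fromstring(
    '<doc><p xml:id="xid1" name="id1">text 1</p><p name="id2">text 2</p></doc>'
)
names = [resource_name(p) for p in doc.iter("p")]
```

Here `names` holds "xid1" for the first paragraph (its xml:id) and "id2" for the second one, which has no xml:id.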
You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.
For example, in the file below, the two elements <text> and <desc> are translatable, but they have only one corresponding ID, the name attribute in their parent element. To make sure you have a unique identifier for both the content of <text> and the content of <desc>, you can use the rules set in the example. The XPath expression "concat(../@name, '_t')" will give the ID "id1_t" and the expression "concat(../@name, '_d')" will give the ID "id1_d".
<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
  <its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
 </its:rules>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>
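To see what IDs such rules produce, the following Python sketch emulates the two concat() expressions by hand. Python's standard ElementTree does not implement XPath functions such as concat(), so the parent lookup and the concatenation are written out explicitly; the `SUFFIXES` table and the `build_ids` helper are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Element name -> suffix, mirroring concat(../@name, '_t') and concat(../@name, '_d').
SUFFIXES = {"text": "_t", "desc": "_d"}

def build_ids(xml_source):
    """Map each generated ID to the content of the element it identifies."""
    root = ET.fromstring(xml_source)
    ids = {}
    for parent in root.iter():
        for child in parent:
            suffix = SUFFIXES.get(child.tag)
            if suffix is not None:
                # A missing name attribute contributes an empty string,
                # as concat() would do with an empty node-set.
                ids[(parent.get("name") or "") + suffix] = child.text
    return ids

ids = build_ids(
    '<doc><msg name="id1"><text>Value of text</text>'
    '<desc>Value of desc</desc></msg></doc>'
)
```

With the sample document, `ids` contains the keys "id1_t" and "id1_d", one per translatable element.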
whiteSpaces
The extension attribute whiteSpaces allows you to apply globally the equivalent of a local xml:space attribute.
For example, if you have a format where all <pre> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute whiteSpaces="preserve" in a <its:translateRule> element for the <pre> elements. In the example below, the spaces in the <pre> element will be preserved on extraction.
<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre>Some txt with many spaces. </pre>
</doc>
Note that the xml:space attribute has precedence over whiteSpaces. In the following example, the white spaces in the content of <pre> may not be preserved because the attribute xml:space has the value default:
<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre xml:space="default">Some txt with many spaces. </pre>
</doc>
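A minimal sketch of this precedence, again using ElementTree, is shown below. The function `preserve_whitespace` and its `rule_value` parameter are hypothetical, and the sketch looks only at the element itself: it does not model the inheritance of xml:space from ancestor elements.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def preserve_whitespace(element, rule_value=None):
    """Tell whether white spaces are preserved for the given element.

    rule_value stands for the whiteSpaces value of a matching global rule,
    or None when no rule applies.
    """
    local = element.get(XML_SPACE)
    if local is not None:
        return local == "preserve"  # the local xml:space attribute wins
    return rule_value == "preserve"
```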
Filter Options
In addition to ITS and the ITS extensions, the filter also supports options. These options use the namespace URI okapi-framework:xmlfilter-options.
When you use several options, they must be set in a single <okp:options> element, as shown below:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes" escapeQuotes="no" escapeGT="yes"/>
</its:rules>
The following options are available:
- lineBreakAsCode
- codeFinder
- omitXMLDeclaration
- escapeQuotes
- escapeGT
- escapeNbsp
- extractIfOnlyCodes
- inlineCdata
- extractUntranslatable
lineBreakAsCode
In some cases, the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with &#10; entity references. They need to be passed as inline codes.
By default this option is set to false.
To specify this, the filter uses the lineBreakAsCode extension attribute. This affects all the extracted content.
For example, the following code is an ITS document with the option to treat line-breaks as code. It can be used along with the example XML document listed after it.
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes"/>
</its:rules>
<doc>line 1&#10;line 2.</doc>
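Conceptually, treating line-breaks as codes means splitting the extracted content into text runs and code placeholders. The Python sketch below shows the idea only: the `(kind, value)` tuples and the `&#10;` code data are assumptions made for this illustration and do not reflect how the filter actually represents inline codes.

```python
def split_line_breaks(text):
    """Split content into ("text", ...) and ("code", ...) parts at each line-break."""
    parts = []
    for index, chunk in enumerate(text.split("\n")):
        if index > 0:
            # Each line-break becomes an inline code instead of literal text.
            parts.append(("code", "&#10;"))
        if chunk:
            parts.append(("text", chunk))
    return parts
```

With this sketch, the content "line 1", line-break, "line 2." becomes three parts: a text run, one code, and a second text run.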
codeFinder
You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables or HTML tags that need to be protected from modification and treated as codes. Use the codeFinder element for this.
In the following parameters file, the codeFinder element defines two rules:
- The first one (rule0) is "<(/?)\w[^>]*?>" and matches any XML-type tags (e.g. "<b>", "</b>", "<br/>")
- The second one (rule1) is "(#\w+?\#)|(%\d+?%)" and matches any word enclosed in # (e.g. "#VAR#") or number enclosed in % (e.g. "%1%").
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
</okp:codeFinder>
</its:rules>
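To check what the two rules capture, you can try them outside the filter. The sketch below uses Python's `re` module, which is an assumption made for convenience: the filter itself may use a different regular expression engine, and although these two patterns are simple enough to behave the same way in most engines, that is not guaranteed for every pattern. The helper `find_codes` is hypothetical.

```python
import re

# The two rules described above, in their unescaped form.
RULES = [
    r"<(/?)\w[^>]*?>",       # rule0: XML-type tags
    r"(#\w+?\#)|(%\d+?%)",   # rule1: #WORD# or %NUMBER% variables
]

# Combine the rules into a single alternation, as a simple way to apply them all.
CODE_FINDER = re.compile("|".join(f"(?:{rule})" for rule in RULES))

def find_codes(text):
    """Return the spans of text that the rules would treat as inline codes."""
    return [match.group(0) for match in CODE_FINDER.finditer(text)]

codes = find_codes("Press <b>OK</b> to delete #FILE# (%1% left)<br/>")
```

In this sample, the tags `<b>`, `</b>` and `<br/>` and the variables `#FILE#` and `%1%` are captured, while the surrounding words and parentheses stay as text.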
Some important details:
- Set useCodeFinder to "yes" to have the rules used. If the attribute is missing, its value is assumed to be "no".
- Make sure the first line of the <codeFinder> element content is #v1.
- Each entry in the content must be on a separate line.
- count.i=N must be before any rules and N must be the number of rules.
- ruleN must be incremented starting at 0.
- The pattern for a rule must be escaped for XML, for example: "<(/?)\w[^>]*?>" must be entered "&lt;(/?)\w[^&gt;]*?&gt;" in the parameters file.
- Do not put spaces before count.i or ruleN, and not after your expressions.
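The details above lend themselves to a small generator. The following Python sketch builds the content of the codeFinder element from a list of unescaped patterns, using the standard `xml.sax.saxutils.escape` function for the XML escaping. The helper name `code_finder_body` is hypothetical, and the layout simply follows the details listed above.

```python
from xml.sax.saxutils import escape

def code_finder_body(rules):
    """Build the text content of the codeFinder element from raw patterns."""
    lines = ["#v1", f"count.i={len(rules)}"]  # header first, then the rule count
    for index, rule in enumerate(rules):
        # Rules are numbered from 0; escape() converts &, < and > for XML.
        lines.append(f"rule{index}={escape(rule)}")
    return "\n".join(lines)  # one entry per line, no surrounding spaces

body = code_finder_body([r"<(/?)\w[^>]*?>", r"(#\w+?\#)|(%\d+?%)"])
```

The resulting `body` starts with `#v1` and `count.i=2`, followed by the two rules with the first pattern written in its XML-escaped form.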
To facilitate the creation of code finder rules, Rainbow provides the Code Finder Editor.
omitXMLDeclaration
By default, an XML declaration is always set at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, as its name must be specified in the XML declaration. However, there are a few special cases when the declaration is better left off. To handle those rare cases, you can use omitXMLDeclaration to tell the filter not to output the XML declaration.
For example:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options omitXMLDeclaration="yes"/>
</its:rules>
Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.
escapeQuotes
By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping the literal quotes:
- Inside the attribute values:
  - Single-quotes (=apostrophes) are never escaped
  - Double-quotes are always escaped
- In element content:
  - Single-quotes (=apostrophes) are not escaped
  - Double-quotes are escaped by default
You cannot change the escaping rules for attributes.
For element content: if the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes into account the escapeQuotes option to escape or not double-quotes in the translatable content.
For example, the following parameters file specifies not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered):
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeQuotes="no"/>
</its:rules>
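The decision described above can be summarized as a small function. This Python sketch is one reading of the rules stated in this section, not the filter's code; the function name and its parameters are hypothetical.

```python
def escapes_double_quotes(in_attribute, escape_quotes=True,
                          attribute_rule_triggered=False):
    """Tell whether a literal double-quote is escaped in the output.

    escape_quotes stands for the escapeQuotes option ("yes" by default);
    attribute_rule_triggered is True when a rule allowing the translation
    of an attribute was triggered while processing the document.
    """
    if in_attribute:
        return True  # always escaped inside attribute values
    if attribute_rule_triggered:
        return True  # the option is ignored in that case
    return escape_quotes  # element content: follow the option
```

Single-quotes need no such function: according to the rules above they are never escaped, in attribute values or in element content.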
escapeGT
By default the character '>' is escaped. You can tell the filter not to escape it using the escapeGT option.
For example, the following parameters file indicates not to escape greater-than characters:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeGT="no"/>
</its:rules>
escapeNbsp
By default the non-breaking space character is escaped (in the form &#x00a0;). You can tell the filter not to escape it using the escapeNbsp option.
For example, the following parameters file indicates not to escape the non-breaking space characters:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeNbsp="no"/>
</its:rules>
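The effect of the escapeGT and escapeNbsp options on element content can be sketched as follows. This is an illustration only: the function is hypothetical, the `&#x00a0;` output form for the non-breaking space is an assumption, and the escaping of `&` and `<` is included because XML requires it in element content.

```python
def escape_element_content(text, escape_gt=True, escape_nbsp=True):
    """Escape element content, honoring the two options (both on by default)."""
    # The ampersand must be handled first so later replacements are not re-escaped.
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        out = out.replace(">", "&gt;")
    if escape_nbsp:
        # Assumed output form for the non-breaking space (U+00A0).
        out = out.replace("\u00a0", "&#x00a0;")
    return out
```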
extractIfOnlyCodes
By default all extractable entries are extracted even when they contain only white-spaces and/or inline codes. You can tell the filter not to extract such entries using the extractIfOnlyCodes option.
For example, the following parameters file indicates not to extract entries with only white-spaces and/or inline codes:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractIfOnlyCodes="no"/>
</its:rules>
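The test behind this option can be pictured as a check on the parts of an entry. In the Python sketch below, an entry is modeled as a list of `(kind, value)` tuples where kind is "text" or "code"; this representation and the function name are hypothetical and serve only to illustrate the rule.

```python
def should_extract(parts, extract_if_only_codes=True):
    """Tell whether an entry made of text and code parts is extracted."""
    # An entry has real text if at least one text part is not just white-space.
    has_text = any(kind == "text" and value.strip() for kind, value in parts)
    return has_text or extract_if_only_codes
```

With the option set to "no", an entry such as a lone `<br/>` code surrounded by spaces is skipped, while any entry that contains actual text is still extracted.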
inlineCdata
By default, CDATA sections will be exposed as regular content, and the CDATA markers themselves will be discarded. When the inlineCdata option is set, the CDATA markers will be exposed as inline codes.
For example, the following parameters file will expose CDATA markers as inline codes:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options inlineCdata="yes"/>
</its:rules>
extractUntranslatable
By default, untranslatable entries (its:translate="no") are not extracted. To allow the extraction of such entries for context reasons, use the extractUntranslatable option.
Below is an example of this option declaration:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractUntranslatable="yes"/>
</its:rules>
With this option, untranslatable content is extracted but marked as translate="no" in XLIFF.
Hint: when extractUntranslatable="yes", all untranslatable content is extracted by default. To exclude certain content from extraction, you can use the following rule and "misuse" the localeFilterList ITS attribute:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractUntranslatable="yes"/>
 <its:localeFilterRule selector="//yourTagThatShouldNotBeExtracted" localeFilterList="!*"/>
</its:rules>
Limitations
- Currently, in some cases, the ITS rule withinTextRule with the value nested may act like it has a value yes instead.
- In output, the values of the xml:lang attributes are not updated to reflect the target language.
- When doing the extraction, the whole input file is loaded into memory. You may run into memory limitations if the document is very large.