XML Filter

From Okapi Framework

Overview

This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement ITS. If you need to process very large XML documents and have no need for ITS, you may want to look at using the XML Stream Filter.

The following is an example of a simple XML document. The translatable text is highlighted. Because each XML-based format is different, the filter needs to know which parts are translatable, which elements are inline, and so on. The XML Filter implements the ITS W3C Recommendation to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
 <prolog>
  <author>Zebulon Fairfield</author>
  <version>version 12, revision 2 - 2006-08-14</version>
  <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
  <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
 </prolog>
 <body>
  <title>Appaloosa</title>
  <p>The Appaloosas are rugged horses originally bred by 
the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
  <p>They are often characterized by their spotted coats.</p>
 </body>
</myDoc>

This filter is implemented in the class net.sf.okapi.filters.xml.XMLFilter of the library.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, it is used.
  • Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
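
The decision above can be sketched in a few lines of Python. This is only an illustration of the logic, not the filter's actual code: the function name pick_input_encoding is hypothetical, and the sketch assumes an ASCII-compatible encoding so that the declaration can be read from the raw bytes.

```python
import re

# Matches an XML declaration at the start of the data and captures the
# value of its encoding pseudo-attribute.
_DECL = re.compile(
    rb"""<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def pick_input_encoding(head: bytes) -> str:
    """Return the declared encoding, or UTF-8 when none is declared.

    Hypothetical helper: a caller-supplied default is deliberately ignored,
    mirroring the rule described above.
    """
    if head.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte-order mark
        head = head[3:]
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"
```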

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.
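
As a rough sketch of these output rules, the Python below decides whether a UTF-8 output gets a Byte-Order-Mark and updates or adds the declaration. Both function names are hypothetical. The exact declaration the filter writes (attribute order, trailing line-break) is an assumption here, and this simplified version does not carry over other pseudo-attributes such as standalone.

```python
import re

# An XML declaration at the very start of the document text.
_DECL = re.compile(r"^<\?xml\s[^>]*\?>")

def utf8_output_uses_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """For UTF-8 output: keep a BOM only if the input was UTF-8 with a BOM."""
    is_utf8_input = input_encoding.lower().replace("_", "-") in ("utf-8", "utf8")
    return is_utf8_input and input_had_bom

def with_declaration(doc: str, output_encoding: str) -> str:
    """Update the existing declaration, or add one if it is missing."""
    decl = f'<?xml version="1.0" encoding="{output_encoding}"?>'
    if _DECL.match(doc):
        return _DECL.sub(lambda m: decl, doc, count=1)
    # Assumption: the added declaration is followed by a line-break.
    return decl + "\n" + doc
```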

Line-Breaks

The output uses the same type of line-break as the original input.

Parameters

This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "How to Create a Custom Configuration for the XML Filter".

ITS Support

By default, the filter processes XML documents based on the ITS defaults. That is:

  • the content of all elements is translatable,
  • and none of the attribute values is translatable.
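
To make these defaults concrete, here is a minimal Python sketch that lists what would be translatable in a tiny document when no rules apply. It uses the standard ElementTree module rather than the filter itself, and the function name default_translatable_text is hypothetical. Segmentation and inline-code handling are ignored.

```python
import xml.etree.ElementTree as ET

def default_translatable_text(xml_text: str) -> list:
    """List the non-empty element content of a document.

    itertext() walks element content only, so attribute values are never
    included, which mirrors the ITS defaults described above.
    """
    root = ET.fromstring(xml_text)
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]

sample = '<doc title="Not extracted"><p>Hello <b>world</b></p></doc>'
print(default_translatable_text(sample))  # ['Hello', 'world']
```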

Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is an ITS document.

The Internationalization Tag Set (ITS) is a W3C Recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as a nested sub-flow of text, and much more.

The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0).

See the "ITS" page for more details on the format.

The filter supports global and local rules and most data categories. See the ITS Components page for a detailed list of how the data categories are supported and other information on the implementation.

ITS Extensions

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

idValue and xml:id

Note: This extension was defined for ITS 1.0. ITS 2.0 offers the new Id Value data category, which should be used instead of this extension.

When the attribute xml:id is found on a translatable element, it is used as the name of the text unit generated for that element.

In the example below, the resource name associated with the text unit for the <p> element is "id1".

<p xml:id="id1">Text</p>

The attribute idValue used in the ITS translateRule element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of idValue must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

In the example below, the resource name associated with the text unit for the <p> element is "id1":

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p name="id1">text 1</p>
</doc>

Note that xml:id has precedence over an idValue declaration. In the example below, the resource name associated with the text unit for the <p> element is "xid1", not "id1".

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p xml:id="xid1" name="id1">text 1</p>
</doc>
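
The precedence rule can be sketched in Python as follows. This is an illustration only: resource_name is a hypothetical helper, and it handles just the simple case where idValue is an attribute reference such as "@name", passed here as the bare attribute name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resource_name(elem, id_value_attr):
    """Return xml:id when present, otherwise the attribute named by idValue."""
    xml_id = elem.get(XML_ID)
    if xml_id is not None:
        return xml_id
    return elem.get(id_value_attr)

doc = ET.fromstring(
    '<doc><p xml:id="xid1" name="id1">text 1</p><p name="id2">text 2</p></doc>'
)
first, second = doc.findall("p")
print(resource_name(first, "name"))   # xid1
print(resource_name(second, "name"))  # id2
```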

You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <text> and <desc> are translatable, but they have only one corresponding ID, the name attribute in their parent element. To make sure you have a unique identifier for both the content of <text> and the content of <desc>, you can use the rules set in the example. The XPath expression "concat(../@name, '_t')" will give the ID "id1_t" and the expression "concat(../@name, '_d')" will give the ID "id1_d".

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
  <its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
 </its:rules>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>
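
The effect of the two concat() rules can be mimicked in plain Python to check which IDs they produce. The standard ElementTree module does not evaluate XPath functions such as concat(), so this sketch hard-codes the two suffixes. The build_ids name and the SUFFIXES table are hypothetical.

```python
import xml.etree.ElementTree as ET

# Suffix appended to the parent's name attribute for each child element,
# standing in for concat(../@name, '_t') and concat(../@name, '_d').
SUFFIXES = {"text": "_t", "desc": "_d"}

def build_ids(xml_text: str) -> dict:
    """Map each generated ID to the content it identifies."""
    root = ET.fromstring(xml_text)
    ids = {}
    for parent in root.iter():
        for child in parent:
            if child.tag in SUFFIXES:
                # Like XPath concat(), a missing attribute contributes "".
                ids[parent.get("name", "") + SUFFIXES[child.tag]] = child.text
    return ids

sample = (
    '<doc><msg name="id1">'
    "<text>Value of text</text><desc>Value of desc</desc>"
    "</msg></doc>"
)
print(build_ids(sample))
# {'id1_t': 'Value of text', 'id1_d': 'Value of desc'}
```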

whiteSpaces

Note: This extension was defined for ITS 1.0. ITS 2.0 offers the new Preserve Space data category, which should be used instead of this extension.

The extension attribute whiteSpaces allows you to apply globally the equivalent of a local xml:space attribute.

For example, if you have a format where all <pre> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute whiteSpaces="preserve" in a <its:translateRule> element for the <pre> elements. In the example below, the spaces in the <pre> element will be preserved on extraction.

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
   xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre>Some txt with    many spaces.  </pre>
</doc>

Note that the xml:space attribute has precedence over whiteSpaces. In the following example, the white spaces in the content of <pre> may not be preserved because the attribute xml:space has the value default:

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
   xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre xml:space="default">Some txt with    many spaces.  </pre>
</doc>
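
The interaction between the two settings can be summarized with a small Python sketch. It is a simplification under stated assumptions: preserves_space is a hypothetical helper, rule_value stands for the whiteSpaces value of a matching rule, and inheritance of xml:space from ancestor elements is not modeled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def preserves_space(elem, rule_value=None) -> bool:
    """A local xml:space wins over the value set globally by a rule."""
    local = elem.get(XML_SPACE)
    effective = local if local is not None else rule_value
    return effective == "preserve"

plain = ET.fromstring("<pre>Some txt with    many spaces.  </pre>")
overridden = ET.fromstring('<pre xml:space="default">Some txt</pre>')
print(preserves_space(plain, "preserve"))       # True
print(preserves_space(overridden, "preserve"))  # False
```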

Filter Options

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI okapi-framework:xmlfilter-options.

Note: The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules have no effect.

When you use several options, they must be set in a single <okp:options> element, as shown below:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes"
              escapeQuotes="no"
              escapeGT="yes"
 />
</its:rules>


The following options are available:

lineBreakAsCode

In some cases the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with &#10; character references. They need to be passed as inline codes.

By default this option is disabled.

To enable this behavior, use the lineBreakAsCode option attribute. It affects all the extracted content.

For example, the following ITS document sets the option to treat line-breaks as codes. It can be used with the sample XML document listed after it.

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes"/>
</its:rules>

<doc>
 line 1&#10;line 2.
</doc>

codeFinder

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables or HTML tags that need to be protected from modification and treated as codes. Use the codeFinder element for this.

In the following parameters file, the codeFinder element defines two rules:

  • The first one (rule0) is "<(/?)\w+[^>]*?>" and matches any XML-type tag (e.g. "<b>", "</b>", "<br/>").
  • The second one (rule1) is "(#\w+?\#)|(%\d+?%)" and matches any word enclosed in # (e.g. "#VAR#") or number enclosed in % (e.g. "%1%").

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
 </okp:codeFinder>
</its:rules>

Some important details:

  • Set useCodeFinder to "yes" to have the rules used. If the attribute is missing, its value is assumed to be "no".
  • Make sure the first line of the <okp:codeFinder> element content is #v1.
  • Each entry in the content must be on a separate line.
  • count.i=N must come before any rule, and N must be the number of rules.
  • The rule names (ruleN) must be numbered sequentially, starting at 0.
  • The pattern for a rule must be escaped for XML. For example, "<(/?)\w[^>]*?>" must be entered as "&lt;(/?)\w[^&gt;]*?&gt;" in the parameters file.
  • Do not put spaces before count.i or ruleN, or after your expressions.
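
Before adding rules to a parameters file, it can help to try the patterns on sample text and to generate the XML-escaped entries mechanically. The Python sketch below does both. It assumes the two patterns behave the same in Python's re module as in the filter's Java regular expressions, and that joining the rules by alternation is a fair approximation of how they are applied. The names find_codes and code_finder_body are hypothetical.

```python
import re
from xml.sax.saxutils import escape

RULES = [
    r"<(/?)\w+[^>]*?>",        # rule0: XML-type tags
    r"(#\w+?\#)|(%\d+?%)",     # rule1: #VAR# and %1% style placeholders
]

def find_codes(text: str) -> list:
    """Return the spans of text that the rules would treat as inline codes."""
    pattern = re.compile("|".join(f"(?:{rule})" for rule in RULES))
    return [match.group(0) for match in pattern.finditer(text)]

def code_finder_body(rules: list) -> str:
    """Build the element content: #v1, count.i, then XML-escaped rules."""
    lines = ["#v1", f"count.i={len(rules)}"]
    lines += [f"rule{i}={escape(rule)}" for i, rule in enumerate(rules)]
    return "\n".join(lines)

print(find_codes("Click <b>OK</b> to delete #FILE# (%1% left)"))
# ['<b>', '</b>', '#FILE#', '%1%']
print(code_finder_body(RULES))
```

The second call prints the same four lines as the content of the codeFinder element in the parameters file above.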

To facilitate the creation of code finder rules, Rainbow provides the Code Finder Editor.

omitXMLDeclaration

By default an XML declaration is always written at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document, and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, because the encoding name must then be specified in the XML declaration. However, there are a few special cases where the declaration is better left off. To handle those rare cases, you can use omitXMLDeclaration to tell the filter not to output the XML declaration.

For example:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options omitXMLDeclaration="yes"/>
</its:rules>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.

escapeQuotes

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping literal quotes:

  • Inside the attribute values:
    • Single-quotes (=apostrophes) are never escaped
    • Double-quotes are always escaped
  • In element content:
    • Single-quotes (=apostrophes) are not escaped
    • Double-quotes are escaped by default

You cannot change the escaping rules for attributes.

For element content: If the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes into account the escapeQuotes option to decide whether to escape double-quotes in the translatable content.
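
The rules above amount to a small decision, sketched here in Python. The function names are hypothetical, and the use of &quot; as the escaped form is an assumption. Escaping of other characters (such as & and <) is left out to keep the focus on quotes.

```python
def escape_quotes_in_attribute(value: str) -> str:
    """Attribute values: double-quotes always escaped, apostrophes never."""
    return value.replace('"', "&quot;")

def escape_quotes_in_content(text: str,
                             escape_quotes: bool = True,
                             attr_rule_triggered: bool = False) -> str:
    """Element content: the escapeQuotes option is honored only when no
    rule for translatable attributes was triggered for the document."""
    if attr_rule_triggered or escape_quotes:
        return text.replace('"', "&quot;")
    return text

print(escape_quotes_in_content('Say "hi"', escape_quotes=False))
# Say "hi"
print(escape_quotes_in_content('Say "hi"', escape_quotes=False,
                               attr_rule_triggered=True))
# Say &quot;hi&quot;
```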

For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered):

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeQuotes="no"/>
</its:rules>

escapeGT

By default the character '>' is escaped. You can tell the filter not to escape it with the escapeGT option.

For example, the following parameters file tells the filter not to escape greater-than characters:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeGT="no"/>
</its:rules>

escapeNbsp

By default the non-breaking space character is escaped (in the form &#x00a0;). You can tell the filter not to escape it with the escapeNbsp option.

For example, the following parameters file tells the filter not to escape non-breaking space characters:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeNbsp="no"/>
</its:rules>

extractIfOnlyCodes

By default all extractable entries are extracted, even when they contain only white-spaces and/or inline codes. You can tell the filter not to extract such entries with the extractIfOnlyCodes option.

For example, the following parameters file tells the filter not to extract entries with only white-spaces and/or inline codes:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractIfOnlyCodes="no"/>
</its:rules>

inlineCdata

By default, CDATA sections will be exposed as regular content, and the CDATA markers themselves will be discarded. When the inlineCdata option is set, the CDATA markers will be exposed as inline codes.

For example, the following parameters file will expose CDATA markers as inline codes:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options inlineCdata="yes"/>
</its:rules>

extractUntranslatable

By default, untranslatable entries (its:translate="no") are not extracted. To extract such entries, for example to provide context, use the extractUntranslatable option.

Below is an example of this option declaration:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractUntranslatable="yes"/>
</its:rules>

With this option, untranslatable content is extracted but marked as translate="no" in XLIFF.

Hint: When extractUntranslatable="yes", all untranslatable content is extracted by default. To extract only some of it and exclude the rest, you can use the following rule and "misuse" the localeFilterList ITS attribute:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options extractUntranslatable="yes"/>
 <its:localeFilterRule selector="//yourTagThatShouldNotBeExtracted" localeFilterList="!*"/>
</its:rules>

Limitations

  • Currently, in some cases, the ITS rule withinTextRule with the value nested may act like it has a value yes instead.
  • In output, the values of the xml:lang attributes are not updated to reflect the target language.
  • During extraction, the whole input file is loaded into memory. You may run into memory limitations if the document is very large.