Okapi Framework - Filters

XML Filter (BETA)

- Overview
- ITS Support
- ITS Extensions
- Filter Options
- Processing Details
- Parameters

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=XML_Filter

Overview

This filter allows you to process XML documents.

The following is an example of a simple XML document. The translatable text is underlined. Because each XML-based format is different, you need information about which parts are translatable, which elements are inline, and so on. The XML Filter implements the W3C ITS Recommendation to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
 <prolog>
  <author>Zebulon Fairfield</author>
  <version>version 12, revision 2 - 2006-08-14</version>
  <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
  <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
 </prolog>
 <body>
  <title>Appaloosa</title>
  <p>The Appaloosas are rugged horses originally bred by 
the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
  <p>They are often characterized by their spotted coats.</p>
 </body>
</myDoc>

This filter is implemented in the class net.sf.okapi.filters.xml.XMLFilter of the Okapi library.

ITS Support

The Internationalization Tag Set (ITS) is a W3C Recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, you can define which attribute values are translatable, which element content should be protected, which elements should be treated as a nested sub-flow of text, and much more.

The ITS specification is available at http://www.w3.org/TR/its/.

By default the filter processes XML documents based on the ITS defaults. That is: the content of all elements is translatable, and none of the attribute values are. To modify this behavior you need to associate the document with ITS rules. This can be done in different ways.

When processing a document, the filter...

  1. Assumes that all element content is translatable, and none of the attribute values are translatable.
  2. Applies the global rules found in the (optional) parameters file associated with the input document.
  3. Applies the global rules found in the document.
  4. And finally, applies the local rules within the document.

For example, assuming that ITSForDoc.xml is the ITS file associated with the input file Document.xml, the translatable text is listed below.

ITSForDoc.xml:

<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="//head" translate="no"/>
 <its:withinTextRule selector="//b|//code|//img" withinText="yes"/>
</its:rules>

Document.xml:

<doc>
 <head>
  <update>2009-03-21</update>
  <author>Mirabelle McIntosh</author>
 </head>
 <body>
  <p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
  <p>Paragraph with <code>data codes</code> and text.</p>
 </body>
</doc>

The resulting text units are (in XLIFF 1.2 notation):

1: "Paragraph with <x id='1'/> and <g id='2'>bolded text</g>."
2: "Paragraph with <g id='1'>data codes</g> and text."
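
The layering described above can be sketched in Python. This is only an illustration, not the filter's implementation: the two rules from ITSForDoc.xml are hand-written as sets rather than parsed from the file, the helper names (inline, text_units) are hypothetical, and inline elements are kept as bare tags instead of XLIFF codes. It assumes that translate="no" is inherited by descendant elements and that an element becomes a text unit once all of its children are within-text elements.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<doc>
 <head>
  <update>2009-03-21</update>
  <author>Mirabelle McIntosh</author>
 </head>
 <body>
  <p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
  <p>Paragraph with <code>data codes</code> and text.</p>
 </body>
</doc>"""

# Hand-written stand-ins for the two rules in ITSForDoc.xml (not parsed from it).
NOT_TRANSLATABLE = {"head"}          # selector="//head" translate="no"
WITHIN_TEXT = {"b", "code", "img"}   # selector="//b|//code|//img" withinText="yes"

def inline(elem):
    """Flatten an element's content, keeping within-text children as bare tags."""
    parts = [elem.text or ""]
    for child in elem:
        inner = inline(child)
        parts.append(f"<{child.tag}>{inner}</{child.tag}>" if inner else f"<{child.tag}/>")
        parts.append(child.tail or "")
    return "".join(parts)

def text_units(elem, translate=True):
    """Collect text units; the default is translatable, and "no" is inherited."""
    if elem.tag in NOT_TRANSLATABLE:
        translate = False
    blocks = [c for c in elem if c.tag not in WITHIN_TEXT]
    if not blocks:
        content = inline(elem).strip()
        return [content] if translate and content else []
    units = []
    for child in blocks:
        units.extend(text_units(child, translate))
    return units

units = text_units(ET.fromstring(DOCUMENT))
for number, unit in enumerate(units, start=1):
    print(f"{number}: {unit}")
```

The content of <head> yields no text unit because the translate="no" decision made for <head> carries down to <update> and <author>.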

ITS Extensions

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

idValue and xml:id

When the attribute xml:id is found on a translatable element, it is used as the name of the text unit generated for that element.

In the example below, the resource name associated with the text unit for the <p> element is "id1".

<p xml:id="id1">Text</p>

The attribute idValue used in the ITS translateRule element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of idValue must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

In the example below, the resource name associated with the text unit for the <p> element is "id1".

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p name="id1">text 1</p>
</doc>

Note that xml:id takes precedence over an idValue declaration. In the example below, the resource name associated with the text unit for the <p> element is "xid1", not "id1".

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p xml:id="xid1" name="id1">text 1</p>
</doc>
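
The precedence between xml:id and idValue can be pictured with a short Python sketch. It is not the filter's code: the function resource_name and its id_attr parameter are hypothetical, and the idValue expression "@name" is emulated by a plain attribute lookup rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resource_name(elem, id_attr="name"):
    """Return xml:id when present, otherwise the attribute that idValue points to."""
    return elem.get(XML_ID) or elem.get(id_attr)

only_name = ET.fromstring('<p name="id1">text 1</p>')
both = ET.fromstring('<p xml:id="xid1" name="id1">text 1</p>')

print(resource_name(only_name))  # id1
print(resource_name(both))       # xid1
```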

You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <text> and <desc> are translatable, but they have only one corresponding ID, the name attribute in their parent element. To make sure you have a unique identifier for both the content of <text> and the content of <desc>, you can use the rules set in the example. The XPath expression "concat(../@name, '_t')" will give the ID "id1_t" and the expression "concat(../@name, '_d')" will give the ID "id1_d".

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
  <its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
 </its:rules>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>
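
To make the effect of the two concat() expressions concrete, here is a small Python emulation. Python's standard ElementTree module does not evaluate XPath functions such as concat(), so the sketch walks from each <msg> parent to its children and joins the strings by hand. The names SUFFIX and build_ids are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<doc>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>"""

# Stand-ins for concat(../@name, '_t') and concat(../@name, '_d').
SUFFIX = {"text": "_t", "desc": "_d"}

def build_ids(xml_source):
    """Map each generated ID to the content of the element it identifies."""
    ids = {}
    for msg in ET.fromstring(xml_source).iter("msg"):
        for child in msg:
            if child.tag in SUFFIX:
                ids[msg.get("name") + SUFFIX[child.tag]] = child.text
    return ids

print(build_ids(DOCUMENT))
```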

whiteSpaces

The extension attribute whiteSpaces allows you to apply globally the equivalent of a local xml:space attribute.

For example, if you have a format where all <pre> elements must have their spaces, tabs and line breaks preserved, you can add the whiteSpaces="preserve" attribute to an <its:translateRule> element for the <pre> elements. In the example below, the spaces in the <pre> element will be preserved on extraction.

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre>Some txt with    many spaces.  </pre>
</doc>

Note that xml:space takes precedence over whiteSpaces. In the following example, the white spaces in the content of <pre> may not be preserved.

<doc>
 <its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre xml:space="default">Some txt with    many spaces.  </pre>
</doc>
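
The interaction between a local xml:space and a global whiteSpaces rule can be sketched as follows. This is a guess at the general idea, not the filter's behavior: the function extracted_text and its parameters are hypothetical, and the assumption that non-preserved text is collapsed to single spaces and trimmed is made only for the sake of the example.

```python
import re

def extracted_text(text, rule_white_spaces=None, xml_space=None):
    """Pick the whitespace mode: local xml:space first, then the global rule."""
    mode = xml_space or rule_white_spaces or "default"
    if mode == "preserve":
        return text
    # Assumed normalization for "default": collapse runs of whitespace and trim.
    return re.sub(r"\s+", " ", text).strip()

content = "Some txt with    many spaces.  "
print(repr(extracted_text(content, rule_white_spaces="preserve")))
print(repr(extracted_text(content, rule_white_spaces="preserve", xml_space="default")))
```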

Filter Options

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI okapi-framework:xmlfilter-options.

Important: The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules will have no effect.

lineBreakAsCode

In some cases the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with &#10; character references. They need to be passed as inline codes.

By default this option is set to false.

To enable it, use the lineBreakAsCode option attribute. This affects all the extracted content.

For example, the following ITS document sets the option to treat line-breaks as codes. It can be used with the example XML document listed after it.

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes"/>
</its:rules>

<doc>
 <data>line 1&#10;line 2.</data>
</doc>
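
The reason the option is needed can be seen with any XML parser: the &#10; character reference is resolved to a real line feed before the text reaches the application. The Python sketch below shows this, then swaps the line feed for a placeholder. The placeholder "<lb/>" and the function line_breaks_as_codes are hypothetical; the way the filter actually represents the inline code is not specified here.

```python
import xml.etree.ElementTree as ET

data = ET.fromstring("<doc><data>line 1&#10;line 2.</data></doc>").find("data")

# The parser has already turned &#10; into a line feed character.
print(repr(data.text))

def line_breaks_as_codes(text, placeholder="<lb/>"):
    """Replace each line feed with a placeholder standing in for an inline code."""
    return text.replace("\n", placeholder)

print(line_breaks_as_codes(data.text))
```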

codeFinder

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables or HTML tags that need to be protected from modification and treated as codes. Use the codeFinder element for this.

In the following parameters file, the codeFinder element defines two rules:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w[^&lt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
 </okp:codeFinder>
</its:rules>
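
To see what the two rules capture, the sketch below runs them in Python. Because the rules sit inside an XML file, the < and > characters are written as &lt; and &gt;, so they are unescaped first. Python's re module is used only for illustration and the filter's own regular-expression engine may differ in details; the sample sentence and the function find_codes are made up for this example.

```python
import re
from xml.sax.saxutils import unescape

# The two rules exactly as they appear in the parameters file.
RULES_AS_WRITTEN = [
    r"&lt;(/?)\w[^&lt;]*?&gt;",
    r"(#\w+?\#)|(%\d+?%)",
]

# Undo the XML escaping, then combine the rules into one alternation.
rules = [unescape(rule) for rule in RULES_AS_WRITTEN]
pattern = re.compile("|".join(f"(?:{rule})" for rule in rules))

def find_codes(text):
    """Return the spans that would be treated as inline codes."""
    return [match.group(0) for match in pattern.finditer(text)]

print(rules[0])
print(find_codes("Click <b>#name#</b> to open %1% files."))
```

The first rule matches opening and closing HTML-like tags; the second matches variables such as #name# and numbered placeholders such as %1%.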

Some important details:

omitXMLDeclaration

By default an XML declaration is always set at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, as its name must be specified in the XML declaration. However, there are a few special cases when the declaration is better left off. To handle those rare cases, you can use the omitXMLDeclaration option to not output the XML declaration.

For example:

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options omitXMLDeclaration="yes"/>
</its:rules>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.
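
The risk can be reproduced with a few lines of Python. The sketch encodes a small document in ISO-8859-1 and parses it with and without an XML declaration. The sample content and the helper name parses are assumptions made for the illustration, and other parsers may fail in a different way, for example by silently producing wrong characters.

```python
import xml.etree.ElementTree as ET

# A document with one non-ASCII character, stored as ISO-8859-1 bytes.
body = "<doc>caf\u00e9</doc>".encode("iso-8859-1")
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

def parses(data):
    """Report whether the bytes can be parsed as XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

print(parses(declared))  # the declaration names the encoding
print(parses(body))      # no declaration: the parser assumes UTF-8
```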

escapeQuotes

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping the literal quotes:

You cannot change the escaping rules for attributes.

For element content: If the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes into account the escapeQuotes option to escape or not escape double-quotes in the translatable content.

For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered).

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeQuotes="no"/>
</its:rules>

escapeGT

By default the character '>' is escaped. You can tell the filter not to escape it using the escapeGT option.

For example, the following parameters file tells the filter not to escape greater-than characters.

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeGT="no"/>
</its:rules>

escapeNbsp

By default the non-breaking space character is escaped (in the form &#x00a0;). You can tell the filter not to escape it using the escapeNbsp option.

For example, the following parameters file tells the filter not to escape non-breaking space characters.

<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options escapeNbsp="no"/>
</its:rules>
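
The combined effect of escapeQuotes, escapeGT and escapeNbsp on element content can be summarized with a small Python sketch. It is an approximation under stated assumptions, not the filter's code: the function escape_content is hypothetical, the use of &quot; for double-quotes is assumed, and the condition about translatable attributes that governs escapeQuotes is left out.

```python
def escape_content(text, escape_quotes=True, escape_gt=True, escape_nbsp=True):
    """Escape element content; & and < must always be escaped in XML."""
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_quotes:
        out = out.replace('"', "&quot;")       # assumed form for escaped quotes
    if escape_gt:
        out = out.replace(">", "&gt;")
    if escape_nbsp:
        out = out.replace("\u00a0", "&#x00a0;")  # form given for escapeNbsp
    return out

sample = 'a > "b"\u00a0c'
print(escape_content(sample))
print(escape_content(sample, escape_quotes=False, escape_gt=False, escape_nbsp=False))
```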

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

Output Encoding

If the output encoding is UTF-8:

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.

Line-Breaks

The output uses the same type of line-break as the original input.
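
One way to picture this behavior is the Python sketch below: detect the line-break type of the input, work on normalized text, and put the original type back on output. The function names are hypothetical and the detection strategy (checking for "\r\n" first, then "\r") is an assumption, not a description of the filter's internals.

```python
def detect_line_break(text):
    """Guess the line-break type used by the input text."""
    if "\r\n" in text:
        return "\r\n"   # Windows
    if "\r" in text:
        return "\r"     # old Mac OS
    return "\n"         # Unix, also the fallback when no break is found

def restore_line_breaks(text, line_break):
    """Normalize to line feeds, then write the requested line-break type."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", line_break)

original = "<doc>\r\n <p>Text</p>\r\n</doc>\r\n"
line_break = detect_line_break(original)
print(repr(restore_line_breaks("<doc>\n <p>Texte</p>\n</doc>\n", line_break)))
```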

Parameters

The parameters for the XML filter are stored in an ITS document.