Okapi Framework - Filters
XML Filter (BETA)
Overview
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=XML_Filter
This filter allows you to process XML documents.
The following is an example of a simple XML document. The translatable text is underlined. Because each XML-based format is different, the filter needs to know which parts are translatable, which elements are inline, and so on. The XML Filter implements the ITS W3C Recommendation to address this issue.
<?xml version="1.0" encoding="utf-8"?>
<myDoc>
 <prolog>
  <author>Zebulon Fairfield</author>
  <version>version 12, revision 2 - 2006-08-14</version>
  <keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
  <storageKey>articles-6D272BA9-3B89CAD8</storageKey>
 </prolog>
 <body>
  <title>Appaloosa</title>
  <p>The Appaloosas are rugged horses originally bred by the <kw>Nez-Perce</kw> tribe in the US Northwest.</p>
  <p>They are often characterized by their spotted coats.</p>
 </body>
</myDoc>
This filter is implemented in the class
net.sf.okapi.filters.xml.XMLFilter of the Okapi library.
The Internationalization Tag Set (ITS) is a W3C recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, it lets you define which attribute values are translatable, which element content should be protected, which elements should be treated as a nested sub-flow of text, and much more.
The ITS specification is available at http://www.w3.org/TR/its/.
By default the filter processes XML documents based on the ITS defaults. That is: the content of all elements is translatable, and no attribute values are. To modify this behavior you need to associate the document with ITS rules. This can be done in different ways:
When processing a document, the filter...
For example, assuming that ITSForDoc.xml is the ITS file
associated with the input file Document.xml, the translatable text
is listed below.
ITSForDoc.xml:
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="//head" translate="no"/>
 <its:withinTextRule selector="//b|//code|//img" withinText="yes"/>
</its:rules>
Document.xml:
<doc>
 <head>
  <update>2009-03-21</update>
  <author>Mirabelle McIntosh</author>
 </head>
 <body>
  <p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
  <p>Paragraph with <code>data codes</code> and text.</p>
 </body>
</doc>
The resulting text units are (in XLIFF 1.2 notation):
1: "Paragraph with <x id='1'/> and <g id='2'>bolded text</g>."
2: "Paragraph with <g id='1'>data codes</g> and text."
The filter supports extensions to the ITS specification. These extensions use
the namespace URI http://www.w3.org/2008/12/its-extensions.
When the attribute xml:id is found on a translatable element, it
is used as the name of the text unit generated for that element.
In the example below, the resource name associated with the text
unit for the <p> element is "id1".
<p xml:id="id1">Text</p>
The attribute idValue used in the ITS translateRule
element allows you to define an XPath expression that corresponds to the identifier value for the
given selection. The value of idValue must be an expression that
can return a string. A node location is a valid expression: it will return the
value of the first node at the given location.
In the example below, the resource name associated with the text
unit for the <p> element is "id1".
<doc>
 <its:rules version="1.0"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p name="id1">text 1</p>
</doc>
Note that xml:id has precedence over an idValue
declaration. In the example below, the resource name associated
with the text unit for the <p> element is "xid1", not
"id1".
<doc>
 <its:rules version="1.0"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
 </its:rules>
 <p xml:id="xid1" name="id1">text 1</p>
</doc>
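The precedence rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the filter's own code: the helper name resource_name is hypothetical, and the idValue expression is reduced to the simple "@name" case.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resource_name(elem, id_attr="name"):
    """Hypothetical helper: xml:id wins over the attribute named by idValue."""
    return elem.get(XML_ID) or elem.get(id_attr)

doc = ET.fromstring(
    '<doc><p xml:id="xid1" name="id1">text 1</p><p name="id2">text 2</p></doc>'
)
names = [resource_name(p) for p in doc.iter("p")]
print(names)
```

The first <p> carries both attributes and resolves to its xml:id; the second only has name, so the idValue result is used.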
You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.
For example, in the file below, the two elements <text> and
<desc> are translatable, but they have only one corresponding ID,
the name attribute in their parent element. To make sure you have a
unique identifier for both the content of <text> and the content of
<desc>, you can use the rules set in the example. The XPath
expression "concat(../@name, '_t')" will give the ID "id1_t"
and the expression "concat(../@name, '_d')" will give the ID "id1_d".
<doc>
 <its:rules version="1.0"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
  <its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
 </its:rules>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>
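To see what the two concat() expressions compute, here is a small Python sketch. Python's standard ElementTree does not evaluate XPath string functions, so the code emulates "concat(../@name, suffix)" by hand; the SUFFIXES table and the build_ids helper are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<doc><msg name="id1">'
    "<text>Value of text</text><desc>Value of desc</desc>"
    "</msg></doc>"
)

# One suffix per element name, mirroring the two translateRule elements.
SUFFIXES = {"text": "_t", "desc": "_d"}

def build_ids(root):
    """Emulate concat(../@name, suffix): parent's name attribute plus a suffix."""
    ids = {}
    for parent in root.iter():
        for child in parent:
            if child.tag in SUFFIXES:
                ids[parent.get("name", "") + SUFFIXES[child.tag]] = child.text
    return ids

ids = build_ids(doc)
print(ids)
```

Each translatable element ends up with its own identifier even though only the parent <msg> carries a name.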
The extension attribute whiteSpaces allows you to apply globally
the equivalent of a local xml:space attribute.
For example, if you have a format where all <pre> elements must have their
spaces, tabs and line breaks preserved, you can add the whiteSpaces="preserve"
attribute to an <its:translateRule> element that selects the <pre>
elements. In the example below, the spaces in the <pre> element
will be preserved on extraction.
<doc>
 <its:rules version="1.0"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre>Some txt with many spaces. </pre>
</doc>
Note that xml:space has precedence over whiteSpaces.
In the following example, the white spaces in the content of
<pre> may not be preserved.
<doc>
 <its:rules version="1.0"
  xmlns:its="http://www.w3.org/2005/11/its"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
  <its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
 </its:rules>
 <pre xml:space="default">Some txt with many spaces. </pre>
</doc>
The filter also supports options in addition to ITS and the ITS extensions. These
options use
the namespace URI okapi-framework:xmlfilter-options.
Important: The filter options must be placed in the parameters file (.fprm)
used with the filter, not in embedded or linked ITS rules. Options placed in
embedded or linked ITS rules will have no effect.
In some cases the content of an element includes line-breaks that need to be
included as part of the content, but without using an actual line-break in the
extracted text. For example, in some XML documents generated by Excel, the
formatting of the cells is marked up with &#10; entity references.
They need to be passed as inline codes.
To do this, use the lineBreakAsCode extension attribute. It affects all the
extracted content. By default this option is disabled.
For example, the following ITS document sets the option to treat line-breaks as codes. It can be used along with the XML document listed after it.
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its" xmlns:okp="okapi-framework:xmlfilter-options"> <okp:options lineBreakAsCode="yes"/> </its:rules>
<doc>
 <data>line 1&#10;line 2.</data>
</doc>
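The reason such content needs special handling can be shown with any XML parser: a line-break character reference is resolved into a real line-break as soon as the document is read. The sketch below uses Python's standard ElementTree and assumes the reference is &#10;, which is an assumption about the input rather than something the filter requires.

```python
import xml.etree.ElementTree as ET

# The parser replaces the character reference with an actual line-break.
data = ET.fromstring("<doc><data>line 1&#10;line 2.</data></doc>").find("data")
content = data.text
print(repr(content))
```

Without an option such as lineBreakAsCode, the extracted text would therefore contain a literal line-break instead of an inline code.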
You can define a set of regular expressions to capture spans of extracted text
that should be treated as inline codes. For example, some element content may
have variables or HTML tags that need to be protected from modification and
treated as codes. Use the codeFinder element for this.
In the following parameters file, the codeFinder element defines two rules:
- rule0 is "<(/?)\w[^>]*?>" and matches any XML-type tag (e.g. "<b>", "</b>", "<br/>").
- rule1 is "(#\w+?\#)|(%\d+?%)" and matches any word enclosed in # (e.g. "#VAR#") or any number enclosed in % (e.g. "%1%").
<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w[^&lt;]*?>
rule1=(#\w+?\#)|(%\d+?%)
 </okp:codeFinder>
</its:rules>
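The two expressions described above can be tried out before putting them in a parameters file. The sketch below uses Python's re module; the filter itself runs on Java, so this assumes the patterns behave the same in both regex engines (they use only common syntax). The find_codes helper and the way it combines the rules with "|" are illustrative choices, not the filter's implementation.

```python
import re

# The two expressions from the codeFinder example, in their unescaped form.
RULES = [r"<(/?)\w[^>]*?>", r"(#\w+?\#)|(%\d+?%)"]

def find_codes(text, rules=RULES):
    """Return the spans of text matched by any rule, in document order."""
    combined = re.compile("|".join(f"(?:{rule})" for rule in rules))
    return [m.group(0) for m in combined.finditer(text)]

codes = find_codes("Click <b>#VAR#</b> for %1% items.<br/>")
print(codes)
```

Everything returned by find_codes would be protected as inline code; the remaining text ("Click", "for", "items.") stays translatable.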
Some important details:
- Set useCodeFinder to "yes" to have the rules used. If the attribute is missing, its value is assumed to be "no".
- The <codeFinder> element content starts with #v1.
- count.i=N must be before any rules, and N must be the number of rules.
- The N in ruleN must be incremented starting at 0.
- The expression must be escaped for XML: "<(/?)\w[^>]*?>" must be entered "&lt;(/?)\w[^&lt;]*?>" in the parameters file.
- Do not put spaces before count.i or ruleN, and not after your expressions.
By default an XML declaration is always set at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document, and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, as the encoding name must be specified in the XML declaration. However, there are a few special cases when the declaration is better left off. To handle those rare cases, you can use the omitXMLDeclaration option to not output the XML declaration.
For example:
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its" xmlns:okp="okapi-framework:xmlfilter-options"> <okp:options omitXMLDeclaration="yes"/> </its:rules>
Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.
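The following Python sketch illustrates the risk, using ISO-8859-1 as an arbitrary example of a non-Unicode encoding. It relies only on the standard ElementTree parser, which, like most XML parsers, assumes UTF-8 when no declaration or byte-order mark is present.

```python
import xml.etree.ElementTree as ET

text = "<doc>caf\u00e9</doc>"
declared = '<?xml version="1.0" encoding="ISO-8859-1"?>' + text

# With the declaration, the parser knows how to decode the bytes.
with_decl = ET.fromstring(declared.encode("iso-8859-1")).text

# Without it, the parser assumes UTF-8 and rejects the Latin-1 byte.
try:
    ET.fromstring(text.encode("iso-8859-1"))
    without_decl = "parsed"
except ET.ParseError:
    without_decl = "error"

print(with_decl, without_decl)
```

Here the undeclared document fails outright; with other byte sequences a parser may instead read the wrong characters silently, which is harder to notice.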
By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping literal quotes:
You cannot change the escaping rules for attributes.
For element content: If the document is processed without triggering any rule
that allows the translation of an attribute, then (and only then) the filter
takes into account the escapeQuotes option to escape or not
double-quotes in the translatable content.
For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered).
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its" xmlns:okp="okapi-framework:xmlfilter-options"> <okp:options escapeQuotes="no"/> </its:rules>
By default the character '>' is escaped. You can tell the filter
not to escape it using the escapeGT option.
For example, the following parameters file tells the filter not to escape greater-than characters.
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its" xmlns:okp="okapi-framework:xmlfilter-options"> <okp:options escapeGT="no"/> </its:rules>
By default the non-breaking space character is escaped (in the form &#x00a0;). You can tell
the filter not to escape it using the escapeNbsp option.
For example, the following parameters file tells the filter not to escape non-breaking space characters.
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its" xmlns:okp="okapi-framework:xmlfilter-options"> <okp:options escapeNbsp="no"/> </its:rules>
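The effect of the escapeGT and escapeNbsp options on element content can be pictured with a small Python sketch. This is a hypothetical model, not the filter's code: the escape_text function and its parameter names are invented for the illustration, and the "&#x00a0;" output form for the non-breaking space is an assumption.

```python
def escape_text(text, escape_gt=True, escape_nbsp=True):
    """Hypothetical sketch of content escaping driven by two options."""
    # Ampersand and less-than must always be escaped in XML content.
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        out = out.replace(">", "&gt;")
    if escape_nbsp:
        # Assumed output form for the non-breaking space.
        out = out.replace("\u00a0", "&#x00a0;")
    return out

default = escape_text("a > b\u00a0& c")
relaxed = escape_text("a > b\u00a0& c", escape_gt=False, escape_nbsp=False)
print(default)
print(relaxed)
```

With both options turned off, only the characters that XML requires to be escaped in content are changed.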
The filter decides which encoding to use for the input document using the following logic:
If the output encoding is UTF-8:
If the original document had an XML encoding declaration, it is updated; if it did not, one is automatically added.
The output uses the same type of line-break as the original input.
The parameters for the XML filter are stored in an ITS document.