How to Create a Custom Configuration for the XML Filter

From Okapi Framework

Imagine that you need to process an XML document. By default the XML Filter uses the default ITS rules for this:

  • The content of each element is extracted as a separate text unit.
  • No attribute value is extracted.

This may be fine for some files, but most likely you will have to adjust those rules for your documents.

This is done by creating a custom configuration.

For example, in the document shown below, using the default rules would result in extracting the content of every element and none of the attribute values (the original page highlights the extracted text in yellow):

<?xml version="1.0"?>
<doc>
 <p pid="p1">This is the picture of Paul's <b>dopplergänger</b>: 
  <img src="pict.png" alt="Paul's dopplergänger"/></p>
 <p pid="p2">In this photo he is eating marionberry jam. The marionberry is pretty good.
  It's a cross between <b>olallieberry</b> and <b>Chehalem blackberry</b>. 
  <reviewNote author="Moe">Too much details maybe?</reviewNote>
  The olallieberry is a cross between the <i>loganberry</i> and the <i>youngberry</i>.
  The Chehalem blackberry is a cross between the <i>Himalayan blackberry</i> and the <i>Santiam berry</i>.
 </p>
 <p>This figure shows the "heritage" of the marionberry:</p>
 <drawing ref="http://en.wikipedia.org/wiki/Marionberry">
                           Raspberry   Blackberry   Dewberry
                               ^         ^   ^         ^
                               |         |   |         |
                               +----+----+   +----+----+
                                    |             |
       Pacific blackberry       Loganberry    Youngberry
                ^                  ^  ^           ^
                |                  |  |           |
                +---------+--------+  +-----+-----+
                          |                 |
Himalayan berry     Santiam berry           |
        ^                 ^                 |
        |                 |                 |
        +--------+--------+                 |
                 |                          |
          Chehlem blackberry          Olallieberry
                    ^                       ^
                    |                       |
                    +-----------+-----------+
                                |
                          Marionberry
</drawing>
 <p>This simply means that the marionberry is a variety of blackberry!</p>
</doc>

The issues are:

  • The content of <reviewNote> should not be extracted: It's an authoring note embedded in the document.
  • The value of the attribute alt in the <img> element should be extracted.
  • The elements <img>, <b>, <i> and <reviewNote> should be treated as inline codes, and not as structural codes that break paragraphs.
  • The formatting of the content of the <drawing> element needs to be preserved.
  • The <p> elements have a unique identifier in their pid attribute. It would be nice to be able to carry that value along with the text unit that holds the extracted content for a given <p> element.

The following steps describe how to create a configuration to resolve these issues.

1. Start Rainbow.

2. Drop the document to process in the Input List 1 tab of the main window.

3. Select the command Input > Edit Document properties. This opens the Input Document Properties dialog.

4. Select XML Filter in the Filter configuration list. This displays the list of the pre-defined configurations for the filter, as well as any existing custom configurations.

5. Click on Create, enter the name of the new customized configuration, for example "myFormat", and click OK. This creates a parameters file named okf_xml@myFormat.fprm. It also opens a text box with a default ITS template file you can use to create your rules.

An ITS file is an XML file where you specify rules that drive the extraction.

Extraction

The first adjustment is to tell the filter to not extract the content of <reviewNote>. This is done using the following <translateRule>:

<its:translateRule selector="//reviewNote" translate="no"/>

The value for selector is an XPath expression that specifies the nodes (i.e. elements or attributes) to which the rule applies. Here, the expression //reviewNote means: "any element named 'reviewNote' anywhere in the document". The value for translate indicates if the selected nodes should be translated or not.

XPath is a powerful language. The version used in ITS 1.0 is XPath 1.0.
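It can help to try a selector outside of Rainbow before putting it in a rule. The following is a minimal sketch in Python, not part of Okapi. It assumes the standard-library xml.etree.ElementTree module, which supports only a subset of XPath 1.0 and expects the expression to start with ".//" rather than "//". The sample is a shortened copy of the example document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document (same element names and nesting).
SAMPLE = """<doc>
 <p pid="p1">This is the picture of Paul's <b>dopplergänger</b>:
  <img src="pict.png" alt="Paul's dopplergänger"/></p>
 <p pid="p2">It's a cross between <b>olallieberry</b> and <b>Chehalem blackberry</b>.
  <reviewNote author="Moe">Too much details maybe?</reviewNote>
  The olallieberry is a cross between the <i>loganberry</i> and the <i>youngberry</i>.
 </p>
</doc>"""

root = ET.fromstring(SAMPLE)

# ElementTree counterpart of the ITS selector //reviewNote:
# every element named 'reviewNote', at any depth.
notes = root.findall(".//reviewNote")

for note in notes:
    # These are the nodes the rule would mark as translate="no".
    print(note.tag, "by", note.get("author"), "->", note.text)
```

If the loop prints nothing, the selector matches no node in your sample, and the rule would have no effect on that document.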

The next adjustment is to tell the filter to extract the value of the alt attribute in any <img> element. This is done using the following <translateRule>:

<its:translateRule selector="//*/@alt" translate="yes"/>

Here the selector means: "the attribute 'alt' in any element, anywhere in the document". If you want to limit the extraction to only the alt attributes in <img>, you can use selector="//img/@alt".
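The difference between the two selectors only shows up when an element other than <img> carries an alt attribute. The sketch below illustrates this in Python with the standard-library xml.etree.ElementTree module. ElementTree cannot select attribute nodes directly, so the example selects the elements that have the attribute and then reads its value, which only approximates the ITS selectors. The <figure> element is hypothetical and is not part of the example document.

```python
import xml.etree.ElementTree as ET

# <img> comes from the example document; <figure> is a hypothetical
# element added only to show where the two selectors differ.
SAMPLE = """<doc>
 <p pid="p1">A picture: <img src="pict.png" alt="Paul's dopplergänger"/></p>
 <figure alt="Hypothetical second alt"/>
</doc>"""

root = ET.fromstring(SAMPLE)

# Roughly //*/@alt: any element that has an alt attribute.
any_alt = [e.get("alt") for e in root.findall(".//*[@alt]")]

# Roughly //img/@alt: only <img> elements that have an alt attribute.
img_alt = [e.get("alt") for e in root.findall(".//img[@alt]")]

print("any element:", any_alt)
print("img only:   ", img_alt)
```

The broader selector picks up both values, while the narrower one picks up only the value on <img>.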

Inline Codes

The next adjustment is to tell the filter to treat the elements <img>, <b>, <i> and <reviewNote> as inline codes, that is, elements that should be extracted along with the surrounding text and protected. In other words, they should not indicate the end of a paragraph. In XLIFF they would be represented as content of inline elements. This is done with the following <withinTextRule>:

<its:withinTextRule selector="//img|//b|//i|//reviewNote" withinText="yes"/> 

The selector means: "any element named 'img' or 'b' or 'i' or 'reviewNote', anywhere in the document". If you know for sure that every element inside a <p> element should be treated as an inline code, you could use the more general selector="//p/descendant::*", which means: "any descendant of any element named 'p', anywhere in the document". Make sure you know your XML format very well before using broad expressions like this, because what you have in your files may be just a subset of the XML format.
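One way to check whether the broad expression is safe for a given file is to compare what the two selectors match. The sketch below does this in Python with the standard-library xml.etree.ElementTree module, under two assumptions: ElementTree has no "|" union operator, so the explicit list is built with one query per element name, and ".//p//*" is used as an approximation of "//p/descendant::*". The sample is a shortened copy of the example document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example document.
SAMPLE = """<doc>
 <p pid="p1">This is the picture of Paul's <b>dopplergänger</b>:
  <img src="pict.png" alt="Paul's dopplergänger"/></p>
 <p pid="p2">It's a cross between <b>olallieberry</b> and <b>Chehalem blackberry</b>.
  <reviewNote author="Moe">Too much details maybe?</reviewNote>
  The olallieberry is a cross between the <i>loganberry</i> and the <i>youngberry</i>.
 </p>
 <drawing ref="http://en.wikipedia.org/wiki/Marionberry">Marionberry</drawing>
</doc>"""

root = ET.fromstring(SAMPLE)

# Explicit list, as in //img|//b|//i|//reviewNote (one query per name).
explicit = set()
for name in ("img", "b", "i", "reviewNote"):
    explicit.update(e.tag for e in root.findall(".//" + name))

# Broad form, roughly //p/descendant::* : every element nested inside a <p>.
broad = {e.tag for e in root.findall(".//p//*")}

print("explicit:", sorted(explicit))
print("broad:   ", sorted(broad))
print("only matched by the broad form:", sorted(broad - explicit))
```

For this sample the two sets are identical. Any name reported in the last line for your own files is an element that the broad selector would silently turn into an inline code.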

At this point you have resolved the two extraction issues and identified the inline codes. Your ITS file should look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
 version="1.0">
 <its:translateRule selector="//reviewNote" translate="no"/>
 <its:translateRule selector="//*/@alt" translate="yes"/>
 <its:withinTextRule selector="//img|//b|//i|//reviewNote" withinText="yes"/>
</its:rules>

Whitespace Handling

The next adjustment is to make sure the filter preserves the formatting of the content in the <drawing> element. Normally in XML, the standard attribute xml:space="preserve" is used to indicate this. But many formats do not use it and rely on the application to know that specific elements should have the whitespace of their content preserved.

ITS does not have a special rule for this, as it also relies on xml:space="preserve". But the ITS Interest Group has defined a set of extensions to ITS 1.0, with the potential to make them part of the standard in a future version. The XML Filter implements some of these extensions. This case is handled by adding the following rule:

<its:translateRule selector="//drawing" translate="yes" itsx:whiteSpaces="preserve"/>

The extension attribute itsx:whiteSpaces="preserve" indicates that the filter should preserve the whitespaces in the content of the nodes selected by the rule. Note that, because whiteSpaces is not part of ITS, it needs to be identified as part of the extension namespace. So you need to make sure xmlns:itsx="http://www.w3.org/2008/12/its-extensions" is declared in the root element of your ITS file.
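The need for the namespace declaration can be seen with any namespace-aware XML parser. The sketch below uses Python's standard-library xml.etree.ElementTree module as a stand-in. It is an illustration only and does not show how the XML Filter itself reports the problem. The helper names wrap and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ITSX_NS = "http://www.w3.org/2008/12/its-extensions"

RULE = ('<its:translateRule selector="//drawing" translate="yes" '
        'itsx:whiteSpaces="preserve"/>')


def wrap(rule, declare_itsx):
    """Put the rule in an its:rules root, with or without xmlns:itsx."""
    itsx = ' xmlns:itsx="%s"' % ITSX_NS if declare_itsx else ""
    return '<its:rules xmlns:its="%s"%s version="1.0">%s</its:rules>' % (
        ITS_NS, itsx, rule)


def is_well_formed(text):
    """Return True if a namespace-aware parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# With the declaration the file parses and the extension attribute is
# visible under its full namespace name.
root = ET.fromstring(wrap(RULE, declare_itsx=True))
value = root[0].get("{%s}whiteSpaces" % ITSX_NS)
print("itsx:whiteSpaces =", value)

# Without the declaration the 'itsx' prefix is unbound and parsing fails.
print("declared:  ", is_well_formed(wrap(RULE, declare_itsx=True)))
print("undeclared:", is_well_formed(wrap(RULE, declare_itsx=False)))
```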

Unique IDs

The last adjustment is to indicate that the value of the pid attribute is the unique ID that identifies a <p> element. There is also an ITS extension for this. It is done by adding the following rule:

<its:translateRule selector="//p" translate="yes" itsx:idValue="@pid"/>

The value of the extension attribute itsx:idValue is an XPath expression, evaluated relative to the selected node, that gives the ID for that node. In this case, itsx:idValue="@pid" is simply the value of the pid attribute.
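The idea of evaluating "@pid" once per selected node can be sketched as follows. This is an illustration in Python with the standard-library xml.etree.ElementTree module, not the filter's own code, and the paragraph texts are shortened placeholders.

```python
import xml.etree.ElementTree as ET

# Same <p> structure as the example document, with shortened text.
SAMPLE = """<doc>
 <p pid="p1">First paragraph.</p>
 <p pid="p2">Second paragraph.</p>
 <p>This paragraph has no pid attribute.</p>
</doc>"""

root = ET.fromstring(SAMPLE)

# For each node selected by //p, read @pid relative to that node.
# Element.get() returns None when the attribute is absent.
ids = [p.get("pid") for p in root.findall(".//p")]

for p, pid in zip(root.findall(".//p"), ids):
    print(repr(pid), "->", p.text)
```

A <p> element without a pid attribute simply yields no ID, so only the paragraphs that carry one can pass it along to their text unit.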

Final Result

You should now have an ITS file that has all the rules needed to process your document.

<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
 version="1.0">
 <its:translateRule selector="//reviewNote" translate="no"/>
 <its:translateRule selector="//*/@alt" translate="yes"/>
 <its:withinTextRule selector="//img|//b|//i|//reviewNote" withinText="yes"/>
 <its:translateRule selector="//drawing" translate="yes" itsx:whiteSpaces="preserve"/>
 <its:translateRule selector="//p" translate="yes" itsx:idValue="@pid"/>
</its:rules>
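Before using the file in Rainbow, a quick check that it is well-formed and contains the rules you expect can save a round trip. The sketch below does this in Python with the standard-library xml.etree.ElementTree module. It only parses the file and lists the rules; it does not validate them against ITS. The XML declaration is left out of the embedded string for simplicity.

```python
import xml.etree.ElementTree as ET

# The rules file from above, without the XML declaration.
ITS_RULES = """<its:rules xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
 version="1.0">
 <its:translateRule selector="//reviewNote" translate="no"/>
 <its:translateRule selector="//*/@alt" translate="yes"/>
 <its:withinTextRule selector="//img|//b|//i|//reviewNote" withinText="yes"/>
 <its:translateRule selector="//drawing" translate="yes" itsx:whiteSpaces="preserve"/>
 <its:translateRule selector="//p" translate="yes" itsx:idValue="@pid"/>
</its:rules>"""

# Raises xml.etree.ElementTree.ParseError if the file is not well-formed.
root = ET.fromstring(ITS_RULES)

# ElementTree stores tags as '{namespace}localName'; keep the local name.
rules = [(rule.tag.split("}", 1)[1], rule.get("selector")) for rule in root]

for name, selector in rules:
    print("%-15s %s" % (name, selector))
```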

With this custom configuration, if you extract the example XML document to XLIFF, you should get something like this:

<?xml version="1.0" encoding="UTF-8"?>
 <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
 <file original="test10.xml" source-language="en-us" target-language="fr-fr" datatype="xml">
  <body>
   <trans-unit id="1">
    <source xml:lang="en-us">Paul's dopplergänger</source>
   </trans-unit>
   <trans-unit id="2" resname="p1">
    <source xml:lang="en-us">This is the picture of Paul's <g id="1">dopplergänger</g>:
 <x id="2"/></source>
   </trans-unit>
   <trans-unit id="3" resname="p2">
    <source xml:lang="en-us">In this photo he is eating marionberry jam. The marionberry
 is pretty good. It's a cross between <g id="1">olallieberry</g> and <g id="2">Chehalem
 blackberry</g>. <g id="3"><x id="4"/></g> The olallieberry is a cross between the
 <g id="5">loganberry</g> and the <g id="6">youngberry</g>. The Chehalem blackberry is a
 cross between the <g id="7">Himalayan blackberry</g> and the
 <g id="8">Santiam berry</g>.</source>
   </trans-unit>
   <trans-unit id="4">
    <source xml:lang="en-us">This figure shows the "heritage" of the marionberry:</source>
   </trans-unit>
   <trans-unit id="5" xml:space="preserve">
    <source xml:lang="en-us">
                           Raspberry   Blackberry   Dewberry
                               ^         ^   ^         ^
                               |         |   |         |
                               +----+----+   +----+----+
                                    |             |
       Pacific blackberry       Loganberry    Youngberry
                ^                  ^  ^           ^
                |                  |  |           |
                +---------+--------+  +-----+-----+
                          |                 |
Himalayan berry     Santiam berry           |
        ^                 ^                 |
        |                 |                 |
        +--------+--------+                 |
                 |                          |
          Chehlem blackberry          Olallieberry
                    ^                       ^
                    |                       |
                    +-----------+-----------+
                                |
                          Marionberry
    </source>
   </trans-unit>
   <trans-unit id="6">
    <source xml:lang="en-us">This simply means that the marionberry is a variety
 of blackberry!</source>
   </trans-unit>
  </body>
 </file>
</xliff>

List of Elements

Note that you can get a list of all the elements in a set of XML documents using the XML Analysis Step.

The utility also tries to guess which elements should be treated as "elements within text" (inline) and which ones are structural.