Okapi Framework - User contributions [en]

SRX

2026-01-10T08:24:08Z

Translate5Support:

{{Standards Common Menu}}
__TOC__
==Overview==

The SRX (Segmentation Rules eXchange) format is a standard for storing segmentation rules in a file so that they can be exchanged between different tools.

It was originally maintained by the OSCAR special interest group of the Localisation Industry Standards Association (LISA). In March 2011, LISA closed and its standards were placed under a Creative Commons license.

Version 2.0 is the latest version of the specification and can be found here: http://www.gala-global.org/oscarStandards/srx/srx20.html

SRX rules are grouped into named sets that are activated based on the language code of the text to process. Each rule defines the text parts before and after the inter-segment location, and specifies whether the location should be a break or not. The text parts are defined using [[Regular Expressions|regular expressions]].

Example of simple SRX rules:

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>
</header>
<body>
<languagerules>
<languagerule languagerulename="default">
<rule break="no">
<beforebreak>([A-Z]\.){2,}</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
</languagerule>
</languagerules>
<maprules>
<languagemap languagepattern=".*" languagerulename="default"></languagemap>
</maprules>
</body>
</srx></pre>

In this example, there are two rules.

The second rule specifies that when an inter-character location is preceded by a period and followed by a whitespace character, the rule is to break at that position.

The first rule specifies that when an inter-character location is preceded by the pattern <code>([A-Z]\.){2,}</code> and followed by a whitespace character, the rule is to not break at that position. Because the first rule is placed before the second rule, it takes precedence.

So, based on those rules, the following text:

I'm in the U.K. for now. But I plan to move to Papua New Guinea.

will break down into two segments:

[I'm in the U.K. for now.]
[ But I plan to move to Papua New Guinea.]

If the first rule were not there, it would break down into three segments:

[I'm in the U.K.]
[ for now.]
[ But I plan to move to Papua New Guinea.]
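The behavior described above can be reproduced with a small sketch. It is written in Python rather than with an actual SRX engine, and the names <code>RULES</code> and <code>segment</code> are hypothetical. It assumes the simple reading of the rules given in this section: at each inter-character position, the first rule whose two patterns match decides whether to break.

```python
import re

TEXT = "I'm in the U.K. for now. But I plan to move to Papua New Guinea."

# Rules in document order: (is_break, beforebreak, afterbreak).
RULES = [
    (False, r"([A-Z]\.){2,}", r"\s"),  # no-break rule (first rule)
    (True, r"\.", r"\s"),              # break rule (second rule)
]

def segment(text, rules):
    """Split text at each position whose first matching rule is a break rule."""
    compiled = [
        # The beforebreak pattern must match text that ends exactly at the position.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    segments.append(text[start:])
    return segments

print(segment(TEXT, RULES))      # both rules
print(segment(TEXT, RULES[1:]))  # break rule only
```

Calling <code>segment</code> with both rules yields the two segments listed above; dropping the first rule yields the three segments.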

==SRX Versions Issue==

There are two versions of SRX: 1.0 and 2.0.

SRX version 1.0 was implemented by several tools that interpreted the processing of SRX rules in different ways. As a result, the same SRX 1.0 document used in different tools may give you different segmentation.

To resolve this issue, an updated version 2.0 specification was published that provides better implementation guidelines. So, in theory, the same version 2.0 document should give you the same segmentation in all tools.

You can find the SRX specifications at the following locations:

* SRX 1.0: http://www.gala-global.org/oscarStandards/srx/srx10.html
* SRX 2.0: http://www.gala-global.org/oscarStandards/srx/srx20.html

===Implementation Differences for SRX 1.0===

There are two main types of implementations of SRX 1.0: the intended one, and one that uses a cascading matching of the language maps.

Tools like SDLX implemented the intended SRX 1.0 behavior (non-cascading). Others, like Swordfish, implemented SRX 1.0 with a cascading behavior.

In an SRX document, the segmentation rules are grouped into several <code><languagerule></code> elements. This way you can define different sets of rules that you apply to different languages. The selection of which group of rules to use for a given language is driven by a table defined in the <code><maprules></code> element. Each entry in <code><maprules></code> is a <code><languagemap></code>. This entry has two pieces of information: a regular expression pattern that determines which language codes should use the entry, and a pointer to the group of rules for this entry.

<pre><languagerules>
<languagerule languagerulename='default'>
</languagerule>
<languagerule languagerulename='japanese'>
</languagerule>
</languagerules>

<maprules>
<languagemap languagepattern='ja.*' languagerulename='japanese'/>
<languagemap languagepattern='.*' languagerulename='default'/>
</maprules></pre>

The difference between the SRX 1.0 implementations is how they look up the <code><maprules></code> for a given language code.

# Some will use only the first <code><languagemap></code> that has a <code>languagepattern</code> matching the language code.
# Others will use all <code><languagemap></code> elements that have a <code>languagepattern</code> matching the language code.

The first interpretation is the correct one: In SRX 1.0 you use only the first <code><languagemap></code> that matches the given language code.

It is true that there is nothing in the SRX 1.0 specification that says explicitly it should work that way. But there is also nothing explicitly (or implicitly) that says all matching <code><languagemap></code> should be used.
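The two lookup strategies can be contrasted with a minimal Python sketch. The function name <code>select_rule_groups</code> and the <code>cascade</code> flag are hypothetical, and the sketch assumes that a <code>languagepattern</code> must match the whole language code.

```python
import re

# (languagepattern, languagerulename) pairs, in document order.
LANGUAGE_MAPS = [
    ("ja.*", "japanese"),
    (".*", "default"),
]

def select_rule_groups(language_code, language_maps, cascade):
    """Return the names of the rule groups to apply for a language code.

    cascade=False: only the first matching <languagemap> is used.
    cascade=True: every matching <languagemap> is used, in order.
    """
    matches = [
        name
        for pattern, name in language_maps
        # Assumption: the pattern has to match the full language code.
        if re.fullmatch(pattern, language_code)
    ]
    return matches if cascade else matches[:1]

print(select_rule_groups("ja-JP", LANGUAGE_MAPS, cascade=False))
print(select_rule_groups("ja-JP", LANGUAGE_MAPS, cascade=True))
```

For a Japanese language code, the non-cascading lookup returns only the <code>japanese</code> group, while the cascading lookup returns both groups. For any other code, both strategies return only <code>default</code>.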

The clue to the intended behavior is in the example of the SRX 1.0 specification:

<pre><languagerules>
 <languagerule languagerulename="Default">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak>\s[a-z]</afterbreak>
  </rule>
  ...
 </languagerule>
 <languagerule languagerulename="Japanese">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  <rule break="yes">
   <beforebreak>[\xff61\x3002\xff0e\xff1f\xff01]+</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  ...
 </languagerule>
</languagerules>

<maprules>
 <maprule maprulename="Default">
  <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
  <languagemap languagepattern=".*" languagerulename="Default"/>
 </maprule>
</maprules></pre>

In this example, the same rules are defined in both the Default and the Japanese groups. If SRX 1.0 intended to use all the <code><languagemap></code> elements that match the given language code, there would be no point in duplicating rules in the Japanese group: it would contain only the extra Japanese-specific rules.

===How to Convert From SRX 1.0 to SRX 2.0?===

The SRX 2.0 specification resolves the cascading issue by making it an option.

When loading or importing SRX 1.0 documents into an SRX 2.0 editor, you must be careful to set the cascade option properly, depending on the provenance of the document.

* SRX 1.0 rules coming from Trados, SDLX, and some other tools follow the normal SRX 1.0 behavior (no cascading), so you should make sure the cascade option is not set after you open the file.

* SRX 1.0 rules coming from Heartsome, Swordfish, and some other tools are designed with cascading, so you should make sure the cascade option is set after you have opened the file.

==SRX and Java==

The SRX standard uses ICU regular expressions. However, it is very difficult to implement the same set of expressions using Java and some other programming languages.

See more details in the [[SRX and Java]] section.

==SRX in the Okapi Framework==

The Okapi framework uses SRX in many places. For example:

* [[Ratel]] is an application to edit SRX rules in WYSIWYG mode.
* Steps like the [[Segmentation Step]], the [[Sentence Alignment Step]] or the [[Batch Translation Step]] use SRX rules.

'''Note that the framework implements a [[SRX Extensions|few extensions to SRX]].'''

==Hint: Knowing when a no-break rule will match==

Like any regular expression, a no-break rule matches a number of characters in a string. For example, the pattern <code>\s+</code> matches all the whitespace characters in the string: <pre>" "</pre>

Whether a no-break rule overrides a break rule that follows further down in the SRX depends on two conditions:

* it matches the part of the string that should not break
* AND the position at which it applies (the point between its <code><beforebreak></code> match and its <code><afterbreak></code> match) is the same as, or earlier than, the position at which the break rule applies

Therefore, the following SRX will NOT prevent splitting the sentence after "Co.":

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20"
xmlns:okpsrx="http://okapi.sf.net/srx-extensions"
version="2.0">

<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>

<okpsrx:options oneSegmentIncludesAll="no"
trimLeadingWhitespaces="yes"
trimTrailingWhitespaces="yes"
useJavaRegex="yes"
useIcu4JBreakRules="no"
treatIsolatedCodesAsWhitespace="no">
</okpsrx:options>

<okpsrx:sample language="de" useMappedRules="yes">
Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.
</okpsrx:sample>

<okpsrx:rangeRule></okpsrx:rangeRule>
</header>

<body>
<languagerules>
<languagerule languagerulename="German">

<rule break="no">
<beforebreak>\bCo\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>


<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>

</languagerule>
</languagerules>

<maprules>
<languagemap languagepattern="(DE|de).*"
languagerulename="German">
</languagemap>
</maprules>
</body>

</srx></pre>

However, this one will prevent the break, even though both no-break rules match the relevant part "Co." in the sentence:

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20"
xmlns:okpsrx="http://okapi.sf.net/srx-extensions"
version="2.0">

<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>

<okpsrx:options oneSegmentIncludesAll="no"
trimLeadingWhitespaces="yes"
trimTrailingWhitespaces="yes"
useJavaRegex="yes"
useIcu4JBreakRules="no"
treatIsolatedCodesAsWhitespace="no">
</okpsrx:options>

<okpsrx:sample language="de" useMappedRules="yes">
Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.
</okpsrx:sample>

<okpsrx:rangeRule></okpsrx:rangeRule>
</header>

<body>
<languagerules>
<languagerule languagerulename="German">

<rule break="no">
<beforebreak>\bCo\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>


<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>

</languagerule>
</languagerules>

<maprules>
<languagemap languagepattern="(DE|de).*"
languagerulename="German">
</languagemap>
</maprules>
</body>

</srx>
</pre>
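The difference between the two no-break rules can be checked by computing the position at which each rule applies. The following is a minimal sketch in Python, not the Okapi implementation: the helper <code>rule_positions</code> is hypothetical, and it uses Python's <code>re</code> module instead of Java regular expressions (the patterns in this example are simple enough to behave the same way).

```python
import re

SAMPLE = "Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool."

def rule_positions(text, before, after):
    """Return every inter-character position where a rule applies.

    A rule applies at a position when 'before' matches text ending exactly
    there and 'after' matches text starting exactly there.
    """
    before_re = re.compile("(?:" + before + r")\Z")
    after_re = re.compile(after)
    return [
        pos
        for pos in range(len(text) + 1)
        if before_re.search(text[:pos]) and after_re.match(text, pos)
    ]

break_rule = rule_positions(SAMPLE, r"\.", r"\s")        # break rule
no_break_1 = rule_positions(SAMPLE, r"\bCo\.\s", r"")    # first SRX document
no_break_2 = rule_positions(SAMPLE, r"\bCo\.", r"\s")    # second SRX document

print(break_rule, no_break_1, no_break_2)
```

In the first document, the no-break rule applies one character after the position of the break rule (after the space rather than after the period), so it does not cover the break position. In the second document, both rules apply at the same position, so the no-break rule, which comes first, wins.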

[[Category:Segmentation]] [[Category:SRX]]

SRX

2026-01-10T08:17:00Z

Translate5Support:

{{Standards Common Menu}}
__TOC__
==Overview==

The SRX (Segmentation Rules eXchange) format is a standard to save segmentation rules in a file so they can be used between different tools.

It originally maintained by the OSCAR special interest group of the Localisation Industry Standards Association (LISA). In March 2011 LISA was closed and its standards moved under Creative Commons license.

The version 2.0 is the latest version of the specification and can be found here: http://www.gala-global.org/oscarStandards/srx/srx20.html.

SRX rules are grouped into named sets that are activated based the code of the language of the text to process. Each rule defines the text parts before and after the inter-segment location, and specifies if the location should be a break or not. The text parts are defined using [[Regular Expressions|regular expressions]].

Example of SRX simple rules:

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>
</header>
<body>
<languagerules>
<languagerule languagerulename="default">
<rule break="no">
<beforebreak>([A-Z]\.){2,}</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
</languagerule>
</languagerules>
<maprules>
<languagemap languagepattern=".*" languagerulename="default"></languagemap>
</maprules>
</body>
</srx></pre>

In this example, there are two rules.

The second one specifies that when an inter-character location is preceded by a period and followed by a white space, the rule is to break at that position.

The first rule specifies that when an inter-character location is preceded by the patter <code>([A-Z]\.){2,}</code> and followed by a white paces, the rule is to not break at that position. Because the first rule is placed before the second rule it takes precedence.

So, based on those rules, the following text:

I'm in the U.K. for now. But I plan to move to Papua New Guinea.

will break down into two segments:

[I'm in the U.K. for now.]
[ But I plan to move to Papua New Guinea.]

If the first rule was not there, it would break down into three segments:

[I'm in the U.K.]
[ for now.]
[ But I plan to move to Papua New Guinea.]

==SRX Versions Issue==

There are two versions of SRX: 1.0 and 2.0.

SRX version 1.0 has been implemented by several tools that interpreted how to process the SRX rules in different ways. As a result the same SRX 1.0 document used on different tools may give you different segmentation.

To resolve this issue, an updated version 2.0 specification has been published and provides better implementation guidelines. So, in theory, the same version 2.0 document should give you the same segmentation in all tools.

You can find the specifications of SRX on the LISA web site:

* SRX 1.0: http://www.gala-global.org/oscarStandards/srx/srx10.html
* SRX 2.0: http://www.gala-global.org/oscarStandards/srx/srx20.html

===Implementation Differences for SRX 1.0===

There are two main types of implementations of SRX 1.0: the intended one, and one that use a cascading matching of the language maps.

Tools like SDLX implemented the intended SRX 1.0 behavior (non-cascading). Others, like Swordfish implemented SRX 1.0 with a cascading behavior.

In an SRX document, the segmentation rules are grouped into several <code><languagerule></code> elements. This way you can define different sets of rules that you apply for different languages. The select of which group of rules is to use for a given language is driven by a table defined in the <maprules> element. Each entry in <code><maprules></code> is a <code><languagemap></code>. This entry has two information: a regular expression pattern that corresponds to what language code should use the entry, and a pointer to the group of rules for this entry.

<pre><languagerules>
<languagerule languagerulename='default'>
</languagerule>
<languagerule languagerulename='japanese'>
</languagerule>
<languagerules>

<maprules>
<languagemap languagepattern='ja.*' languagerulename='japanese'/>
<languagemap languagepattern='.*' languagerulename='default'/>
</maprules></pre>

The difference between the SRX 1.0 implementations is how they lookup the <code><maprules></code> for a given language code.

# Some will use only the first <code><languagemap></code> that has a languagepattern matching the language code.
# Other will use all <code><languagemap></code> that have a <code>languagepattern</code> matching the language code.

The first interpretation is the correct one: In SRX 1.0 you use the only first <code><languagemap></code> that matches the given language code.

It is true that there is nothing in the SRX 1.0 specification that says explicitly it should work that way. But there is also nothing explicitly (or implicitly) that says all matching <code><languagemap></code> should be used.

The clue to the intended behavior is in the example of the SRX 1.0 specification:

<languagerules>
<languagerule languagerulename="Default">
 <rule break="no">
<beforebreak>^\s*[0-9]+\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
 <rule break="no">
<beforebreak>[Ee][Tt][Cc]\.</beforebreak>
<afterbreak>\s[a-z]</afterbreak>
</rule>
...
</languagerule>
<languagerule languagerulename="Japanese">
 <rule break="no">
<beforebreak>^\s*[0-9]+\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
 <rule break="no">
<beforebreak>[Ee][Tt][Cc]\.</beforebreak>
<afterbreak></afterbreak>
</rule>
<rule break="yes">
<beforebreak>[\xff61\x3002\xff0e\xff1f\xff01]+</beforebreak>
<afterbreak></afterbreak>
</rule>
...
</languagerule>
</languagerules>

<maprules>
<maprule maprulename="Default">
<languagemap languagepattern="JA.*" languagerulename="Japanese"/>
<languagemap languagepattern=".*" languagerulename="Default"/>
</maprule>
</maprules>

In this example, there is the same rules defined in both the Default and the Japanese groups. If SRX 1.0 intended to use all the <code><languagemap></code> elements that match the given language code, there would be no point to have duplicated rules in Japanese. The Japanese group would have only the extra Japanese-specific rules.

===How to Convert From SRX 1.0 to SRX 2.0?===

The SRX 2.0 specification resolve the cascading issue by making it an option.

When loading or importing SRX 1.0 documents into an SRX 2.0 editor, you must be careful about setting properly the cascade option depending on the provenance of the document.

* SRX 1.0 rules coming from Trados, SDLX and some other tools that implement the normal SRX 1.0 behavior (no cascading). So you should make sure that option is not set after you open the file.

* SRX 1.0 rules coming from Heartsome, Swordfish, and some other tools that are designed with cascading. So you should make sure that option is set after you have open the file.

==SRX and Java==

The SRX standard uses ICU regular expressions, however it is very difficult to implement the same set of expression using Java and some other programing languages.

See more details in the [[SRX and Java]] section.

==SRX in the Okapi Framework==

The Okapi framework uses SRX in many places. For example:

* [[Ratel]] is an application to edit SRX rules in WYSIWYG mode.
* Steps like the [[Segmentation Step]], the [[Sentence Alignment Step]] or the [[Batch Translation Step]] use SRX rules.

'''Note that the framework implements a [[SRX Extensions|few extensions to SRX]].'''

==Hint: When a no-break rule will match and when not==

A no-break rule can (as any regex) matches a number of characters in a string. Like \s+ does match all whitespaces within this string " ".

If a no-break rule takes effect to overwrite a break-rule that follows further down in the srx is defined by

* if it matches the part of the string, that should not break
* AND if the position where it matches THE LAST is the same position or an earlier position than the last position, where the break-rule matches

Therefore the following srx will NOT lead to prevent the splitting of the sentence after "Co."

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20"
xmlns:okpsrx="http://okapi.sf.net/srx-extensions"
version="2.0">

<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>

<okpsrx:options oneSegmentIncludesAll="no"
trimLeadingWhitespaces="yes"
trimTrailingWhitespaces="yes"
useJavaRegex="yes"
useIcu4JBreakRules="no"
treatIsolatedCodesAsWhitespace="no">
</okpsrx:options>

<okpsrx:sample language="de" useMappedRules="yes">
Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.
</okpsrx:sample>

<okpsrx:rangeRule></okpsrx:rangeRule>
</header>

<body>
<languagerules>
<languagerule languagerulename="German">

<rule break="no">
<beforebreak>\bCo\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>


<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>

</languagerule>
</languagerules>

<maprules>
<languagemap languagepattern="(DE|de).*"
languagerulename="German">
</languagemap>
</maprules>
</body>

</srx></pre>

But this one will prevent the break:

<pre><?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20"
xmlns:okpsrx="http://okapi.sf.net/srx-extensions"
version="2.0">

<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>

<okpsrx:options oneSegmentIncludesAll="no"
trimLeadingWhitespaces="yes"
trimTrailingWhitespaces="yes"
useJavaRegex="yes"
useIcu4JBreakRules="no"
treatIsolatedCodesAsWhitespace="no">
</okpsrx:options>

<okpsrx:sample language="de" useMappedRules="yes">
Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.
</okpsrx:sample>

<okpsrx:rangeRule></okpsrx:rangeRule>
</header>

<body>
<languagerules>
<languagerule languagerulename="German">

<rule break="no">
<beforebreak>\bCo\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>


<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>

</languagerule>
</languagerules>

<maprules>
<languagemap languagepattern="(DE|de).*"
languagerulename="German">
</languagemap>
</maprules>
</body>

</srx>
</pre>
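Both no-break rules match the relevant part "Co." in the sentence. They differ only in where the trailing white space is placed: in the first file it is part of <code>beforebreak</code>, so the rule applies to the position after the space; in the second file it is part of <code>afterbreak</code>, so the rule applies to the position between the period and the space, which is the position where the break rule would otherwise apply.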
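
The difference between the two rule sets can be reproduced with a small model of SRX matching. The following is only a sketch: it uses Python's <code>re</code> module instead of Java regular expressions, the function name <code>find_breaks</code> is hypothetical, and it assumes that rules are tested in order at every inter-character position, with the first matching rule deciding. It is not the Okapi implementation.

```python
import re


def find_breaks(text, rules):
    """Return the inter-character positions where a break rule wins.

    rules is an ordered list of (is_break, before_pattern, after_pattern).
    """
    breaks = []
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # beforebreak must match text ending exactly at pos,
            # afterbreak must match text starting exactly at pos.
            before_ok = re.search("(?:" + before + r")\Z", text[:pos]) is not None
            after_ok = re.match(after, text[pos:]) is not None
            if before_ok and after_ok:
                if is_break:
                    breaks.append(pos)
                break  # the first matching rule decides for this position
    return breaks


SAMPLE = "Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool."
# Position between the period of "Co." and the following space.
AFTER_PERIOD = SAMPLE.index("Co.") + len("Co.")

# First file: the white space is part of beforebreak.
RULES_SPACE_IN_BEFORE = [(False, r"\bCo\.\s", ""), (True, r"\.", r"\s")]
# Second file: the white space is part of afterbreak.
RULES_SPACE_IN_AFTER = [(False, r"\bCo\.", r"\s"), (True, r"\.", r"\s")]

print(find_breaks(SAMPLE, RULES_SPACE_IN_BEFORE) == [AFTER_PERIOD])
print(find_breaks(SAMPLE, RULES_SPACE_IN_AFTER) == [])
```

In this model the first rule set still breaks after "Co." because its no-break rule only matches one position later, after the space.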

[[Category:Segmentation]] [[Category:SRX]]

JSON Filter

2025-12-18T17:37:16Z

Translate5Support:

{{Filters Header}}
==Overview==

The JSON Filter is an Okapi component that implements the IFilter interface for JSON (JavaScript Object Notation).

The implementation is based on the JSON specifications: http://www.json.org/

The following is an example of a very simple JSON file. The translatable text is highlighted:

{"menu": {
"value": "File",
"popup": {
"menuitem": [
{"value": "New"},
{"value": "Open"},
{"value": "Close"}
]
}
}}

==Processing Details==

===Input Encoding===

JSON files are normally in one of the Unicode encodings, but the filter supports any encoding. It decides which encoding to use for the input file using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Else, if a header entry with a <code>charset</code> declaration exists in the first 1000 characters of the file:
** If the value of the <code>charset</code> is "<code>charset</code>" (case insensitive):
*** Then the file is likely to be a template with no encoding declared, so the current encoding (auto-detected or default) is used.
** Else, the declared encoding is used. Note that if the encoding has been detected from a Byte-Order-Mark and the encoding declared in the header entry does not match, a warning is generated and the encoding of the Byte-Order-Mark is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
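
The first and last steps of this logic can be sketched as follows. This is an illustration only, not the filter's code: the function name <code>pick_input_encoding</code> is hypothetical, and the <code>charset</code> branch is left out.

```python
import codecs

# Longer BOMs are tested first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def pick_input_encoding(data: bytes, default_encoding: str) -> str:
    """Return the encoding given by a Byte-Order-Mark, else the default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM found: fall back to the encoding set in the filter options.
    return default_encoding


print(pick_input_encoding(codecs.BOM_UTF8 + b'{"a": "b"}', "windows-1252"))
print(pick_input_encoding(b'{"a": "b"}', "windows-1252"))
```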

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

===Line-Breaks===

The type of line-breaks of the output is the same as the one of the original input.

===Comments===

Though not technically legal in JSON, these comment types are supported:
<code>
* // comment
* # comment
* /* comment */
* 
</code>

==Parameters==

=== Options Tab===

====Stand-alone strings====

<cite>Extract strings without associated key</cite> — Set this option to extract strings that are not directly associated with a key.

====Strings with keys====

<cite>Extract all key/strings pairs</cite> — Set this option to extract all strings that have a key associated. If a regular expression for exceptions is defined, the strings that have a key matching the expression are not extracted.

<cite>Do not extract key/string pairs</cite> — Set this option to not extract any string that has an associated key. If a regular expression for exceptions is defined, the strings that have a key matching the expression are extracted.

<cite>Excepted when the key matches the following regular expression</cite> — Enter a regular expression that corresponds to the keys whose behavior should be the inverse of the default behavior you have selected for the key/string pairs.
For example, when all key/string pairs are extracted by default, the expression <code>key</code> excludes the string of a key named <code>key</code>.
In combination with <code>Use the full key path</code>, the expression <code>^.*?/excludedStructure/.*</code> excludes all nested elements of a JSON structure.
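
The effect of such an exception expression on full key paths can be tried out with a short sketch. The function name <code>is_extracted</code> and the sample paths are hypothetical, and the sketch assumes that the expression is searched for in the key path; how the filter applies the expression internally is not stated here.

```python
import re

# Exception expression from the text above, applied to full key paths.
EXCEPTION = re.compile(r"^.*?/excludedStructure/.*")


def is_extracted(key_path: str) -> bool:
    """Model of 'Extract all key/strings pairs' with an exception regex:
    a string is extracted unless its key path matches the exception."""
    return EXCEPTION.search(key_path) is None


for path in ("/menu/popup/menuitem/value", "/menu/excludedStructure/title"):
    print(path, is_extracted(path))
```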

<cite>Use the key as the resname</cite> — Set this option to use the value of the key as the value of the name of the extracted item (<code>resname</code> in XLIFF).

<cite>Use the full key path</cite> — Set this option to use the full key path in the <code>resname</code>. For example: <code>/menu/value/popup/menuitem/value</code>. The <cite>Use the key as the resname</cite> option must be set for this option to take effect. If enabled, exception regular expressions apply to the full path.

<cite>Include leading "/" on key path</cite> — Set this option to have a leading character '/' in the full key path.

<cite>Regex matching keys that are notes, values of which to appear as <note> in XLIFF</cite> — Specify regular expression. The values of the matching keys will be transferred to <note> elements in XLIFF.

<cite>Regex matching keys who's values are added as TextUnit Metadata</cite> — Specify regular expression. The values of the matching keys will be written out as <context-group> elements in XLIFF.

===New Extraction Rules >= version M39===
If specified, these rules override the corresponding rules above.

<cite>Regex matching keys who's values are extracted (overrides extraction exceptions)</cite>

<cite>Regex matching keys that are notes, values of which to appear as <note> in XLIFF</cite>

<cite>Regex matching keys which are ID's (resname in XLIFF), overrides "use key as resname"</cite>

Hint: Suppose your JSON contains the actual key as the value of a neighboring key/value pair:
<pre>
[
{
"key": "datePicker_marchMonth",
"text": "March"
},
{
"key": "datePicker_aprilMonth",
"text": "April"
}
]
</pre>
If you simply define the regex <code>key</code> in this configuration option, the following XLIFF is extracted:
<pre>
<trans-unit id="tu1" resname="datePicker_marchMonth" xml:space="preserve">
<source xml:lang="en-US">March</source>
<target xml:lang="de-DE"></target>
</trans-unit>
<trans-unit id="tu2" resname="datePicker_aprilMonth" xml:space="preserve">
<source xml:lang="en-US">April</source>
<target xml:lang="de-DE"></target>
</trans-unit>
</pre>
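
The pairing shown in this hint can be modelled in a few lines. This is a simplified sketch, not the filter's algorithm: the function name <code>extract_with_id_rule</code> is hypothetical, and it assumes a flat array of objects and an ID rule that must match the whole key.

```python
import json
import re

SAMPLE_JSON = """
[
  {"key": "datePicker_marchMonth", "text": "March"},
  {"key": "datePicker_aprilMonth", "text": "April"}
]
"""


def extract_with_id_rule(json_text, id_rule):
    """Return (resname, text) pairs for a flat array of JSON objects."""
    id_pattern = re.compile(id_rule)
    units = []
    for entry in json.loads(json_text):
        # The value of the key matching the ID rule becomes the resname.
        resname = next(
            (v for k, v in entry.items() if id_pattern.fullmatch(k)), None
        )
        # The remaining string values become the extracted text.
        for key, value in entry.items():
            if not id_pattern.fullmatch(key) and isinstance(value, str):
                units.append((resname, value))
    return units


for resname, text in extract_with_id_rule(SAMPLE_JSON, "key"):
    print(resname, "->", text)
```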

<cite>Regex matching keys who's values are added as TextUnit Metadata</cite>

<cite>Regex matching keys that are numbers, values of which will be extracted as maxwidth property in XLIFF</cite>

<ul>
<li>If specified, the extracted value is used as the maxwidth of all other elements of the array on that level.</li>
<li>Only one array element matching the regex is allowed on each hierarchy level.</li>
<li>If there are nested array levels, different maxwidth values can be defined for parent and child levels and for different sibling levels.</li>
<li>Even if different values are defined, the key can be the same in all definitions.</li>
<li>If no key on a sublevel matches the regex but a key on a higher level does, the definition on the higher level determines the maximum length for the deepest hierarchy level (and only that level, not the levels above it) and for its siblings without a matching key.</li>
<li>If a key matches on a higher level and keys also match on all lower levels, then values are extracted for all elements of the corresponding levels. However, on the higher level(s) the matching key must be defined after the last child element.</li>
</ul>

<cite>The size unit property to use when maxwidth properties are extracted</cite>
The string entered here is used as the value of the <code>size-unit</code> attribute of each <code>trans-unit</code> that has a length restriction in XLIFF.

====Example FPRM Settings:====
Regex rules apply to key names.

'''
extraction rules (use instead of rule exceptions):
extractionRules=/widgets/body.*

note rules (add values to TextUnits as notes):
noteRules=/widgets/name.*

id rules (overrides useKeyAsName):
idRules=/widgets/id.*

generic metadata (matched key:values are added as metadata to TextUnit):
genericMetaRules=/widgets/image.*'''

===Content Processing Tab===

<cite>Process text content with this sub-filter</cite> — Specify an Okapi filter ID (e.g. <code>okf_html</code>) to process the content of all translatable text with that filter. Leave this field blank for default behavior.

<cite>Find inline codes by patterns defined below</cite> — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code.

'''Note:''' This option cannot be used together with the sub-filtering option.

By default the expression is:

((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)
|(\{\d.*?\}))
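
To see what the default expression captures, it can be run against a sample string. The sketch below uses Python's <code>re</code> module, which accepts this particular expression unchanged; the function name <code>find_inline_codes</code> and the sample text are hypothetical.

```python
import re

# The default expression from above, joined into a single pattern.
DEFAULT_CODE_FINDER = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)"
    r"|(\{\d.*?\}))"
)


def find_inline_codes(text):
    """Return the spans of text that would be treated as inline codes."""
    return [match.group(0) for match in DEFAULT_CODE_FINDER.finditer(text)]


# A raw string: the final \n is a literal backslash followed by "n".
sample = r"Copied %d of %2$s files to {0}\n"
print(find_inline_codes(sample))
```

The three alternatives cover printf-style placeholders, escaped control sequences written as text, and numbered placeholders in braces.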

{{CodeFinder Help}}

==Limitations==

Comments within a JSON string are parsed as part of the string content, not as comments. A configured subfilter will then process these as true comments (they will become part of the skeleton or whatever the filter is configured to do).
[[Category:Filters]]

XML Filter

2024-03-28T21:23:47Z

Translate5Support: /* Filter Options */

{{Filters Header}}
==Overview==

This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement [[ITS]]. If you need to process very large XML documents and have no need for ITS, you may want to look at using the [[XML Stream Filter]].

The following is an example of a simple XML document. The translatable text is highlighted. Because each format based on XML is different, you need information on what are the translatable parts, what are the inline elements, etc. The XML Filter [[#ITS Support|implements the ITS W3C Recommendation]] to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
<prolog>
<author>Zebulon Fairfield</author>
<version>version 12, revision 2 - 2006-08-14</version>
<keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
<storageKey>articles-6D272BA9-3B89CAD8</storageKey>
</prolog>
<body>
<title>Appaloosa</title>
The Appaloosas are rugged horses originally bred by
the <kw>Nez-Perce</kw> tribe in the US Northwest.
They are often characterized by their spotted coats.
</body>
</myDoc>

This filter is implemented in the class <code>net.sf.okapi.filters.xml.XMLFilter</code> of the library.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; if it did not, one is automatically added.

===Line-Breaks===

The type of line-breaks of the output is the same as the one of the original input.

==Parameters==

This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "[[How to Create a Custom Configuration for the XML Filter]]".

===ITS Support===

By default the filter processes XML documents based on the '''ITS defaults'''. That is:

* the content of all elements is translatable,
* and none of the attribute values are translatable.

Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is [[ITS|an ITS document]].

The '''Internationalization Tag Set (ITS)''' is a W3C recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as nested sub-flows of text, and much more.

The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0).

* The ITS 1.0 specification is available at http://www.w3.org/TR/its/.
* The ITS 2.0 specification is available at http://www.w3.org/TR/its20/.

See the "[[ITS]]" page for more details on the format.

The filter supports global and local rules and most data categories. See the '''[[ITS Components]]''' page for a detailed list of how the data categories are supported and other information on the implementation.

===ITS Extensions===

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

* [[#idValue and xml:id|idValue and xml:id]]
* [[#whiteSpaces|whiteSpaces]]

====idValue and xml:id====

{{NoteBox|This extension was defined for ITS 1.0, ITS 2.0 offers the new [http://www.w3.org/TR/its20/#idvalue Id Value] data category that should be used instead of this extension.}}

When the attribute <code>xml:id</code> is found on a translatable element, it is used as the name of the text unit generated for that element.

For example, in the example below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>".

<p xml:id="id1">Text</p>

The attribute <code>idValue</code> used in the ITS <code>translateRule</code> element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of <code>idValue</code> must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

For example, in the example below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>":

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p name="id1">text 1</p>
</doc></pre>

Note that <code>xml:id</code> has precedence over the <code>idValue</code> declaration. For example, in the example below, the resource name associated with the text unit for the <code><p></code> element is "<code>xid1</code>", not "<code>id1</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p xml:id="xid1" name="id1">text 1</p>
</doc></pre>

You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <code><text></code> and <code><desc></code> are translatable, but they have only one corresponding ID, the <code>name</code> attribute in their parent element. To make sure you have a unique identifier for both the content of <code><text></code> and the content of <code><desc></code>, you can use the rules set in the example. The XPath expression "<code>concat(../@name, '_t')</code>" will give the ID "<code>id1_t</code>" and the expression "<code>concat(../@name, '_d')</code>" will give the ID "<code>id1_d</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
<its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
</its:rules>
<msg name="id1">
<text>Value of text</text>
<desc>Value of desc</desc>
</msg>
</doc></pre>
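
The identifiers produced by the two <code>concat()</code> expressions can be checked with a short sketch. Python's <code>xml.etree.ElementTree</code> does not evaluate XPath functions such as <code>concat()</code>, so the concatenation is emulated by hand; the function name <code>resource_names</code> and the suffix table are hypothetical, and elements whose parent has no <code>name</code> attribute are simply skipped here.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>"""

# Suffix appended to the parent's name attribute, per element name,
# mirroring concat(../@name, '_t') and concat(../@name, '_d').
SUFFIXES = {"text": "_t", "desc": "_d"}


def resource_names(xml_text):
    """Map each generated identifier to the content of its element."""
    root = ET.fromstring(xml_text)
    names = {}
    for parent in root.iter():
        for child in parent:
            if child.tag in SUFFIXES and "name" in parent.attrib:
                names[parent.get("name") + SUFFIXES[child.tag]] = child.text
    return names


print(resource_names(DOC))
```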

====whiteSpaces====

{{NoteBox|This extension was defined for ITS 1.0, ITS 2.0 offers the new [http://www.w3.org/TR/its20/#preservespace Preserve Space] data category that should be used instead of this extension.}}

The extension attribute <code>whiteSpaces</code> allows you to globally apply the equivalent of a local <code>xml:space</code> attribute.

For example, if you have a format where all <code><pre></code> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute <code>whiteSpaces="preserve"</code> in a <code><its:translateRule></code> element for the <code><pre></code> elements. In the example below, the spaces in the <code><pre></code> element will be preserved on extraction.

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre>Some txt with many spaces. </pre>
</doc>

Note that the <code>xml:space</code> attribute has precedence over <code>whiteSpaces</code>. For example, in the following example, the white spaces in the content of <code><pre></code> may '''not''' be preserved because the attribute <code>xml:space</code> has the value <code>default</code>:

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre xml:space="default">Some txt with many spaces. </pre>
</doc>

===Filter Options===

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI <code>okapi-framework:xmlfilter-options</code>.

{{NoteBox|The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules have no effect.}}

When you use several options, they must be set in a single <code><okp:options></code> element, as shown below:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"
escapeQuotes="no"
escapeGT="yes"
/>
</its:rules></pre>

The following options are available:

* [[#lineBreakAsCode|lineBreakAsCode]]
* [[#codeFinder|codeFinder]]
* [[#omitXMLDeclaration|omitXMLDeclaration]]
* [[#escapeQuotes|escapeQuotes]]
* [[#escapeGT|escapeGT]]
* [[#escapeNbsp|escapeNbsp]]
* [[#extractIfOnlyCodes|extractIfOnlyCodes]]
* [[#inlineCdata|inlineCdata]]
* [[#extractUntranslatable|extractUntranslatable]]

====lineBreakAsCode====

In some cases the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with <code>&#10;</code> entity references. They need to be passed as inline codes.

By default this option is set to false.

To enable this behavior, use the <code>lineBreakAsCode</code> option attribute. It affects all the extracted content.

For example: The following code is an ITS document with the option to treat line-breaks as code. It can be used along with the example of XML document listed below.

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"/>
</its:rules></pre>

<doc>
<data>line 1&#10;line 2.</data>
</doc>

====codeFinder====

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables or HTML tags that need to be protected from modification and treated as codes. Use the <code>codeFinder</code> element for this.

In the following parameters file, the <code>codeFinder</code> element defines two rules:

* The first one (rule0) is "<code><(/?)\w[^>]*?></code>" and matches any XML-type tags (e.g. "<code></code>", "<code></code>", "<code> </code>")
* The second one (rule1) is "<code>(#\w+?\#)|(%\d+?%)</code>" and matches any word enclosed in <code>#</code> (e.g. "<code>#VAR#</code>") or number enclosed in <code>%</code> (e.g. "<code>%1%</code>").

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
</okp:codeFinder>
</its:rules></pre>

Some important details:

* Set <code>useCodeFinder</code> to "yes" to have the rules used; if the attribute is missing, its value is assumed to be "no".
* Make sure the first line of the <code><codeFinder></code> element content is <code>#v1</code>.
* Each entry in the content must be on a separate line.
* <code>count.i=N</code> must be before any rules and <code>N</code> must be the number of rules.
* <code>ruleN</code> must be incremented starting at 0.
* The pattern for a rule must be escaped for XML, for example: "<code><(/?)\w[^>]*?></code>" must be entered "<code>&lt;(/?)\w[^&gt;]*?&gt;</code>" in the parameters file.
* Do not put spaces before <code>count.i</code> or <code>ruleN</code>, and not after your expressions.
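
The points above can be checked with a short script that builds the content of the <code>codeFinder</code> element from a list of patterns and tries the patterns on a sample string. This is a sketch only: the function names and the sample text are hypothetical, the XML escaping relies on Python's <code>xml.sax.saxutils.escape</code>, and the patterns are evaluated with Python's <code>re</code> module rather than Java regular expressions.

```python
import re
from xml.sax.saxutils import escape

# The two rules from the parameters file above, unescaped.
RULES = [r"<(/?)\w+[^>]*?>", r"(#\w+?\#)|(%\d+?%)"]


def to_code_finder_content(rules):
    """Build the text content of the codeFinder element, one entry per line."""
    lines = ["#v1", "count.i=%d" % len(rules)]
    # escape() replaces &, < and > so the patterns are safe inside XML.
    lines += ["rule%d=%s" % (i, escape(rule)) for i, rule in enumerate(rules)]
    return "\n".join(lines)


def find_codes(text, rules):
    """Return the spans matched by any of the rules, in document order."""
    combined = re.compile("|".join("(?:%s)" % rule for rule in rules))
    return [match.group(0) for match in combined.finditer(text)]


print(to_code_finder_content(RULES))
print(find_codes("Press <b>#KEY#</b> %1% times", RULES))
```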

To facilitate the creation of code finder rules [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].

====omitXMLDeclaration====

By default an XML declaration is always set at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, as its name must be specified in the XML declaration. However, there are a few special cases when the declaration is better left off. To handle those rare cases, you can use <code>omitXMLDeclaration</code> to tell the filter not to output the XML declaration.

For example:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options omitXMLDeclaration="yes"/>
</its:rules></pre>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.

====escapeQuotes====

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping the literal quotes:

* Inside the attribute values:
** Single-quotes (=apostrophes) are never escaped
** Double-quotes are always escaped
* In element content:
** Single-quotes (=apostrophes) are not escaped
** Double-quotes are escaped by default

You cannot change the escaping rules for attributes.

For element content: If the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes into account the <code>escapeQuotes</code> option to escape or not escape double-quotes in the translatable content.

For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered):

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeQuotes="no"/>
</its:rules></pre>

====escapeGT====

By default the character '<code>></code>' is escaped. You can indicate to the filter to not escape it using the <code>escapeGT</code> option.

For example, the following parameters file indicates to not escape greater-than characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeGT="no"/>
</its:rules></pre>

====escapeNbsp====

By default the non-breaking space character is escaped (in the form <code>&#x00a0;</code>). You can indicate to the filter to not escape it using the <code>escapeNbsp</code> option.

For example, the following parameters file indicates to not escape the non-breaking space characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeNbsp="no"/>
</its:rules></pre>

====extractIfOnlyCodes====

By default all extractable entries are extracted even when they contain only white-spaces and/or inline codes. You can indicate to the filter to not extract such entries using the <code>extractIfOnlyCodes</code> option.

For example, the following parameters file indicates to not extract entries with only white-spaces and/or inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractIfOnlyCodes="no"/>
</its:rules></pre>

====inlineCdata====

By default, CDATA sections will be exposed as regular content, and the CDATA markers themselves will be discarded. When the <code>inlineCdata</code> option is set, the CDATA markers will be exposed as inline codes.

For example, the following parameters file will expose CDATA markers as inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options inlineCdata="yes"/>
</its:rules></pre>

====extractUntranslatable====

By default, untranslatable entries (<code>its:translate="no"</code>) are not extracted. To allow the extraction of such entries for context, use the <code>extractUntranslatable</code> option.

Below is an example of this option declaration:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractUntranslatable="yes"/>
</its:rules></pre>

With this option, untranslatable content is extracted but marked as <code>translate="no"</code> in XLIFF.

Hint: With <code>extractUntranslatable="yes"</code>, all untranslatable content is extracted. To extract only some of it and exclude the rest, you can "misuse" the ITS <code>localeFilterList</code> attribute in a rule such as the following:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractUntranslatable="yes"/>
<its:localeFilterRule selector="//yourTagThatShouldNotBeExtracted" localeFilterList="!*"/>
</its:rules></pre>

==Limitations==

* Currently, in some cases, the ITS rule <code>withinTextRule</code> with the value <code>nested</code> may act like it has a value <code>yes</code> instead.
* In output, the values of the <code>xml:lang</code> attributes are not updated to reflect the target language.
* When doing the extraction, the whole input file is loaded into memory. You may run into memory limitation if the document is very large.

[[Category:Filters]] [[Category:ITS]]

JSON Filter

2024-03-13T16:33:41Z

Translate5Support: /* New Extraction Rules >= version M39 */

{{Filters Header}}
==Overview==

The JSON Filter is an Okapi component that implements the IFilter interface for JSON (JavaScript Object Notation).

The implementation is based on the JSON specifications: http://www.json.org/

The following is an example of a very simple JSON file. The translatable text is highlighted:

{"menu": {
"value": "File",
"popup": {
"menuitem": [
{"value": "New"},
{"value": "Open"},
{"value": "Close"}
]
}
}}

==Processing Details==

===Input Encoding===

JSON files are normally in one of the Unicode encodings, but the filter supports any encoding. It decides which encoding to use for the input file using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Else, if a header entry with a <code>charset</code> declaration exists in the first 1000 characters of the file:
** If the value of the <code>charset</code> is "<code>charset</code>" (case insensitive):
*** Then the file is likely to be a template with no encoding declared, so the current encoding (auto-detected or default) is used.
** Else, the declared encoding is used. Note that if the encoding has been detected from a Byte-Order-Mark and the encoding declared in the header entry does not match, a warning is generated and the encoding of the Byte-Order-Mark is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

===Line-Breaks===

The type of line-breaks of the output is the same as the one of the original input.

===Comments===

Though not technically legal in JSON, these comment types are supported:
<code>
* // comment
* # comment
* /* comment */
* 
</code>

==Parameters==

=== Options Tab===

====Stand-alone strings====

<cite>Extract strings without associated key</cite> — Set this option to extract strings that are not directly associated with a key.

====Strings with keys====

<cite>Extract all key/strings pairs</cite> — Set this option to extract all strings that have a key associated. If a regular expression for exceptions is defined, the strings that have a key matching the expression are not extracted.

<cite>Do not extract key/string pairs</cite> — Set this option to not extract any string that has an associated key. If a regular expression for exceptions is defined, the strings that have a key matching the expression are extracted.

<cite>Excepted when the key matches the following regular expression</cite> — Enter a regular expression that corresponds to the keys whose behavior should be the inverse of the default behavior you have selected for the key/string pairs.
For example, when all key/string pairs are extracted by default, the expression <code>key</code> excludes the string of a key named <code>key</code>.
In combination with <code>Use the full key path</code>, the expression <code>^.*?/excludedStructure/.*</code> excludes all nested elements of a JSON structure.

<cite>Use the key as the resname</cite> — Set this option to use the value of the key as the value of the name of the extracted item (<code>resname</code> in XLIFF).

<cite>Use the full key path</cite> — Set this option to use the full key path in the <code>resname</code>. For example: <code>/menu/value/popup/menuitem/value</code>. The <cite>Use the key as the resname</cite> option must be set for this option to take effect. If enabled, exception regular expressions apply to the full path.

<cite>Include leading "/" on key path</cite> — Set this option to have a leading character '/' in the full key path.

<cite>Regex matching keys that are notes, values of which to appear as <note> in XLIFF</cite> — Specify regular expression. The values of the matching keys will be transferred to <note> elements in XLIFF.

<cite>Regex matching keys who's values are added as TextUnit Metadata</cite> — Specify regular expression. The values of the matching keys will be written out as <context-group> elements in XLIFF.

===New Extraction Rules >= version M39===
If specified, these rules override the corresponding rules above.

<cite>Regex matching keys who's values are extracted (overrides extraction exceptions)</cite>

<cite>Regex matching keys that are notes, values of which to appear as <note> in XLIFF</cite>

<cite>Regex matching keys which are ID's (resname in XLIFF), overrides "use key as resname"</cite>

Hint: Suppose your JSON contains the actual key as the value of a neighboring key/value pair:
<pre>
[
{
"key": "datePicker_marchMonth",
"text": "March"
},
{
"key": "datePicker_aprilMonth",
"text": "April"
}
]
</pre>
If you simply define the regex <code>key</code> in this configuration option, the following XLIFF is extracted:
<pre>
<trans-unit id="tu1" resname="datePicker_marchMonth" xml:space="preserve">
<source xml:lang="en-US">March</source>
<target xml:lang="de-DE"></target>
</trans-unit>
<trans-unit id="tu2" resname="datePicker_aprilMonth" xml:space="preserve">
<source xml:lang="en-US">April</source>
<target xml:lang="de-DE"></target>
</trans-unit>
</pre>

<cite>Regex matching keys who's values are added as TextUnit Metadata</cite>

====Example FPRM Settings:====
Regex rules apply to key names.

'''
extraction rules (use instead of rule exceptions):
extractionRules=/widgets/body.*

note rules (add values to TextUnits as notes):
noteRules=/widgets/name.*

id rules (overrides useKeyAsName):
idRules=/widgets/id.*

generic metadata (matched key:values are added as metadata to TextUnit):
genericMetaRules=/widgets/image.*'''

===Content Processing Tab===

<cite>Process text content with this sub-filter</cite> — Specify an Okapi filter ID (e.g. <code>okf_html</code>) to process the content of all translatable text with that filter. Leave this field blank for default behavior.

<cite>Find inline codes by patterns defined below</cite> — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code.

'''Note:''' This option cannot be used together with the sub-filtering option.

By default the expression is:

((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])
|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)
|(\{\d.*?\}))
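
As a rough illustration of what the default expression captures, the following Python sketch applies it to a sample string. Okapi evaluates the expression with Java's regular-expression engine; Python's <code>re</code> module is used here only as an approximation. The names <code>DEFAULT_CODE_PATTERN</code> and <code>find_codes</code> and the sample text are hypothetical and not part of Okapi.

```python
import re

# The default expression from the text above, joined onto one line.
# Assumption: Python's `re` behaves like Java's engine for this pattern.
DEFAULT_CODE_PATTERN = re.compile(
    r"((%(([-0+#]?)[-0+#]?)((\d\$)?)(([\d\*]*)(\.[\d\*]*)?)[dioxXucsfeEgGpn])"
    r"|((\\r\\n)|\\a|\\b|\\f|\\n|\\r|\\t|\\v)"
    r"|(\{\d.*?\}))"
)


def find_codes(text):
    """Return every substring the default pattern would turn into an inline code."""
    return [match.group(0) for match in DEFAULT_CODE_PATTERN.finditer(text)]


# Raw string: the sample contains a literal backslash followed by "n",
# as it would appear inside an extracted string value.
sample = r"Copied %d of %1$s files to {0}\nDone in %.2f s"
print(find_codes(sample))
```

The three alternatives of the expression cover printf-style specifiers, literal escape sequences such as <code>\n</code>, and brace placeholders such as <code>{0}</code>.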

{{CodeFinder Help}}

==Limitations==

Comments within a JSON string are parsed as part of the string content, not as comments. A configured subfilter will then process these as true comments (they will become part of the skeleton or whatever the filter is configured to do).
[[Category:Filters]]

XML Filter

2024-02-22T21:34:06Z

Translate5Support: /* Filter Options */

{{Filters Header}}
==Overview==

This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement [[ITS]]. If you need to process very large XML documents and have no need for ITS, you may want to look at using the [[XML Stream Filter]].

The following is an example of a simple XML document. The translatable text is highlighted. Because each XML-based format is different, you need to know which parts are translatable, which elements are inline, and so on. The XML Filter [[#ITS Support|implements the ITS W3C Recommendation]] to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
<prolog>
<author>Zebulon Fairfield</author>
<version>version 12, revision 2 - 2006-08-14</version>
<keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
<storageKey>articles-6D272BA9-3B89CAD8</storageKey>
</prolog>
<body>
<title>Appaloosa</title>
The Appaloosas are rugged horses originally bred by
the <kw>Nez-Perce</kw> tribe in the US Northwest.
They are often characterized by their spotted coats.
</body>
</myDoc>

This filter is implemented in the class <code>net.sf.okapi.filters.xml.XMLFilter</code> of the library.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
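
The two rules above can be sketched as follows. This is only an illustration in Python, not the filter's actual implementation: the function name <code>choose_input_encoding</code>, its parameters, and the simplified declaration pattern are assumptions, and the sketch only handles documents whose first bytes are ASCII-compatible.

```python
import codecs
import re

# Simplified pattern for an XML declaration that carries an encoding.
# \x22 and \x27 stand for the double and single quote characters.
_ENCODING_DECL = re.compile(
    rb"^<\?xml[^>]*?\bencoding\s*=\s*[\x22\x27]([A-Za-z][A-Za-z0-9._-]*)[\x22\x27]"
)


def choose_input_encoding(head: bytes, caller_default: str) -> str:
    """Pick the input encoding from the first bytes of a document.

    `caller_default` stands for the default encoding given when the
    document is opened; per the rules above it is deliberately ignored.
    """
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _ENCODING_DECL.match(head)
    if match:
        # Rule 1: a declared encoding wins.
        return match.group(1).decode("ascii")
    # Rule 2: no declaration, so fall back to UTF-8.
    return "UTF-8"
```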

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.
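
The Byte-Order-Mark decision for UTF-8 output can be summarized in a few lines. The Python sketch below is only a restatement of the rules above; the function names are hypothetical, and the case where the output encoding is not UTF-8 is left out because this page does not describe it.

```python
def _is_utf8(encoding_name: str) -> bool:
    """Loose comparison of encoding names (assumption: case and '_' vs '-' do not matter)."""
    return encoding_name.replace("_", "-").lower() in ("utf-8", "utf8")


def use_bom_for_utf8_output(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document starts with a Byte-Order-Mark."""
    if _is_utf8(input_encoding):
        # UTF-8 in, UTF-8 out: mirror the input document.
        return input_had_bom
    # Any other input encoding: no Byte-Order-Mark in the output.
    return False
```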

===Line-Breaks===

The output uses the same type of line-break as the original input.

==Parameters==

This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "[[How to Create a Custom Configuration for the XML Filter]]".

===ITS Support===

By default the filter processes XML documents based on the '''ITS defaults'''. That is:

* the content of all elements is translatable,
* and none of the attribute values is translatable.

Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is [[ITS|an ITS document]].

The '''Internationalization Tag Set (ITS)''' is a W3C Recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as a nested sub-flow of text, and much more.

The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0).

* The ITS 1.0 specification is available at http://www.w3.org/TR/its/.
* The ITS 2.0 specification is available at http://www.w3.org/TR/its20/.

See the "[[ITS]]" page for more details on the format.

The filter supports global and local rules and most data categories. See the '''[[ITS Components]]''' page for a detailed list of how the data categories are supported and other information on the implementation.

===ITS Extensions===

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

* [[#idValue and xml:id|idValue and xml:id]]
* [[#whiteSpaces|whiteSpaces]]

====idValue and xml:id====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#idvalue Id Value] data category, which should be used instead of this extension.}}

When the attribute <code>xml:id</code> is found on a translatable element, it is used as the name of the text unit generated for that element.

For example, in the snippet below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>".

<pre><p xml:id="id1">Text</p></pre>

The attribute <code>idValue</code> used in the ITS <code>translateRule</code> element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of <code>idValue</code> must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

For example, in the document below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>":

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p name="id1">text 1</p>
</doc></pre>

Note that <code>xml:id</code> takes precedence over an <code>idValue</code> declaration. For example, in the document below, the resource name associated with the text unit for the <code><p></code> element is "<code>xid1</code>", not "<code>id1</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p xml:id="xid1" name="id1">text 1</p>
</doc></pre>

You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <code><text></code> and <code><desc></code> are translatable, but they have only one corresponding ID: the <code>name</code> attribute of their parent element. To make sure you have a unique identifier for both the content of <code><text></code> and the content of <code><desc></code>, you can use the rules set in the example. The XPath expression "<code>concat(../@name, '_t')</code>" will give the ID "<code>id1_t</code>" and the expression "<code>concat(../@name, '_d')</code>" will give the ID "<code>id1_d</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
<its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
</its:rules>
<msg name="id1">
<text>Value of text</text>
<desc>Value of desc</desc>
</msg>
</doc></pre>
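
To make the effect of the two <code>concat()</code> expressions concrete, the Python sketch below computes the same identifiers by hand. It does not run the ITS rules or a real XPath engine: the parent lookup and the suffix table emulate <code>concat(../@name, '_t')</code> and <code>concat(../@name, '_d')</code>, and the names <code>SUFFIXES</code> and <code>resource_names</code> are hypothetical. The rules element is left out of the sample document for brevity.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
 <msg name="id1">
  <text>Value of text</text>
  <desc>Value of desc</desc>
 </msg>
</doc>"""

# Element name -> literal appended by the corresponding concat() expression.
SUFFIXES = {"text": "_t", "desc": "_d"}


def resource_names(xml_text):
    """Map each computed ID to the content of the element it identifies."""
    root = ET.fromstring(xml_text)
    names = {}
    for parent in root.iter():
        for child in parent:
            suffix = SUFFIXES.get(child.tag)
            if suffix is not None:
                # "../@name" is the parent's name attribute; a missing
                # attribute contributes an empty string, as in XPath.
                names[parent.get("name", "") + suffix] = child.text
    return names


print(resource_names(DOC))
```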

====whiteSpaces====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#preservespace Preserve Space] data category, which should be used instead of this extension.}}

The extension attribute <code>whiteSpaces</code> allows you to apply globally the equivalent of a local <code>xml:space</code> attribute.

For example, if you have a format where all <code><pre></code> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute <code>whiteSpaces="preserve"</code> in a <code><its:translateRule></code> element for the <code><pre></code> elements. In the example below, the spaces in the <code><pre></code> element will be preserved on extraction.

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre>Some txt with many spaces. </pre>
</doc>

Note that the <code>xml:space</code> attribute has precedence over <code>whiteSpaces</code>. For example, in the following example, the white spaces in the content of <code><pre></code> may '''not''' be preserved because the attribute <code>xml:space</code> has the value <code>default</code>:

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre xml:space="default">Some txt with many spaces. </pre>
</doc>

===Filter Options===

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI <code>okapi-framework:xmlfilter-options</code>.

{{NoteBox|The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules have no effect.}}

When you use several options, they must be set in a single <code><okp:options></code> element, as shown below:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"
escapeQuotes="no"
escapeGT="yes"
/>
</its:rules></pre>

If you need to switch on an option for certain parts of your XML filter and switch it off for others:

* The ITS definitions are evaluated from top to bottom.
* For ITS definitions placed above the first occurrence of the option, the default setting of the option will be active.
* To switch it off again further down, set the option again with the attribute value "no".

The following options are available:

* [[#lineBreakAsCode|lineBreakAsCode]]
* [[#codeFinder|codeFinder]]
* [[#omitXMLDeclaration|omitXMLDeclaration]]
* [[#escapeQuotes|escapeQuotes]]
* [[#escapeGT|escapeGT]]
* [[#escapeNbsp|escapeNbsp]]
* [[#extractIfOnlyCodes|extractIfOnlyCodes]]
* [[#inlineCdata|inlineCdata]]
* [[#extractUntranslatable|extractUntranslatable]]

====lineBreakAsCode====

In some cases the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with <code>&#10;</code> character references. They need to be passed as inline codes.

By default this option is turned off.

To enable this behavior, use the <code>lineBreakAsCode</code> option attribute. It affects all the extracted content.

For example, the following ITS document sets the option to treat line-breaks as codes. It can be used with the example XML document listed after it.

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"/>
</its:rules></pre>

<doc>
<data>line 1&#10;line 2.</data>
</doc>
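
The reason such references need special handling is that an XML parser resolves <code>&#10;</code> into an ordinary line-feed character before the content reaches the application. The Python sketch below shows this with the standard library parser, then splits the text into text and code parts. The function <code>as_text_and_codes</code> and its tuple representation are hypothetical illustrations and do not reflect how Okapi stores inline codes.

```python
import xml.etree.ElementTree as ET

SOURCE = "<doc><data>line 1&#10;line 2.</data></doc>"


def parsed_text(xml_text):
    """Return the content of <data> as the parser delivers it."""
    return ET.fromstring(xml_text).find("data").text


def as_text_and_codes(text):
    """Split content on line-feeds, representing each one as a code part."""
    parts = []
    for index, chunk in enumerate(text.split("\n")):
        if index > 0:
            parts.append(("code", "&#10;"))
        if chunk:
            parts.append(("text", chunk))
    return parts


content = parsed_text(SOURCE)
print(repr(content))
print(as_text_and_codes(content))
```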

====codeFinder====

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may contain variables or HTML tags that need to be protected from modification and treated as codes. Use the <code>codeFinder</code> element for this.

In the following parameters file, the <code>codeFinder</code> element defines two rules:

* The first one (rule0) is "<code><(/?)\w+[^>]*?></code>" and matches any XML-type tag (e.g. "<code><b></code>", "<code></b></code>", "<code><br /></code>").
* The second one (rule1) is "<code>(#\w+?\#)|(%\d+?%)</code>" and matches any word enclosed in <code>#</code> (e.g. "<code>#VAR#</code>") or number enclosed in <code>%</code> (e.g. "<code>%1%</code>").

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
</okp:codeFinder>
</its:rules></pre>

Some important details:

* Set <code>useCodeFinder</code> to "yes" to have the rules applied. If the attribute is missing, its value is assumed to be "no".
* Make sure the first line of the <code><codeFinder></code> element content is <code>#v1</code>.
* Each entry in the content must be on a separate line.
* <code>count.i=N</code> must be before any rules and <code>N</code> must be the number of rules.
* <code>ruleN</code> must be incremented starting at 0.
* The pattern for a rule must be escaped for XML. For example, "<code><(/?)\w+[^>]*?></code>" must be entered as "<code>&lt;(/?)\w+[^&gt;]*?&gt;</code>" in the parameters file.
* Do not put spaces before <code>count.i</code> or <code>ruleN</code>, or after your expressions.
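
The details listed above can be checked with a small script. The Python sketch below builds the text content of a <code>codeFinder</code> element from unescaped patterns and tries the patterns on a sample string. It is an illustration only: the function names and the sample text are hypothetical, Python's <code>re</code> module stands in for the Java engine used by Okapi, and combining the rules into a single alternation is an assumption about how they are applied.

```python
import re
from xml.sax.saxutils import escape

# The two rules from the example, written as plain (unescaped) patterns.
RULES = [r"<(/?)\w+[^>]*?>", r"(#\w+?\#)|(%\d+?%)"]


def code_finder_body(rules):
    """Build the element content: #v1, count.i, then one XML-escaped rule per line."""
    lines = ["#v1", f"count.i={len(rules)}"]
    lines += [f"rule{index}={escape(rule)}" for index, rule in enumerate(rules)]
    return "\n".join(lines)


def find_inline_codes(text, rules=RULES):
    """Return the spans that any of the rules would capture as inline codes."""
    combined = re.compile("|".join(f"(?:{rule})" for rule in rules))
    return [match.group(0) for match in combined.finditer(text)]


print(code_finder_body(RULES))
print(find_inline_codes("Press <b>#KEY#</b> to load %1% items"))
```

Writing the patterns unescaped and letting a library escape them avoids hand-editing mistakes in the character classes.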

To facilitate the creation of code finder rules, [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].

====omitXMLDeclaration====

By default an XML declaration is always written at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document, and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, because the encoding name must then be specified in the XML declaration. However, there are a few special cases where the declaration is better left off. To handle those rare cases, you can use <code>omitXMLDeclaration</code> to tell the filter not to output the XML declaration.

For example:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options omitXMLDeclaration="yes"/>
</its:rules></pre>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.

====escapeQuotes====

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping literal quotes:

* Inside the attribute values:
** Single-quotes (=apostrophes) are never escaped
** Double-quotes are always escaped
* In element content:
** Single-quotes (=apostrophes) are not escaped
** Double-quotes are escaped by default

You cannot change the escaping rules for attributes.

For element content: If the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes the <code>escapeQuotes</code> option into account to decide whether to escape double-quotes in the translatable content.

For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered):

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeQuotes="no"/>
</its:rules></pre>
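
The quoting rules of this section can be restated as two small functions. The Python sketch below is only a reading of the rules above, not the filter's code: the function and parameter names are hypothetical, <code>&quot;</code> is assumed as the escaped form of a double-quote, and the escaping of other characters such as <code>&</code> and <code><</code> is left out.

```python
def escape_attribute_value(value: str) -> str:
    """Attribute values: double-quotes are always escaped, apostrophes never."""
    return value.replace('"', "&quot;")


def escape_element_content(text: str,
                           escape_quotes: bool = True,
                           translatable_attribute_rule_triggered: bool = False) -> str:
    """Element content: apostrophes are never escaped.

    The escapeQuotes setting is honored only when no rule for
    translatable attributes was triggered; otherwise double-quotes
    are escaped as in the default behavior.
    """
    if translatable_attribute_rule_triggered or escape_quotes:
        return text.replace('"', "&quot;")
    return text
```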

====escapeGT====

By default the character '<code>></code>' is escaped. You can tell the filter not to escape it using the <code>escapeGT</code> option.

For example, the following parameters file tells the filter not to escape greater-than characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeGT="no"/>
</its:rules></pre>

====escapeNbsp====

By default the non-breaking space character is escaped (in the form <code>&#x00a0;</code>). You can tell the filter not to escape it using the <code>escapeNbsp</code> option.

For example, the following parameters file tells the filter not to escape non-breaking space characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeNbsp="no"/>
</its:rules></pre>

====extractIfOnlyCodes====

By default all extractable entries are extracted, even when they contain only white-spaces and/or inline codes. You can tell the filter not to extract such entries using the <code>extractIfOnlyCodes</code> option.

For example, the following parameters file tells the filter not to extract entries with only white-spaces and/or inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractIfOnlyCodes="no"/>
</its:rules></pre>

====inlineCdata====

By default, CDATA sections will be exposed as regular content, and the CDATA markers themselves will be discarded. When the <code>inlineCdata</code> option is set,
the CDATA markers will be exposed as inline codes.

For example, the following parameters file will expose CDATA markers as inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options inlineCdata="yes"/>
</its:rules></pre>

====extractUntranslatable====

By default, untranslatable entries (<code>its:translate="no"</code>) are not extracted. To allow the extraction of such entries for context, use the <code>extractUntranslatable</code> option.

Below is an example of this option declaration:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractUntranslatable="yes"/>
</its:rules></pre>

==Limitations==

* Currently, in some cases, the ITS rule <code>withinTextRule</code> with the value <code>nested</code> may act like it has a value <code>yes</code> instead.
* In output, the values of the <code>xml:lang</code> attributes are not updated to reflect the target language.
* During extraction, the whole input file is loaded into memory. You may run into memory limitations if the document is very large.

[[Category:Filters]] [[Category:ITS]]

XML Filter

2024-02-22T21:33:36Z

Translate5Support: /* Filter Options */

{{Filters Header}}
==Overview==

This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement [[ITS]]. If you need to process very large XML documents and have no need for ITS, you may want to look at using the [[XML Stream Filter]].

The following is an example of a simple XML document. The translatable text is highlighted. Because each XML-based format is different, you need to know which parts are translatable, which elements are inline, and so on. The XML Filter [[#ITS Support|implements the ITS W3C Recommendation]] to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
<prolog>
<author>Zebulon Fairfield</author>
<version>version 12, revision 2 - 2006-08-14</version>
<keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
<storageKey>articles-6D272BA9-3B89CAD8</storageKey>
</prolog>
<body>
<title>Appaloosa</title>
The Appaloosas are rugged horses originally bred by
the <kw>Nez-Perce</kw> tribe in the US Northwest.
They are often characterized by their spotted coats.
</body>
</myDoc>

This filter is implemented in the class <code>net.sf.okapi.filters.xml.XMLFilter</code> of the library.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.

===Line-Breaks===

The output uses the same type of line-break as the original input.

==Parameters==

This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "[[How to Create a Custom Configuration for the XML Filter]]".

===ITS Support===

By default the filter processes XML documents based on the '''ITS defaults'''. That is:

* the content of all elements is translatable,
* and none of the attribute values is translatable.

Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is [[ITS|an ITS document]].

The '''Internationalization Tag Set (ITS)''' is a W3C Recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as a nested sub-flow of text, and much more.

The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0).

* The ITS 1.0 specification is available at http://www.w3.org/TR/its/.
* The ITS 2.0 specification is available at http://www.w3.org/TR/its20/.

See the "[[ITS]]" page for more details on the format.

The filter supports global and local rules and most data categories. See the '''[[ITS Components]]''' page for a detailed list of how the data categories are supported and other information on the implementation.

===ITS Extensions===

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

* [[#idValue and xml:id|idValue and xml:id]]
* [[#whiteSpaces|whiteSpaces]]

====idValue and xml:id====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#idvalue Id Value] data category, which should be used instead of this extension.}}

When the attribute <code>xml:id</code> is found on a translatable element, it is used as the name of the text unit generated for that element.

For example, in the snippet below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>".

<pre><p xml:id="id1">Text</p></pre>

The attribute <code>idValue</code> used in the ITS <code>translateRule</code> element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of <code>idValue</code> must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

For example, in the document below, the resource name associated with the text unit for the <code><p></code> element is "<code>id1</code>":

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p name="id1">text 1</p>
</doc></pre>

Note that <code>xml:id</code> takes precedence over an <code>idValue</code> declaration. For example, in the document below, the resource name associated with the text unit for the <code><p></code> element is "<code>xid1</code>", not "<code>id1</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
<p xml:id="xid1" name="id1">text 1</p>
</doc></pre>

You can build complex IDs based on different attributes, elements, or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <code><text></code> and <code><desc></code> are translatable, but they have only one corresponding ID: the <code>name</code> attribute of their parent element. To make sure you have a unique identifier for both the content of <code><text></code> and the content of <code><desc></code>, you can use the rules set in the example. The XPath expression "<code>concat(../@name, '_t')</code>" will give the ID "<code>id1_t</code>" and the expression "<code>concat(../@name, '_d')</code>" will give the ID "<code>id1_d</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
<its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
</its:rules>
<msg name="id1">
<text>Value of text</text>
<desc>Value of desc</desc>
</msg>
</doc></pre>

====whiteSpaces====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#preservespace Preserve Space] data category, which should be used instead of this extension.}}

The extension attribute <code>whiteSpaces</code> allows you to apply globally the equivalent of a local <code>xml:space</code> attribute.

For example, if you have a format where all <code><pre></code> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute <code>whiteSpaces="preserve"</code> in a <code><its:translateRule></code> element for the <code><pre></code> elements. In the example below, the spaces in the <code><pre></code> element will be preserved on extraction.

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre>Some txt with many spaces. </pre>
</doc>

Note that the <code>xml:space</code> attribute has precedence over <code>whiteSpaces</code>. For example, in the following example, the white spaces in the content of <code><pre></code> may '''not''' be preserved because the attribute <code>xml:space</code> has the value <code>default</code>:

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre xml:space="default">Some txt with many spaces. </pre>
</doc>

===Filter Options===

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI <code>okapi-framework:xmlfilter-options</code>.

{{NoteBox|The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules have no effect.}}

When you use several options, they must be set in a single <code><okp:options></code> element, as shown below:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"
escapeQuotes="no"
escapeGT="yes"
/>
</its:rules></pre>

If you need to switch on an option for certain parts of your XML filter and switch it off for others:

* The ITS definitions are evaluated from top to bottom.
* For ITS definitions placed above the first occurrence of the option, the default setting of the option will be active.
* To switch it off again further down, set the option again with the attribute value "no".

The following options are available:

* [[#lineBreakAsCode|lineBreakAsCode]]
* [[#codeFinder|codeFinder]]
* [[#omitXMLDeclaration|omitXMLDeclaration]]
* [[#escapeQuotes|escapeQuotes]]
* [[#escapeGT|escapeGT]]
* [[#escapeNbsp|escapeNbsp]]
* [[#extractIfOnlyCodes|extractIfOnlyCodes]]
* [[#inlineCdata|inlineCdata]]
* [[#extractUntranslatable|extractUntranslatable]]

====lineBreakAsCode====

In some cases the content of an element includes line-breaks that need to be included as part of the content but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with <code>&#10;</code> character references. They need to be passed as inline codes.

By default this option is turned off.

To enable this behavior, use the <code>lineBreakAsCode</code> option attribute. It affects all the extracted content.

For example, the following ITS document sets the option to treat line-breaks as codes. It can be used with the example XML document listed after it.

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"/>
</its:rules></pre>

<doc>
<data>line 1&#10;line 2.</data>
</doc>

====codeFinder====

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may contain variables or HTML tags that need to be protected from modification and treated as codes. Use the <code>codeFinder</code> element for this.

In the following parameters file, the <code>codeFinder</code> element defines two rules:

* The first one (rule0) is "<code><(/?)\w+[^>]*?></code>" and matches any XML-type tag (e.g. "<code><b></code>", "<code></b></code>", "<code><br /></code>").
* The second one (rule1) is "<code>(#\w+?\#)|(%\d+?%)</code>" and matches any word enclosed in <code>#</code> (e.g. "<code>#VAR#</code>") or number enclosed in <code>%</code> (e.g. "<code>%1%</code>").

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
</okp:codeFinder>
</its:rules></pre>

Some important details:

* Set <code>useCodeFinder</code> to "yes" to have the rules applied. If the attribute is missing, its value is assumed to be "no".
* Make sure the first line of the <code><codeFinder></code> element content is <code>#v1</code>.
* Each entry in the content must be on a separate line.
* <code>count.i=N</code> must be before any rules and <code>N</code> must be the number of rules.
* <code>ruleN</code> must be incremented starting at 0.
* The pattern for a rule must be escaped for XML. For example, "<code><(/?)\w+[^>]*?></code>" must be entered as "<code>&lt;(/?)\w+[^&gt;]*?&gt;</code>" in the parameters file.
* Do not put spaces before <code>count.i</code> or <code>ruleN</code>, or after your expressions.

To facilitate the creation of code finder rules, [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].

====omitXMLDeclaration====

By default an XML declaration is always written at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document, and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, because the encoding name must then be specified in the XML declaration. However, there are a few special cases where the declaration is better left off. To handle those rare cases, you can use <code>omitXMLDeclaration</code> to tell the filter not to output the XML declaration.

For example:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options omitXMLDeclaration="yes"/>
</its:rules></pre>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.

====escapeQuotes====

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping literal quotes:

* Inside the attribute values:
** Single-quotes (=apostrophes) are never escaped
** Double-quotes are always escaped
* In element content:
** Single-quotes (=apostrophes) are not escaped
** Double-quotes are escaped by default

You cannot change the escaping rules for attributes.

For element content: if the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes the <code>escapeQuotes</code> option into account to decide whether to escape double-quotes in the translatable content.

For example, the following parameters file disables the escaping of double-quotes in element content (for documents where no rule for translatable attributes is triggered):

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeQuotes="no"/>
</its:rules></pre>

====escapeGT====

By default, the character '<code>></code>' is escaped. You can instruct the filter not to escape it using the <code>escapeGT</code> option.

For example, the following parameters file tells the filter not to escape greater-than characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeGT="no"/>
</its:rules></pre>

====escapeNbsp====

By default, the non-breaking space character is escaped (in the form <code>&#x00a0;</code>). You can instruct the filter not to escape it using the <code>escapeNbsp</code> option.

For example, the following parameters file tells the filter not to escape non-breaking space characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeNbsp="no"/>
</its:rules></pre>
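
To summarize how <code>escapeQuotes</code>, <code>escapeGT</code> and <code>escapeNbsp</code> affect element content, here is a rough Python sketch. It is not the filter's actual code: the function name is hypothetical, and the use of <code>&quot;</code> for double-quotes is an assumption made for illustration. It also ignores the condition that <code>escapeQuotes</code> is only honored when no rule for translatable attributes is triggered.

```python
def escape_element_content(text, escape_quotes=True, escape_gt=True, escape_nbsp=True):
    """Approximate the element-content escaping described above (illustrative only)."""
    # '&' and '<' must always be escaped in XML content; '&' is handled first
    # so that the entities produced below are not escaped a second time.
    result = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        result = result.replace(">", "&gt;")
    if escape_quotes:
        # Assumption: double-quotes are written as &quot; when escaped.
        result = result.replace('"', "&quot;")
    if escape_nbsp:
        result = result.replace("\u00a0", "&#x00a0;")
    # Single-quotes are never escaped.
    return result


if __name__ == "__main__":
    sample = 'a > b\u00a0"c"'
    print(escape_element_content(sample))
    print(escape_element_content(sample, escape_quotes=False, escape_gt=False))
```

The keyword arguments default to <code>True</code> to mirror the defaults described above, where each kind of character is escaped unless the corresponding option is set to "no".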

====extractIfOnlyCodes====

By default, all extractable entries are extracted even when they contain only white-spaces and/or inline codes. You can instruct the filter not to extract such entries using the <code>extractIfOnlyCodes</code> option.

For example, the following parameters file tells the filter not to extract entries that contain only white-spaces and/or inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractIfOnlyCodes="no"/>
</its:rules></pre>

====inlineCdata====

By default, CDATA sections are exposed as regular content, and the CDATA markers themselves are discarded. When the <code>inlineCdata</code> option is set to "yes", the CDATA markers are exposed as inline codes.

For example, the following parameters file will expose CDATA markers as inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options inlineCdata="yes"/>
</its:rules></pre>

====extractUntranslatable====

By default, untranslatable entries (<code>its:translate="no"</code>) are not extracted. To extract them, for example to provide context, use the <code>extractUntranslatable</code> option.

Below is an example of this option declaration:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractUntranslatable="yes"/>
</its:rules></pre>

==Limitations==

* Currently, in some cases, the ITS rule <code>withinTextRule</code> with the value <code>nested</code> may act as if it had the value <code>yes</code> instead.
* In output, the values of the <code>xml:lang</code> attributes are not updated to reflect the target language.
* During extraction, the whole input file is loaded into memory. You may run into memory limitations if the document is very large.

[[Category:Filters]] [[Category:ITS]]