SRX

From Okapi Framework
Revision as of 19:19, 4 June 2016 by Ysavourel (talk | contribs) (1 revision imported)

Overview

The SRX (Segmentation Rules eXchange) format is a standard for storing segmentation rules in a file so that they can be exchanged between different tools.

It was originally maintained by the OSCAR special interest group of the Localisation Industry Standards Association (LISA). In March 2011 LISA was closed and its standards were placed under a Creative Commons license.

Version 2.0 is the latest version of the specification and can be found here: http://www.gala-global.org/oscarStandards/srx/srx20.html.

SRX rules are grouped into named sets that are activated based on the language code of the text to process. Each rule defines the text parts before and after an inter-segment location, and specifies whether that location should be a break or not. The text parts are defined using regular expressions.

Example of simple SRX rules:

<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
 <header segmentsubflows="yes" cascade="no">
  <formathandle type="start" include="no"></formathandle>
  <formathandle type="end" include="yes"></formathandle>
  <formathandle type="isolated" include="no"></formathandle>
 </header>
 <body>
  <languagerules>
   <languagerule languagerulename="default">
    <rule break="no">
     <beforebreak>([A-Z]\.){2,}</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>\.</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
   </languagerule>
  </languagerules>
  <maprules>
   <languagemap languagepattern=".*" languagerulename="default"></languagemap>
  </maprules>
 </body>
</srx>

In this example, there are two rules.

The second one specifies that when an inter-character location is preceded by a period and followed by a white space, the rule is to break at that position.

The first rule specifies that when an inter-character location is preceded by the pattern ([A-Z]\.){2,} and followed by a white space, the rule is to not break at that position. Because the first rule is placed before the second rule, it takes precedence.

So, based on those rules, the following text:

I'm in the U.K. for now. But I plan to move to Papua New Guinea.

will break down into two segments:

[I'm in the U.K. for now.]
[ But I plan to move to Papua New Guinea.]

If the first rule were not there, the text would break down into three segments:

[I'm in the U.K.]
[ for now.]
[ But I plan to move to Papua New Guinea.]
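The behavior described above can be reproduced with a minimal Python sketch. It is only an illustration under stated assumptions: it uses Python's re module rather than the ICU regular expressions SRX is defined with, it ignores the header settings, and the names RULES and segment are hypothetical helpers, not part of any SRX tool. At each inter-character position it tries the rules in order and lets the first matching rule decide.

```python
import re

# Rules in document order: (break?, before-break pattern, after-break pattern).
RULES = [
    (False, r"([A-Z]\.){2,}", r"\s"),  # exception: abbreviations such as "U.K."
    (True, r"\.", r"\s"),              # break after a period followed by white space
]


def segment(text, rules=RULES):
    """Split text wherever the first matching rule is a break rule."""
    compiled = [
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # The before pattern must end exactly at pos (endpos makes \Z match
            # there); the after pattern must start exactly at pos.
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule wins, later rules are ignored
    segments.append(text[start:])
    return segments


text = "I'm in the U.K. for now. But I plan to move to Papua New Guinea."
print(segment(text))             # both rules: two segments
print(segment(text, RULES[1:]))  # first rule removed: three segments
```

Dropping the first rule with RULES[1:] shows why rule order matters: the break rule then also fires after "U.K.".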

SRX Versions Issue

There are two versions of SRX: 1.0 and 2.0.

SRX version 1.0 has been implemented by several tools that interpreted the processing of SRX rules in different ways. As a result, the same SRX 1.0 document used in different tools may give you different segmentation.

To resolve this issue, an updated version 2.0 specification has been published and provides better implementation guidelines. So, in theory, the same version 2.0 document should give you the same segmentation in all tools.

You can find the specifications of SRX on the LISA web site:

Implementation Differences for SRX 1.0

There are two main types of implementations of SRX 1.0: the intended one, and one that uses cascading matching of the language maps.

Tools like SDLX implemented the intended SRX 1.0 behavior (non-cascading). Others, like Swordfish, implemented SRX 1.0 with a cascading behavior.

In an SRX document, the segmentation rules are grouped into several <languagerule> elements. This way you can define different sets of rules that you apply to different languages. The selection of which group of rules to use for a given language is driven by a table defined in the <maprules> element. Each entry in <maprules> is a <languagemap>. An entry carries two pieces of information: a regular expression pattern that identifies which language codes should use the entry, and a pointer to the group of rules for this entry.

<languagerules>
 <languagerule languagerulename='default'>
 </languagerule>
 <languagerule languagerulename='japanese'>
 </languagerule>
</languagerules>

<maprules>
 <languagemap languagepattern='ja.*' languagerulename='japanese'/>
 <languagemap languagepattern='.*' languagerulename='default'/>
</maprules>

The difference between the SRX 1.0 implementations is how they look up the <maprules> for a given language code.

  1. Some will use only the first <languagemap> that has a languagepattern matching the language code.
  2. Others will use all <languagemap> elements that have a languagepattern matching the language code.

The first interpretation is the correct one: In SRX 1.0 you use only the first <languagemap> that matches the given language code.

It is true that there is nothing in the SRX 1.0 specification that says explicitly it should work that way. But there is also nothing explicitly (or implicitly) that says all matching <languagemap> should be used.
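The two lookup interpretations can be compared with a short Python sketch. It reuses the two <languagemap> entries from the sample above. The names LANGUAGE_MAPS and select_rule_groups are hypothetical, and matching the pattern against the whole language code, case-sensitively, is an assumption of this sketch rather than something stated here.

```python
import re

# (languagepattern, languagerulename) pairs, in document order.
LANGUAGE_MAPS = [
    ("ja.*", "japanese"),
    (".*", "default"),
]


def select_rule_groups(language_code, maps=LANGUAGE_MAPS, cascade=False):
    """Return the names of the rule groups that apply to language_code."""
    matching = [
        name
        for pattern, name in maps
        # Assumption: the pattern must match the entire language code.
        if re.fullmatch(pattern, language_code)
    ]
    # Non-cascading: only the first matching entry is used.
    # Cascading: every matching entry is used, in document order.
    return matching if cascade else matching[:1]


print(select_rule_groups("ja-JP"))                # ['japanese']
print(select_rule_groups("ja-JP", cascade=True))  # ['japanese', 'default']
print(select_rule_groups("fr"))                   # ['default']
```

For a Japanese language code the two interpretations select different rule sets, which is why the same SRX 1.0 file can segment differently from one tool to another.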

The clue to the intended behavior is in the example of the SRX 1.0 specification:

<languagerules>
 <languagerule languagerulename="Default">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak>\s[a-z]</afterbreak>
  </rule>
  ...
 </languagerule>
 <languagerule languagerulename="Japanese">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  <rule break="yes">
   <beforebreak>[\xff61\x3002\xff0e\xff1f\xff01]+</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  ...
 </languagerule>
</languagerules>

<maprules>
 <maprule maprulename="Default">
  <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
  <languagemap languagepattern=".*" languagerulename="Default"/>
 </maprule>
</maprules>

In this example, the same rules are defined in both the Default and the Japanese groups. If SRX 1.0 intended all the <languagemap> elements that match the given language code to be used, there would be no point in duplicating those rules in the Japanese group. The Japanese group would contain only the extra Japanese-specific rules.

How to Convert From SRX 1.0 to SRX 2.0?

The SRX 2.0 specification resolves the cascading issue by making it an option.

When loading or importing SRX 1.0 documents into an SRX 2.0 editor, you must be careful to set the cascade option properly, depending on the provenance of the document.

  • SRX 1.0 rules coming from Trados, SDLX and some other tools implement the normal SRX 1.0 behavior (no cascading). You should make sure the option is not set after you open the file.
  • SRX 1.0 rules coming from Heartsome, Swordfish, and some other tools are designed with cascading. You should make sure the option is set after you have opened the file.
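As a rough illustration of this step, the following Python sketch sets the cascade attribute on the <header> of a document that is already in SRX 2.0 form, based on the tool the rules came from. The CASCADING_TOOLS table and the set_cascade function are hypothetical helpers built from the tools named above. The sketch does not perform the rest of a 1.0 to 2.0 conversion, and it assumes that the header cascade attribute shown in the first example is the option in question.

```python
import xml.etree.ElementTree as ET

SRX20_NS = "http://www.lisa.org/srx20"

# Hypothetical provenance table: True means the tool expects cascading.
CASCADING_TOOLS = {
    "trados": False,
    "sdlx": False,
    "heartsome": True,
    "swordfish": True,
}


def set_cascade(srx20_xml, source_tool):
    """Return the SRX 2.0 document with header/@cascade set for source_tool."""
    ET.register_namespace("", SRX20_NS)  # keep the default namespace on output
    root = ET.fromstring(srx20_xml)
    header = root.find("{%s}header" % SRX20_NS)
    if header is None:
        raise ValueError("no <header> element found")
    cascading = CASCADING_TOOLS[source_tool.lower()]
    header.set("cascade", "yes" if cascading else "no")
    return ET.tostring(root, encoding="unicode")


doc = (
    '<srx xmlns="http://www.lisa.org/srx20" version="2.0">'
    '<header segmentsubflows="yes" cascade="no"/><body/></srx>'
)
print(set_cascade(doc, "Swordfish"))
```

A tool name missing from the table raises a KeyError, which is deliberate in this sketch: the provenance has to be known before the option can be set safely.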

SRX and Java

The SRX standard uses ICU regular expressions. However, it is very difficult to implement the same set of expressions using Java and some other programming languages.

See more details in the SRX and Java section.

SRX in the Okapi Framework

The Okapi framework uses SRX in many places. For example:

Note that the framework implements a few extensions to SRX.