Ratel

SRX Standard

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=SRX

Overview

The SRX (Segmentation Rules eXchange) format is a standard to save segmentation rules in a file so they can be used between different tools.

SRX rules use regular expressions to define the text before and after an inter-character location, and specify whether that location should be a break or not. Rules are grouped into named sets that are activated based on the language code of the text to process.

Example of SRX simple rules:

<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
 <header segmentsubflows="yes" cascade="no">
  <formathandle type="start" include="no"></formathandle>
  <formathandle type="end" include="yes"></formathandle>
  <formathandle type="isolated" include="no"></formathandle>
 </header>
 <body>
  <languagerules>
   <languagerule languagerulename="default">
    <rule break="no">
     <beforebreak>([A-Z]\.){2,}</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>\.</beforebreak>
     <afterbreak>\s</afterbreak>
    </rule>
   </languagerule>
  </languagerules>
  <maprules>
   <languagemap languagepattern=".*" languagerulename="default"></languagemap>
  </maprules>
 </body>
</srx>

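In this example, there are two rules. The first one prevents a break after a sequence of two or more uppercase letters each followed by a period (an abbreviation such as "U.K.") when that sequence is followed by a whitespace. The second one sets a break after any period followed by a whitespace. The rules are tried in order, so the first rule takes precedence over the second.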

So, based on those rules, the following text:

I'm in the U.K. for now. But I plan to move to Papua New Guinea.

will break down into two segments:

[I'm in the U.K. for now.]
[ But I plan to move to Papua New Guinea.]

If the first rule were not there, the text would break down into three segments:

[I'm in the U.K.]
[ for now.]
[ But I plan to move to Papua New Guinea.]
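
The behavior above can be reproduced with a minimal Python sketch. It is only an illustration, not the Okapi implementation: the segment function and the RULES list are hypothetical names, Python's re module stands in for the regular expression engine, and the sketch assumes that the first rule matching at a location decides whether that location is a break.

```python
import re

# (is_break, beforebreak, afterbreak), in document order.
# Assumption: the first rule that matches at a location wins.
RULES = [
    (False, r"([A-Z]\.){2,}", r"\s"),
    (True, r"\.", r"\s"),
]


def segment(text, rules):
    """Split text at every location where the first matching rule is a break rule."""
    compiled = [
        # beforebreak must match text ending exactly at the location.
        (is_break, re.compile("(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides; skip the others
    segments.append(text[start:])
    return segments


text = "I'm in the U.K. for now. But I plan to move to Papua New Guinea."
for seg in segment(text, RULES):
    print("[" + seg + "]")
```

Calling segment(text, RULES[1:]) drops the no-break rule and gives the three-segment result.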

SRX Versions Issue

There are two versions of SRX: 1.0 and 2.0.

SRX version 1.0 has been implemented by several tools that interpreted the processing of SRX rules in different ways. As a result, the same SRX 1.0 document used in different tools may give you different segmentation.

To resolve this issue, an updated version 2.0 specification has been published and provides better implementation guidelines. So, in theory, the same version 2.0 document should give you the same segmentation in all tools.

You can find the specifications of SRX on the LISA web site.

Implementation Differences Between SRX 1.0 and SRX 2.0

There are two main types of implementations of SRX 1.0: the intended one, and one that uses a cascading matching of the language maps.

Tools like SDLX implement the intended SRX 1.0 behavior (non-cascading). Others, like Heartsome and Swordfish, implement SRX 1.0 with a cascading behavior.

In an SRX document, the segmentation rules are grouped into several <languagerule> elements. This way you can define different sets of rules to apply to different languages. The selection of which group of rules to use for a given language is driven by a table defined in the <maprules> element. Each entry in <maprules> is a <languagemap>. An entry holds two pieces of information: a regular expression pattern that specifies which language codes should use the entry, and a pointer to the group of rules for this entry.

<languagerules>
 <languagerule languagerulename='default'>
 </languagerule>
 <languagerule languagerulename='japanese'>
 </languagerule>
</languagerules>

<maprules>
 <languagemap languagepattern='ja.*' languagerulename='japanese'/>
 <languagemap languagepattern='.*' languagerulename='default'/>
</maprules>

The difference between the SRX 1.0 implementations is how they look up the <maprules> for a given language code.

  1. Some will use only the first <languagemap> that has a languagepattern matching the language code.
  2. Others will use all <languagemap> elements that have a languagepattern matching the language code.

The first interpretation is the correct one: In SRX 1.0 you use only the first <languagemap> that matches the given language code.
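
The two lookup behaviors can be sketched in Python as follows. This is a hedged illustration rather than code from any of the tools mentioned: the select_rule_groups function and its cascade flag are hypothetical, and the sketch assumes that a languagepattern must match the whole language code.

```python
import re

# (languagepattern, languagerulename), in the order of the <maprules> table above.
LANGUAGE_MAPS = [
    ("ja.*", "japanese"),
    (".*", "default"),
]


def select_rule_groups(language_code, language_maps, cascade=False):
    """Return the names of the rule groups to apply for language_code."""
    matching = [
        name
        for pattern, name in language_maps
        # Assumption: the pattern has to match the entire language code.
        if re.fullmatch(pattern, language_code)
    ]
    # Non-cascading: only the first matching entry is used.
    # Cascading: every matching entry is used, in table order.
    return matching if cascade else matching[:1]


print(select_rule_groups("ja-JP", LANGUAGE_MAPS))                # ['japanese']
print(select_rule_groups("ja-JP", LANGUAGE_MAPS, cascade=True))  # ['japanese', 'default']
print(select_rule_groups("en-US", LANGUAGE_MAPS))                # ['default']
```

For a Japanese language code, the non-cascading lookup selects only the japanese group, while the cascading lookup also adds the default group. That is why the same file can segment differently from one tool to another.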

It is true that nothing in the SRX 1.0 specification explicitly says it should work that way. But nothing in it says, explicitly or implicitly, that all matching <languagemap> elements should be used either.

The clue to the intended behavior is in the example of the SRX 1.0 specification:

<languagerules>
 <languagerule languagerulename="Default">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak>\s[a-z]</afterbreak>
  </rule>
  ...
 </languagerule>
 <languagerule languagerulename="Japanese">
  <rule break="no">
   <beforebreak>^\s*[0-9]+\.</beforebreak>
   <afterbreak>\s</afterbreak>
  </rule>
  <rule break="no">
   <beforebreak>[Ee][Tt][Cc]\.</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  <rule break="yes">
   <beforebreak>[\xff61\x3002\xff0e\xff1f\xff01]+</beforebreak>
   <afterbreak></afterbreak>
  </rule>
  ...
 </languagerule>
</languagerules>

<maprules>
 <maprule maprulename="Default">
  <languagemap languagepattern="JA.*" languagerulename="Japanese"/>
  <languagemap languagepattern=".*" languagerulename="Default"/>
 </maprule>
</maprules>

In this example, the same rules are defined in both the Default and the Japanese groups. If SRX 1.0 intended to use all the <languagemap> elements that match the given language code, there would be no point in duplicating those rules in Japanese. The Japanese group would contain only the extra Japanese-specific rules.

How to Convert From SRX 1.0 to SRX 2.0?

When loading or importing SRX 1.0 documents into an SRX 2.0 editor, you must take care to set the cascade option (introduced in v2.0) properly, depending on the provenance of the document: a document written for a cascading implementation needs the option set, and one written for a non-cascading implementation needs it unset.

Note that Ratel does not set the cascade option automatically because cascading is not the expected behavior of SRX 1.0.
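
As a rough sketch of that adjustment outside of Ratel, the following Python code sets the cascade attribute in the <header> of a document that is already in SRX 2.0 form. The set_cascade function and its source_cascades parameter are hypothetical names used for illustration. You decide the value yourself from the provenance of the file: for example yes for rules written for Heartsome or Swordfish, no for rules written for SDLX.

```python
import xml.etree.ElementTree as ET

SRX_NS = "http://www.lisa.org/srx20"


def set_cascade(srx_xml, source_cascades):
    """Return srx_xml with header/@cascade set to "yes" or "no"."""
    # Keep the SRX namespace as the default namespace when serializing.
    ET.register_namespace("", SRX_NS)
    root = ET.fromstring(srx_xml)
    header = root.find("{" + SRX_NS + "}header")
    if header is None:
        raise ValueError("no SRX 2.0 <header> element found")
    header.set("cascade", "yes" if source_cascades else "no")
    return ET.tostring(root, encoding="unicode")


doc = (
    '<srx xmlns="http://www.lisa.org/srx20" version="2.0">'
    '<header segmentsubflows="yes" cascade="no"/><body/></srx>'
)
# Rules that were written for a cascading SRX 1.0 tool.
print(set_cascade(doc, source_cascades=True))
```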

SRX and Java

The SRX standard uses ICU regular expressions. However, it is very difficult to implement the same set of expressions using Java.

See more details in the SRX and Java section.