Okapi Framework - Developer's Guide

Segmentation

- Overview
- Performing Segmentation
- Working with Segmented Content

Overview

Segmentation, in the context of the Okapi Framework, is the action of breaking a given piece of content down into parts: for example, taking the content of an extracted HTML <p> element and breaking it down into sentences.

Segmentation is of great importance in localization tasks. It allows you to define the granularity of the parts of text that are being translated, matched against translation memories, processed by machine translation, etc. Having different segmentation methods is often one of the causes of losing re-usable data when going from one tool to another.

In the framework, the basic unit of extraction is the text unit, which corresponds to different things depending on the original file format. Roughly, it is an un-segmented chunk of text that may be composed of several sentences. Many of the tasks performed on text units require manipulating the unit at a finer level: the segment. This page discusses segmentation in that context.

Performing Segmentation

The framework provides one interface for applying segmentation to text content: the ISegmenter interface.

Each implementation of ISegmenter may work differently. In this example we will use one default implementation of ISegmenter that is provided with the framework: SRXSegmenter. As its name indicates, it is based on the SRX standard.

To instantiate this segmenter you must first create an SRXDocument object and load or set the SRX rules to use.

SRXDocument doc = new SRXDocument();
doc.loadRules("myRules.srx");

Then you can obtain a segmenter for a given language.

ISegmenter segmenter = doc.compileLanguageRules(LocaleId.fromString("en"), null);

The second parameter of compileLanguageRules() is an optional segmenter object, in case you already have one and want to avoid the cost of re-creating one. You can just pass null to create a brand new one.

Once you have a segmenter with its rules set, you can use it to create segments on a given content. You can calculate the segments for a given plain text string or for a TextContainer.

Question: I've looked at the SRX specification and it seems quite complicated to write rules. Is there an easy way to create and edit SRX documents?

Answer: Sure. You can use Okapi's own SRX editor: Ratel (named after the tough honey-badger that roams the plains of Africa). You can download it from here. Just start Ratel and drag and drop your SRX document on it. The rules are applied on-the-fly to any sample text you enter.

With Plain Text

Here is an example of getting the segmentation for a plain text string:

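// Compute the segment boundaries; the return value is the number of segments.
int count = segmenter.computeSegments("Part 1. Part 2.");
System.out.println("count=" + count);
// Each Range holds the start (inclusive) and end (exclusive) of one segment.
for ( Range range : segmenter.getRanges() ) {
   System.out.println(String.format("start=%d, end=%d",
      range.start, range.end));
}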

The ISegmenter.computeSegments() method returns the number of segments it finds. It also creates an internal list of the ranges of these segments, which you can get with the ISegmenter.getRanges() method. Each entry of the list is a Range object that contains start and end values corresponding to the boundaries of the segment in the given text. The text of the segment goes from the character at the start position to the character just before the end position (just like the arguments of String.substring() in Java).

For example the code above will display this:

count=2
start=0, end=7
start=7, end=15

The first segment starts at 0 and ends at 7, so it corresponds to "Part 1." And the second segment starts at 7 and ends at 15, so it corresponds to " Part 2.".

Part 1. Part 2.
0000000000111111
0123456789012345
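The way a range maps back to text can be tried out without the framework. The following is a minimal sketch in plain Java: RangeDemo and SimpleRange are hypothetical names (not Okapi classes), and the regular expression is only a stand-in for an SRX rule that breaks after a period and before a space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: RangeDemo and SimpleRange are hypothetical names,
// not part of the Okapi API.
class RangeDemo {

    // Half-open range: start is inclusive, end is exclusive.
    static final class SimpleRange {
        final int start;
        final int end;
        SimpleRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Zero-width break point: just after a period and just before a space.
    private static final Pattern BREAK = Pattern.compile("(?<=\\.)(?= )");

    // Returns one range per segment, covering the whole text.
    static List<SimpleRange> computeRanges(String text) {
        List<SimpleRange> ranges = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            ranges.add(new SimpleRange(start, m.start()));
            start = m.start();
        }
        if (start < text.length()) {
            ranges.add(new SimpleRange(start, text.length()));
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "Part 1. Part 2.";
        for (SimpleRange r : computeRanges(text)) {
            // substring(start, end) excludes the character at 'end'.
            System.out.println("[" + text.substring(r.start, r.end) + "]");
        }
    }
}
```

Running the sketch prints [Part 1.] and then [ Part 2.], which are the texts for the ranges 0 to 7 and 7 to 15.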

While the segmenter is designed to work with coded text as we will see below, you can also use it on any kind of normal text as long as you have rules that correspond to your text format.

With a TextContainer and TextFragment

The method ISegmenter.computeSegments() can also take a TextContainer as its parameter. It works as it does for plain text, but in addition it takes into account any inline codes in the content.

For example, given a breaking rule with the text before the break set as a period and the text after the break set as a space, which segments should the inline codes </span> and <alone/> go in?

<span>Part 1.</span> Part 2.<alone/> Part 3.

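The SRX standard has options for these cases. The default options are as follows:

- Opening codes (such as <span>) are not included in the segment before the break.
- Closing codes (such as </span>) are included in the segment before the break.
- Isolated codes (such as <alone/>) are not included in the segment before the break.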

To try out the segmenter with inline codes, we first have to build a TextFragment object with the proper content:

TextFragment tf = new TextFragment();
tf.append(TagType.OPENING, "span", "<span>");
tf.append("Part 1.");
tf.append(TagType.CLOSING, "span", "</span>");
tf.append(" Part 2.");
tf.append(TagType.PLACEHOLDER, "alone", "<alone/>");
tf.append(" Part 3.");

Based on the TextFragment, you can then create an instance of TextContainer:

TextContainer tc = new TextContainer(tf);

The calculation of the segmentation itself is the same as before.

segmenter.computeSegments(tc);

To make things easier, the ISegments interface, which you obtain by calling TextContainer.getSegments(), offers a method to apply the ranges provided by the segmenter to the text content in one call: ISegments.create(), which takes a list of ranges as its parameter.

tc.getSegments().create(segmenter.getRanges());

You can then retrieve each segment of the now segmented container with TextContainer.getSegments(). The Segment class provides a simple structure that holds together the TextFragment object corresponding to the segment and the identifier of the segment.

for ( Segment seg : tc.getSegments() ) {
   System.out.println("segment=[" + seg.toString() + "]");
}

The code above results in the following output:

segment=[<span>Part 1.</span>]
segment=[ Part 2.]
segment=[<alone/> Part 3.]

Note that the SRX specification is unclear about the proper behavior of the segmenter when several consecutive inline codes occur just after the break point: the specification mentions only the case of a single code. In such cases the SRX implementation in Okapi behaves as if the consecutive inline codes were a single code, as long as they are of the same type.
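To see how the inclusion options decide where an inline code ends up, here is a rough sketch in plain Java. It is not the Okapi implementation: Kind, Token, sample() and segment() are hypothetical, and the handling of several consecutive codes (keep the leading codes that are included, stop at the first one that is not) is a simplifying assumption. It assumes the SRX defaults, where only closing codes stay in the segment before the break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustration only: these types are hypothetical, not the Okapi classes.
class InlineCodeBreakDemo {

    enum Kind { TEXT, OPENING, CLOSING, ISOLATED }

    static final class Token {
        final Kind kind;
        final String data;
        Token(Kind kind, String data) {
            this.kind = kind;
            this.data = data;
        }
    }

    // Breaks after a text token ending with a period when the next text token
    // starts with a space. Codes at the break point stay in the segment before
    // the break only if their kind is in 'included'.
    static List<String> segment(List<Token> tokens, Set<Kind> included) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < tokens.size()) {
            Token t = tokens.get(i++);
            current.append(t.data);
            if (t.kind != Kind.TEXT || !t.data.endsWith(".")) {
                continue;
            }
            // Look for the next text token to check the "after break" condition.
            int next = i;
            while (next < tokens.size() && tokens.get(next).kind != Kind.TEXT) {
                next++;
            }
            if (next == tokens.size() || !tokens.get(next).data.startsWith(" ")) {
                continue;
            }
            // Keep the leading codes that are included; stop at the first that is not.
            while (i < next && included.contains(tokens.get(i).kind)) {
                current.append(tokens.get(i++).data);
            }
            segments.add(current.toString());
            current.setLength(0);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    // <span>Part 1.</span> Part 2.<alone/> Part 3.
    static List<Token> sample() {
        return Arrays.asList(
            new Token(Kind.OPENING, "<span>"),
            new Token(Kind.TEXT, "Part 1."),
            new Token(Kind.CLOSING, "</span>"),
            new Token(Kind.TEXT, " Part 2."),
            new Token(Kind.ISOLATED, "<alone/>"),
            new Token(Kind.TEXT, " Part 3."));
    }

    public static void main(String[] args) {
        // SRX defaults: closing codes included, opening and isolated codes not.
        for (String s : segment(sample(), EnumSet.of(Kind.CLOSING))) {
            System.out.println("segment=[" + s + "]");
        }
    }
}
```

With the default set, the sketch prints the same three segments as the output shown above. Passing EnumSet.of(Kind.CLOSING, Kind.ISOLATED) instead moves <alone/> to the end of the second segment.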

Working with Segmented Content

A TextContainer is represented by a set of TextPart objects. Each part that represents a segment is a Segment object; the other parts represent the content between segments.

TextContainer: "Segment 1. ... Segment 2."
TextPart(0)/Segment(0): "Segment 1."
TextPart(1)           : " ... "
TextPart(2)/Segment(1): "Segment 2."

Note that even content on which no segmentation rules have been applied is represented as a part that is a Segment object. This allows you to treat segmented and un-segmented content the same way.
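The parts model can be pictured with a small stand-alone sketch. The Part class and the helper methods below are hypothetical and far simpler than the real TextPart and Segment classes; they only illustrate why code that walks the segments works the same way on segmented and un-segmented content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: Part is a hypothetical stand-in for TextPart/Segment.
class PartsDemo {

    static final class Part {
        final String text;
        final boolean segment; // true for a segment, false for content between segments
        Part(String text, boolean segment) {
            this.text = text;
            this.segment = segment;
        }
    }

    // Segmented content: two segments separated by one non-segment part.
    static List<Part> segmentedSample() {
        return Arrays.asList(
            new Part("Segment 1.", true),
            new Part(" ... ", false),
            new Part("Segment 2.", true));
    }

    // Un-segmented content is modeled as a single part that is a segment.
    static List<Part> unsegmented(String text) {
        return Collections.singletonList(new Part(text, true));
    }

    // Returns the text of the parts that are segments, in order.
    static List<String> segmentTexts(List<Part> parts) {
        List<String> result = new ArrayList<>();
        for (Part p : parts) {
            if (p.segment) {
                result.add(p.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The same call works on both kinds of content.
        System.out.println(segmentTexts(segmentedSample()));
        System.out.println(segmentTexts(unsegmented("Not segmented yet.")));
    }
}
```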

The segments of a content can be accessed in different ways. An easy one is to use the getSegments() method and then access each segment from there. For example, if you need to go through the segments (and only the segments) of a TextContainer, you would do something like this:

for ( Segment seg : tc.getSegments() ) {
   System.out.println("segment->[" + seg.toString() + "]");
}

The method getSegments() creates an instance of the ISegments interface that allows you to access the segments. Beware that each call re-creates the object, so avoid calling this method inside loops or in any place where it would be called several times.

If you need to go through all the parts (segments or not) of a TextContainer, you would do something like this:

for ( TextPart part : tc ) {
   if ( part.isSegment() ) {
      System.out.println("segment->[" + part.toString() + "]");
   }
   else {
      System.out.println("non-segment->{" + part.toString() + "}");
   }
}

Segments can be accessed through their index or their identifier. When working with both source and target segments, it is recommended to use the identifier, because a source segment and its corresponding target may not have the same index: they may be ordered differently in the translation.
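The difference between the two kinds of lookup can be shown with a short sketch. It uses plain maps from identifier to text instead of the Okapi classes, and the sample sentences and the pairById() helper are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: plain maps stand in for segmented source and target content.
class SegmentIdDemo {

    // Pairs each source text with the target text that has the same identifier.
    // A source segment without a matching target is paired with null.
    static List<String[]> pairById(Map<String, String> source, Map<String, String> target) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            pairs.add(new String[] { e.getValue(), target.get(e.getKey()) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up sample data: identifier -> segment text, in document order.
        Map<String, String> source = new LinkedHashMap<>();
        source.put("0", "First sentence.");
        source.put("1", "Second sentence.");
        // The translation reorders the sentences, so the indexes no longer line up.
        Map<String, String> target = new LinkedHashMap<>();
        target.put("1", "Zweiter Satz.");
        target.put("0", "Erster Satz.");

        for (String[] pair : pairById(source, target)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Pairing by position would match "First sentence." with "Zweiter Satz." here, while pairing by identifier keeps each source segment with its own translation.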