Okapi Framework - Developer's Guide

Getting Started

Overview
Applications like Rainbow are built on top of the framework, but it is quite easy to develop other tools and scripts that use the framework components directly. This section describes how to write simple programs that use the Okapi Framework components to perform basic tasks such as reading and modifying a document.
One of the most important actors in the framework is the filter. All filters are accessible through a single common API: the IFilter interface. They take an input document in a given format and generate events that give you access to the extractable text of the document, as well as to the properties associated with it.
Question: Are these events like Java AWT events, working with a listener, etc.?
Answer: Not at all. In the context of the framework, an "event" is not related to Java "listeners" or "event sinks". Events are just the units of information (possibly with attached data) that are used to communicate between components. In the case of IFilter, events are obtained from the IFilter.next() method.
The filters are designed to work through a pipeline. A pipeline is a set of sequential steps, each step receiving events, processing them (if needed) and sending them to the next step. The framework has pre-defined components to build and execute pipelines easily, but we will see that later.
For now, let's concentrate on the filter itself. You do not need to use a pipeline to work with filters, as long as you know which events to send and receive.
Most events have an associated resource. Each resource is different depending on the type of event. Extracted text, properties, as well as grouping information are carried in those resources. The following table shows all the kinds of events that can go through a pipeline.
Events and Corresponding Resources:
Event | Resource | Filter-Specific Event?
---|---|---
START_BATCH | none | no (starts a batch)
START_BATCH_ITEM | none | no (starts a batch item)
RAW_DOCUMENT | RawDocument | no (corresponds to a document by itself)
START_DOCUMENT | StartDocument | yes
START_SUBDOCUMENT | StartSubDocument | yes
START_GROUP | StartGroup | yes
DOCUMENT_PART | DocumentPart | yes
TEXT_UNIT | TextUnit | yes
END_GROUP | Ending | yes
END_SUBDOCUMENT | Ending | yes
END_DOCUMENT | Ending | yes
CANCELED | none | no (used when a process is canceled)
FINISHED | none | no (used at the end of a set of documents)
NO_OP | none | no (used for no-operation)
CUSTOM | custom resource | no (used for custom events)
END_BATCH_ITEM | none | no (ends a batch item)
END_BATCH | none | no (ends a batch)
A filter always generates at least the START_DOCUMENT and END_DOCUMENT events. All other filter-specific events may or may not be generated, depending on each filter and each document. See the section Filter Events for more details.
Any event that is not understood by a component should simply be passed along down the pipeline.
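As a sketch of this pass-through behavior (and of how pipeline steps hand events to each other), here is a toy model in plain Java. None of these types are the real Okapi API; ToyEventType, ToyStep, and UppercasingStep are invented stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for the framework types; NOT the real Okapi API.
enum ToyEventType { START_DOCUMENT, TEXT_UNIT, CUSTOM, END_DOCUMENT }

interface ToyStep {
    ToyEventType handle(ToyEventType event);
}

// A step that only acts on TEXT_UNIT; every other event is forwarded untouched.
class UppercasingStep implements ToyStep {
    int handled = 0;
    public ToyEventType handle(ToyEventType event) {
        if (event == ToyEventType.TEXT_UNIT) {
            handled++; // a real step would modify the TextUnit resource here
        }
        return event; // pass everything, including unknown events, down the pipeline
    }
}

public class PassThroughDemo {
    public static void main(String[] args) {
        List<ToyEventType> input = Arrays.asList(
            ToyEventType.START_DOCUMENT, ToyEventType.TEXT_UNIT,
            ToyEventType.CUSTOM, ToyEventType.TEXT_UNIT, ToyEventType.END_DOCUMENT);
        ToyStep step = new UppercasingStep();
        List<ToyEventType> output = new ArrayList<>();
        for (ToyEventType e : input) {
            output.add(step.handle(e)); // each step forwards events to the next one
        }
        System.out.println(output.equals(input)); // true: no event was dropped
    }
}
```

The important point is the last line of handle(): even an event the step knows nothing about (here CUSTOM) is returned unchanged, so later steps still receive it.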
To use a filter, you first have to create a filter object. The Okapi Framework provides you with several filters, and you can write your own as well. In the example below, we use the net.sf.okapi.filters.properties.PropertiesFilter class, which implements the IFilter interface for Java properties files.

// Create a filter object
IFilter filter = new PropertiesFilter();
Some filters have options that are specific to the format they parse. You can set these filter-specific parameters with the IFilter.setParameters() method. The IFilter.getParameters() method allows you to retrieve the current parameters.
The PropertiesFilter has such options, but the defaults are just fine for this example, so we won't set any additional parameters.
Question: Is there an easy way to edit the filter-specific parameters?
Answer: It depends. The parameters are accessible through the IParameters interface and, at the least, this interface allows you to save the parameters to a file; that file can be a simple property-like text file. Some filters also provide a UI for modifying their parameters. See the Filter Parameters section of the guide for more details.
The next step is to open the document. This is done with the help of a RawDocument object, which carries all the information the filter needs to open the input document:

- A CharSequence object that contains the document itself (String is a type of CharSequence).
- A URI object pointing to the physical document.
- An InputStream object.

In the code below we set the source language to English, using the same standard language tag identifiers you would use with HTML and XML (BCP-47). We set the default encoding to UTF-8. Note that some filters are able to automatically detect the correct encoding of the input and ignore this value.
// Creates the RawDocument object
RawDocument res = new RawDocument(myInputStream, "UTF-8", LocaleId.fromString("en"));
// Opens the document
filter.open(res);
By default a filter generates the skeleton for the document (we'll see more about the skeleton later). If, for some reason, you do not want to generate the skeleton you can specify an additional parameter and use:
// Opens the document, without generating skeleton
filter.open(res, false);
Question: Why use an intermediate object for opening the document? Wouldn't it be simpler to just pass the parameters to open()?
Answer: The RawDocument object is used because in many cases filters are used in the broader environment of pipelines, where having all information about the input document in one object makes things much easier. The object is also the resource associated with the RAW_DOCUMENT event, which makes it possible to mix, in the same pipeline, steps using filter events and steps working on the input document directly.
Note that while the filters should do their best to support all the different input objects, this may not be possible in some cases because of the format the filter deals with. For example, a filter for Adobe Photoshop PSD files does not support CharSequence input objects, as PSD files are binary.
In our example we want to open a String-based document:
// Creates the RawDocument object
RawDocument res = new RawDocument("key1=Text1\nkey2=Text2", LocaleId.fromString("en"));
// Opens the document
filter.open(res);
This is equivalent to opening a physical properties file on the disk that has the following content:
key1=Text1
key2=Text2
Everything is now in place to start processing the input document. We use the IFilter.hasNext() method to see if there is any event to access. If there is, we use the IFilter.next() method to get the actual event. Keep the following rules in mind:

- Always call IFilter.hasNext() before calling IFilter.next().
- Call IFilter.next() once and only once before the next call to IFilter.hasNext().
- Do not rely on IFilter.next() returning null if there are no more events.

// Proper call of hasNext() and next()
while ( filter.hasNext() ) {
    Event event = filter.next();
    // Do something...
}
Once you have an event, you can query its type using Event.getEventType(). The value returned is one of the constants defined in EventType. Then, depending on the type, you can get the resource associated with this event (Event.getResource()) and access the data in the resource.
After the last event for the input document (END_DOCUMENT) has been read, IFilter.hasNext() returns false, and you can close the input. Note that it is good practice for filters to close the input themselves before sending the last event, but you should still call close() just in case.
// Get the events from the input document
while ( filter.hasNext() ) {
    Event event = filter.next();
    // Do something with the event...
    // Here, if the event is TEXT_UNIT, we display the key and the extracted text
    if ( event.getEventType() == EventType.TEXT_UNIT ) {
        TextUnit tu = (TextUnit)event.getResource();
        System.out.println("--");
        System.out.println("key=["+tu.getName()+"]");
        System.out.println("text=["+tu.getSource()+"]");
    }
}
// Close the input document
filter.close();
This should generate the following output:
--
key=[key1]
text=[Text1]
--
key=[key2]
text=[Text2]
Extracting the text parts of a document is useful, but an even more useful feature of the Okapi Framework is the ability to write the extracted data back into the original format.
As we have seen above, when you open a document with a filter, you can specify whether to generate the skeleton. The role of the skeleton is to store information about the parts of the input document that are not extractable, and to provide ways to merge back the parts that are extractable.
Because file formats are very different, they may need to use different types of skeleton mechanisms. For example, the skeleton for a binary file such as an OpenOffice.org ODT file (which is really a ZIP file) cannot be treated the same way as the skeleton of a Java properties file. The framework offers a transparent way to work with the different skeletons and lets the user ignore the underlying mechanism.
The skeleton parts are passed along with the resources of the events. A resource may or may not have an associated skeleton object.
To re-construct the original file format you need both the extracted resources and the skeleton parts passed through the events. The framework provides the IFilterWriter interface to do all this transparently.
First, you must create the filter, just like before, except this time we will use the HTML filter:

// Create a filter object
IFilter filter = new HtmlFilter();
Next, you need to create an IFilterWriter object. You do this by calling a method of the filter itself (IFilter.createFilterWriter()) that provides you with the proper implementation of IFilterWriter for the format the filter supports.

// Create the filter writer
IFilterWriter writer = filter.createFilterWriter();
Once the IFilterWriter object is created, you need to set its options. This is done with the IFilterWriter.setOptions() method. We need to set the output language; in this case we will use French. We also need to indicate which encoding to use for the output; in our example, we will choose Latin-1.

// Set the filter writer's options
writer.setOptions(LocaleId.fromString("fr"), "iso-8859-1");
We also need to set where the output will be generated. The type of object used for output can be different from the one used for the input. For example, here we will use a string as the input document, and write the output to a physical file. There are different methods to set the output:

- An OutputStream object.
- A URI object pointing to a physical location.

We are using the second method in this example:

// Set the output
writer.setOutput("myFile_fr.html");
Note that the output document is not created when you set the output, but only when the filter starts sending events.
Question: Can the output file be the same as the input file?
Answer: Yes, you should be able to overwrite the input document. However, to ensure this works, you should always close the input document before closing the output document.
The next step is to open the input document with the filter. This time we will use an HTML string:
// Open the input from a CharSequence
filter.open(new RawDocument("<html><head>\n"
    + "<meta http-equiv='Content-Language' content='en'></head>\n"
    + "<body>\n"
    + "<p>Text in <b>bold</b>.</p>"
    + "</body></html>",
    LocaleId.fromString("en")));
Now that all is set, we can process the document.
Re-writing the input document is simple: you call the IFilterWriter.handleEvent() method each time you get an event from the filter, and then close both input and output when all events have been processed. (Remember that you should always close the input document before the output document, in case you are writing to the same file.)
// Processing the input document
while ( filter.hasNext() ) {
    writer.handleEvent(filter.next());
}
// Closing the filter and the filter writer
filter.close();
writer.close();
The code above should create a new file called myFile_fr.html in your current directory, and its content should look like this:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv='Content-Language' content='fr'></head>
<body>
<p>Text in <b>bold</b>.</p></body></html>
As you can see, the filter writer makes some modifications automatically: the HTML language declaration has been updated to reflect the target language you specified ("fr"). The rest of the content is the same as the input.
Obviously, the real value of the filter writer is to save changes made to the extracted text back into the original format.
To perform changes in the extracted text you need to handle the TEXT_UNIT event, which comes with a TextUnit resource where the source text is stored.
It is always good practice to isolate the place where you code your changes, so we will create a method for it. Our changeTU() method takes one parameter: the TextUnit resource provided by the TEXT_UNIT event. The modifications are done directly in that object.
Before we make any change, we need to check whether this text unit is actually translatable. While most extracted text is translatable, there are cases where, for various reasons, the provider of the events (here a filter) decided to protect the content of the text unit. A good example of this is the XLIFF filter: it returns one text unit for each <trans-unit> of the XLIFF document, but some of those <trans-unit> elements may have their translate attribute set to no. The TextUnit.isTranslatable() method allows you to verify whether a given text unit is translatable, as shown below:
void changeTU (TextUnit tu) {
    // Check if this unit can be modified
    if ( !tu.isTranslatable() ) return; // If not, return without changes
Once we have established that we can modify the text, we need to create a copy of the source content for the target.
One important thing to keep in mind when working with filters is that some input documents can be multilingual (for example a PO file, or an XLIFF document). Because of that you may actually already have a target text in your text unit.
The TextUnit.hasTarget() method can check whether a target for a given language already exists. But there is a more convenient way to create the target conditionally: the TextUnit.createTarget() method is designed for this. It takes several parameters:

- The target language (e.g. "fr").
- A flag indicating if you want to overwrite the content of a possible existing target for that language. Set it to true to create a new entry even if one exists already. Set it to false to use the existing entry, or to create a new entry if none exists.
- A set of options indicating what content to copy from the source; use IResource.COPY_ALL to copy everything.

    TextContainer tc = tu.createTarget(LocaleId.fromString("fr"), false, IResource.COPY_ALL);
The result is a TextContainer object, which holds all the target-related data: text, as well as properties, annotations, etc.
Question: Is the language code case-sensitive?
Answer: No. When a language or locale identifier is set to a LocaleId object, it is normalized, so "fr" and "FR" are seen as identical.
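The case-folding part of that normalization can be mimicked in plain Java. Note that LocaleId's real normalization rules are richer than this one-line sketch, and normalize() here is a hypothetical helper, not part of the Okapi API:

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Hypothetical helper: case-folds a language tag the way a normalized
    // LocaleId would compare it; the real normalization is richer than this.
    static String normalize(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "fr" and "FR" normalize to the same value, so they compare as identical.
        System.out.println(normalize("fr").equals(normalize("FR"))); // true
    }
}
```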
To make any modification to the content you need to work with a string of coded text. This is a string with some special characters that mark up inline codes. A coded text string can usually be manipulated like a normal string, with some exceptions.
For this example, we want to convert the text to uppercase, and for that we can work directly with the coded text without problems. The content is accessible for each segment with the TextFragment.getCodedText() method. When the conversion is done you have to set the modified string back into the TextFragment using the TextFragment.setCodedText() method.
    ISegments segs = tc.getSegments();
    for ( Segment seg : segs ) {
        TextFragment tf = seg.getContent();
        tf.setCodedText(tf.getCodedText().toUpperCase());
    }
}
With our changeTU() method done, we can now add it to the filter's main event loop.
while ( filter.hasNext() ) {
    Event event = filter.next();
    if ( event.getEventType() == EventType.TEXT_UNIT ) {
        changeTU((TextUnit)event.getResource());
    }
    writer.handleEvent(event);
}
filter.close();
writer.close();
The output of our new program should look like this:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv='Content-Language' content='fr'></head>
<body>
<p>TEXT IN <b>BOLD</b>.</p></body></html>
One of the most important events generated by the filters is the TEXT_UNIT event. It corresponds to a logical unit of extractable text of the input document: for example, the content of a <p> element in HTML, or the value of a key/value pair in a Java properties file. A text unit corresponds more or less to a <trans-unit> element in XLIFF.
The text unit holds source and target data for the given extracted text, as well as properties (for the whole unit, for the source, and for each target) and annotations (also for the whole unit, for the source, and for each target). It also holds its corresponding skeleton object (if there is one).
The bottom line is that you can access the source text from the text unit, as well as create new translation entries or access existing ones (if the input document is multilingual).
Each language has a corresponding TextContainer object that holds the text as well as its associated properties and annotations. The text itself is in a TextFragment object. Those parts are easily accessible from the text unit:
TextUnit tu = new TextUnit("id1");
tu.setSourceContent(new TextFragment("My text"));
TextContainer tc = tu.getSource();
TextFragment tf1 = tc.getContent();
// Or
TextFragment tf2 = tu.getSourceContent();
In the example above both tf1 and tf2 point to the same object: the source text content of the text unit.
Once you have a TextFragment you can manipulate it almost like a classic string:
tf1.append(' ');
tf1.append("is this.");
// Prints "My text is this."
System.out.println(tf1.toString());
tf1.insert(3, new TextFragment("first "));
// Prints "My first text is this."
System.out.println(tf1.toString());
tf1.remove(13, 21);
// Prints "My first text."
System.out.println(tf1.toString());
There is, however, one major difference between a TextFragment and a string: the inline codes. Inline codes are spans of the extracted content that are not real text, but codes/markup embedded in the text. They often represent formatting information. For example, in the HTML content "Text in <b>bold</b>.", the two tags "<b>" and "</b>" are inline codes.
A TextFragment object can contain many inline codes:
TextUnit tu = new TextUnit("id1");
TextFragment tf = tu.setSourceContent(new TextFragment("Text in "));
tf.append(TagType.OPENING, "bold", "<b>");
tf.append("bold");
tf.append(TagType.CLOSING, "bold", "</b>");
tf.append(".");
// Prints "Text in <b>bold</b>."
System.out.println(tf.toString());
Separating text from codes allows translation tools to work in a more abstract way. For example, the HTML text "Text in <b>bold</b>." and the equivalent text in another format can be represented the same way in a TextFragment. This allows better handling of the content: improved translation memory leveraging; comparing codes between source and target; working with the text (e.g. spell-checking) without the codes being in the way; and much more.
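For instance, the "comparing codes between source and target" idea can be sketched in plain Java. Here a code is reduced to just its data string, and sameCodes() is a hypothetical helper written for this illustration, not an Okapi API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CodeCompareDemo {
    // Hypothetical helper: true if the target uses exactly the same inline
    // codes as the source, regardless of where they appear in the text.
    static boolean sameCodes(List<String> srcCodes, List<String> trgCodes) {
        List<String> a = new ArrayList<>(srcCodes);
        List<String> b = new ArrayList<>(trgCodes);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList("<b>", "</b>");
        System.out.println(sameCodes(src, Arrays.asList("</b>", "<b>"))); // true: order may differ
        System.out.println(sameCodes(src, Arrays.asList("<b>")));         // false: "</b>" is missing
    }
}
```

Because the codes are stored separately from the text, a check like this never has to parse the translated text itself.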
The content is separated into two parts: a coded text string where you have the real text and special markers for each code, and the list of the codes themselves. You can access the coded text with the TextFragment.getCodedText() method, and the list of codes with the TextFragment.getCodes() method. Most of the time, simple utilities only need to access the coded text.
String text = tf.getCodedText();
List<Code> codes = tf.getCodes();
The coded text part contains placeholders to represent the inline codes. Each one is composed of two special Unicode characters: a marker character that indicates the kind of code, followed by a character that encodes the index of the code in the list of codes. All these special characters are in the Private Use Area of Unicode.
Normal: "Text in <b>bold</b>."
Coded:  "Text in \uE101\uE110bold\uE102\uE111."
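Using the marker characters shown above, the coded-text layout can be reproduced with plain Java strings. The exact character values (U+E101, U+E102, U+E110, U+E111) are taken from the examples in this guide, and stripMarkers() is a hypothetical helper, not part of the framework:

```java
public class CodedTextDemo {
    // Marker characters as they appear in the examples above (assumed values).
    static final char MARKER_OPENING  = '\uE101';
    static final char MARKER_ISOLATED = '\uE103';

    // Hypothetical helper: drop each two-character marker pair to get plain text.
    static String stripMarkers(String coded) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            char ch = coded.charAt(i);
            if (ch >= MARKER_OPENING && ch <= MARKER_ISOLATED) {
                i++; // skip the index character that follows every marker
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String coded = "Text in \uE101\uE110bold\uE102\uE111.";
        System.out.println(coded.length());      // 17: each inline code takes two chars
        System.out.println(stripMarkers(coded)); // Text in bold.
    }
}
```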
The following method takes a TextFragment and counts the number of characters in the real text part of the coded text. You can use the TextFragment.isMarker() helper method to check whether a given character is an inline code marker. If it is, you need to skip the next character, as it represents the index of the inline code in the list of codes.
private static int countChars (TextFragment tf) {
    String text = tf.getCodedText();
    int count = 0;
    for ( int i=0; i<text.length(); i++ ) {
        if ( TextFragment.isMarker(text.charAt(i)) ) i++;
        else count++;
    }
    return count;
}
If you apply the method above to our TextFragment and compare it to the other length counts, you get:

- tf.getString().length() = 20
- tf.getCodedText().length() = 17
- countChars(tf) = 13

If you modify a coded text string, you need to set the modified string back into the TextFragment object. This is done with one of the TextFragment.setCodedText() methods.
The first method sets the coded text and re-uses the codes that are currently in the TextFragment. This implies that the inline code markers in the coded text you have modified must be unchanged. Extra or missing codes will trigger an error.
// Prints "Text in <b>bold</b>."
System.out.println(tf.toString());
String text = tf.getCodedText();
text = text.toUpperCase();
tf.setCodedText(text);
// Prints "TEXT IN <b>BOLD</b>."
System.out.println(tf.toString());
The second method is to set the new coded text and indicate that missing inline code markers in your new text mean the corresponding codes in the TextFragment should be deleted. Only extra codes will trigger an error.
// Prints "TEXT IN <b>BOLD</b>."
System.out.println(tf.toString());
text = tf.getCodedText();
text = text.substring(0, 14); // Allows the deletion of "</b>"
tf.setCodedText(text, true);
// Prints "TEXT IN <b>BOLD"
System.out.println(tf.toString());
Question: When the "<b>" code was originally added to the text it was set with a TagType.OPENING flag. Now that it does not have a corresponding closing tag, don't we have to change its type to something else?
Answer: No. The TagType flag remains the same ("<b>" is still a start tag). But the marker in the coded text for this inline code should now be MARKER_ISOLATED instead of MARKER_OPENING. This change was done automatically for you when we called setCodedText(). We will see more about how tag types and markers relate to each other later.
The third method is to specify the list of codes along with the modified coded text. This gives you complete control over the inline codes. If the list of codes you provide does not match the inline codes in the coded text string, an error will be triggered.
// Prints "TEXT IN <b>BOLD"
System.out.println(tf.toString());
text = tf.getCodedText();
// Create a new set of codes
List<Code> codes = new ArrayList<Code>();
codes.add(new Code(TagType.OPENING, "italic", "<i>"));
codes.add(new Code(TagType.CLOSING, "italic", "</i>"));
// Replace the text "BOLD" by "ITALIC"
text = text.replace("BOLD", "ITALIC");
// Add the marker for the new second inline code
text += (char)TextFragment.MARKER_CLOSING;
text += TextFragment.toChar(1);
tf.setCodedText(text, codes);
// Prints "TEXT IN <i>ITALIC</i>"
System.out.println(tf.toString());

In the code above, note the use of the TextFragment.toChar() helper method to add the index of the new inline code just after the marker. It allows you to convert a code index into its special character representation. The reverse method, TextFragment.toIndex(), converts a given character into a code index value. (Also note that, because the new codes are italic tags, the output now shows "<i>" and "</i>", and the final period is gone: it was removed by the substring() call in the previous example.)

Lastly, you can specify the list of codes along with the modified coded text, as well as a flag indicating whether missing codes can be removed from the provided list of codes. For example, the code below removes all the codes and replaces the text with a new one.

// Prints "TEXT IN <i>ITALIC</i>"
System.out.println(tf.toString());
// Remove all inline codes
tf.setCodedText("Normal text.", null, true);
// Prints "Normal text."
System.out.println(tf.toString());
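The toChar()/toIndex() helpers used above can be sketched as simple offset arithmetic. The base value 0xE110 is an inference from the examples in this guide (where \uE110 stands for code index 0), not a documented constant, and these methods are stand-ins for the real TextFragment helpers:

```java
public class MarkerIndexDemo {
    // Assumed base: in the examples above, '\uE110' stands for code index 0.
    static final int CHARBASE = 0xE110;

    // Sketches of TextFragment.toChar() / toIndex(): plain offset arithmetic.
    static char toChar(int index) { return (char)(CHARBASE + index); }
    static int  toIndex(char ch)  { return ch - CHARBASE; }

    public static void main(String[] args) {
        System.out.println(toChar(1) == '\uE111');     // true: index 1 -> U+E111
        System.out.println(toIndex('\uE111'));         // 1
        System.out.println(toIndex(toChar(42)) == 42); // true: the round-trip is lossless
    }
}
```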
Each inline code is associated with a TagType value. It can be OPENING, CLOSING, or PLACEHOLDER. (It can also be SEGMENTHOLDER in some cases of segmented entries, but we will ignore this for now.) You specify this information when adding the code to the fragment:
tf.append(TagType.OPENING, "bold", "<b>");
tf.append(TagType.CLOSING, "bold", "</b>");
tf.append(TagType.PLACEHOLDER, "lb", "<br/>");
You can retrieve it later:
assert(tf.getCode(0).getTagType() == TagType.OPENING);
assert(tf.getCode(1).getTagType() == TagType.CLOSING);
assert(tf.getCode(2).getTagType() == TagType.PLACEHOLDER);
This information normally remains unchanged: the code "<b>" is always a start tag, regardless of where it is and whether or not it has a corresponding closing tag.
There is a difference, however, between what the tag is and how it should be represented and manipulated from the viewpoint of an extracted segment. That information is related to the position of the inline code in the text, and is denoted through the kind of marker used to hold the spot of the code in the coded text. There are several markers: MARKER_OPENING, MARKER_CLOSING, and MARKER_ISOLATED. (There is also a MARKER_SEGMENT used in segmented entries, but we will ignore this for now.)
When a code with TagType.OPENING or TagType.CLOSING is alone in a fragment, or otherwise separated from its corresponding closing or opening counterpart, the marker is not set to MARKER_OPENING or MARKER_CLOSING but to MARKER_ISOLATED; its TagType, however, remains unchanged.
For example, in the code below, the closing "</b>", originally set with a MARKER_CLOSING, is changed to a MARKER_ISOLATED when the text is broken into two sentences in different fragments:

Normal: "First <b>bold. Second one</b>."
Coded:  "First \uE101\uE110bold. Second one\uE102\uE111."
Codes:  0={"<b>",TagType.OPENING}, 1={"</b>",TagType.CLOSING}

Normal f1: "First <b>bold. "
Coded f1:  "First \uE103\uE110bold. " (\uE101 becomes \uE103)
Codes f1:  0={"<b>",TagType.OPENING}

Normal f2: "Second one</b>."
Coded f2:  "Second one\uE103\uE110." (\uE102 becomes \uE103)
Codes f2:  0={"</b>",TagType.CLOSING}
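The marker remapping shown above can be sketched as follows. This toy handles only a single opening/closing pair and assumes the marker characters \uE101 (opening), \uE102 (closing), and \uE103 (isolated) from the examples in this guide; isolateUnpaired() is a hypothetical helper, not an Okapi API:

```java
public class IsolatedMarkerDemo {
    // Assumed marker characters, taken from the examples in this guide.
    static final char MARKER_OPENING  = '\uE101';
    static final char MARKER_CLOSING  = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';

    // Hypothetical helper: if a fragment contains an opening marker without its
    // closing counterpart (or vice versa), rewrite that marker as isolated.
    // The index character following the marker is left as-is.
    static String isolateUnpaired(String coded) {
        boolean hasOpen  = coded.indexOf(MARKER_OPENING) >= 0;
        boolean hasClose = coded.indexOf(MARKER_CLOSING) >= 0;
        if (hasOpen && !hasClose) return coded.replace(MARKER_OPENING, MARKER_ISOLATED);
        if (hasClose && !hasOpen) return coded.replace(MARKER_CLOSING, MARKER_ISOLATED);
        return coded; // both present: the pair is intact
    }

    public static void main(String[] args) {
        // The two halves of the split fragment from the example above
        // (the closing code is re-indexed to 0 in its new fragment).
        String f1 = isolateUnpaired("First \uE101\uE110bold. ");
        String f2 = isolateUnpaired("Second one\uE102\uE110.");
        System.out.println(f1.charAt(6) == MARKER_ISOLATED);  // true
        System.out.println(f2.charAt(10) == MARKER_ISOLATED); // true
    }
}
```

The TagType stored with each code never changes here; only the marker character in the coded text is rewritten, matching the behavior described above.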