Okapi Framework - Developer's Guide

Getting Started

Overview
Applications like Rainbow are built on top of the framework, but it is quite easy to develop other tools and scripts that use the framework components directly. This section describes how to write simple programs that use the Okapi Framework components to perform basic tasks such as reading and modifying a document.
One of the most important actors in the framework is the filter. All filters are accessible through a single common API: the IFilter interface. They take an input document in a given format and generate events that give you access to the extractable text of the document, as well as to the properties associated with it.
Question: Are these events like Java AWT events, working with a listener, etc.?
Answer: Not at all. In the context of the framework, an "event" is not related to Java "listeners" or "event sinks". Events are just the units of information (possibly with attached data) that are used to communicate between components. In the case of IFilter, events are obtained from the IFilter.next() method.
The filters are designed to work through a pipeline. A pipeline is a set of sequential steps, each step receiving events, processing them (if needed) and sending them to the next step. The framework has pre-defined components to build and execute pipelines easily, but we will see that later.
For now, let's concentrate on the filter itself. You do not need to use a pipeline to work with filters, as long as you know which events to send and receive.
Most events have an associated resource. Each resource is different depending on the type of event. Extracted text, properties, as well as grouping information are carried in those resources. The following table shows all the kinds of events that can go through a pipeline.
Events and Corresponding Resources:
Event | Resource | Filter-Specific Event?
---|---|---
START_BATCH | none | no (starts a batch)
START_BATCH_ITEM | none | no (starts a batch item)
RAW_DOCUMENT | RawDocument | no (corresponds to a document by itself)
START_DOCUMENT | StartDocument | yes
START_SUBDOCUMENT | StartSubDocument | yes
START_GROUP | StartGroup | yes
DOCUMENT_PART | DocumentPart | yes
TEXT_UNIT | TextUnit | yes
END_GROUP | Ending | yes
END_SUBDOCUMENT | Ending | yes
END_DOCUMENT | Ending | yes
CANCELED | none | no (used when a process is canceled)
FINISHED | none | no (used at the end of a set of documents)
NO_OP | none | no (used for no-operation)
CUSTOM | custom resource | no (used for custom events)
END_BATCH_ITEM | none | no (ends a batch item)
END_BATCH | none | no (ends a batch)
A filter always generates at least the START_DOCUMENT and END_DOCUMENT events. All other filter-specific events may or may not be generated, depending on each filter and each document. See the section Filter Events for more details.
Any event that is not understood by a component should simply be passed along down the pipeline.
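As a sketch of this pass-through behavior (and of how pipeline steps hand events to each other), here is a toy model in plain Java. None of these types are the real Okapi API; ToyEventType, ToyStep, and UppercasingStep are invented stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-ins for the framework types; NOT the real Okapi API.
enum ToyEventType { START_DOCUMENT, TEXT_UNIT, CUSTOM, END_DOCUMENT }

interface ToyStep {
    ToyEventType handle(ToyEventType event);
}

// A step that only acts on TEXT_UNIT; every other event is forwarded untouched.
class UppercasingStep implements ToyStep {
    int handled = 0;
    public ToyEventType handle(ToyEventType event) {
        if (event == ToyEventType.TEXT_UNIT) {
            handled++; // a real step would modify the TextUnit resource here
        }
        return event; // pass everything, including unknown events, down the pipeline
    }
}

public class PassThroughDemo {
    public static void main(String[] args) {
        List<ToyEventType> input = Arrays.asList(
            ToyEventType.START_DOCUMENT, ToyEventType.TEXT_UNIT,
            ToyEventType.CUSTOM, ToyEventType.TEXT_UNIT, ToyEventType.END_DOCUMENT);
        ToyStep step = new UppercasingStep();
        List<ToyEventType> output = new ArrayList<>();
        for (ToyEventType e : input) {
            output.add(step.handle(e)); // each step forwards events to the next one
        }
        System.out.println(output.equals(input)); // true: no event was dropped
    }
}
```

The important point is the last line of handle(): even an event the step knows nothing about (here CUSTOM) is returned unchanged, so later steps still receive it.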
To use a filter, you first have to create a filter object. The Okapi Framework provides you with several filters, and you can write your own as well. In the example below, we use the net.sf.okapi.filters.properties.PropertiesFilter class, which implements the IFilter interface for Java properties files.

// Create a filter object
IFilter filter = new PropertiesFilter();
Some filters have options that are specific to the format they parse. You can set these filter-specific parameters with the IFilter.setParameters() method. The IFilter.getParameters() method allows you to retrieve the current parameters.
The PropertiesFilter has such options, but the defaults are just fine for this example, so we won't set any additional parameters.
Question: Is there an easy way to edit the filter-specific parameters?
Answer: It depends. The parameters are accessible through the IParameters interface and, at the least, this interface allows you to save the parameters to a file; that file can be a simple property-like text file. Some filters also provide a UI for modifying their parameters. See the Filter Parameters section of the guide for more details.
The next step is to open the document. This is done with the help of a RawDocument object, which carries all the information the filter needs to open the input document:

- A CharSequence object that contains the document itself (String is a type of CharSequence).
- A URI object pointing to the physical document.
- An InputStream object.

In the code below we set the source language to English, using the same standard language tag identifiers you would use with HTML and XML (BCP-47). We set the default encoding to UTF-8. Note that some filters are able to automatically detect the correct encoding of the input and ignore this value.
// Creates the RawDocument object
RawDocument res = new RawDocument(myInputStream, "UTF-8", LocaleId.fromString("en"));
// Opens the document
filter.open(res);
By default a filter generates the skeleton for the document (we'll see more about the skeleton later). If, for some reason, you do not want to generate the skeleton you can specify an additional parameter and use:
// Opens the document, without generating skeleton
filter.open(res, false);
Question: Why use an intermediate object for opening the document? Wouldn't it be simpler to just pass the parameters to open()?
Answer: The RawDocument object is used because in many cases filters are used in the broader environment of pipelines, where having all information about the input document in one object makes things much easier. The object is also the resource associated with the RAW_DOCUMENT event, which makes it possible to mix, in the same pipeline, steps using filter events and steps working on the input document directly.
Note that while the filters should do their best to support all the different input objects, this may not be possible in some cases because of the format the filter deals with. For example, a filter for Adobe Photoshop PSD files does not support CharSequence input objects, as PSD files are binary.
In our example we want to open a String-based document:
// Creates the RawDocument object
RawDocument res = new RawDocument("key1=Text1\nkey2=Text2", LocaleId.fromString("en"));
// Opens the document
filter.open(res);
This is equivalent to opening a physical properties file on the disk that has the following content:
key1=Text1
key2=Text2
Everything is now in place to start processing the input document. We use the IFilter.hasNext() method to see if there is any event to access. If there is, we use the IFilter.next() method to get the actual event. Keep the following rules in mind:

- Always call IFilter.hasNext() before calling IFilter.next().
- Call IFilter.next() once and only once before the next call to IFilter.hasNext().
- Do not rely on IFilter.next() returning null if there are no more events.

// Proper call of hasNext() and next()
while ( filter.hasNext() ) {
    Event event = filter.next();
    // Do something...
}
Once you have an event, you can query its type using Event.getEventType(). The value returned is one of the constants defined in EventType. Then, depending on the type, you can get the resource associated with this event (Event.getResource()) and access the data in the resource.
After the last event for the input document (END_DOCUMENT) has been read, IFilter.hasNext() returns false, and you can close the input. Note that it is good practice for filters to close the input themselves before sending the last event, but you should still call close() just in case.
// Get the events from the input document
while ( filter.hasNext() ) {
    Event event = filter.next();
    // Do something with the event...
    // Here, if the event is TEXT_UNIT, we display the key and the extracted text
    if ( event.getEventType() == EventType.TEXT_UNIT ) {
        TextUnit tu = (TextUnit)event.getResource();
        System.out.println("--");
        System.out.println("key=["+tu.getName()+"]");
        System.out.println("text=["+tu.getSource()+"]");
    }
}
// Close the input document
filter.close();
This should generate the following output:
--
key=[key1]
text=[Text1]
--
key=[key2]
text=[Text2]
Extracting the text parts of a document is useful, but an even more useful feature of the Okapi Framework is the ability to write the extracted data back into the original format.
As we have seen above, when you open a document with a filter, you can specify whether to generate the skeleton. The role of the skeleton is to store information about the parts of the input document that are not extractable, and to provide ways to merge back the parts that are extractable.
Because file formats are very different, they may need to use different types of skeleton mechanisms. For example, the skeleton for a binary file such as an OpenOffice.org ODT file (which is really a ZIP file) cannot be treated the same way as the skeleton of a Java properties file. The framework offers a transparent way to work with the different skeletons and lets the user ignore the underlying mechanism.
The skeleton parts are passed along with the resources of the events. A resource may or may not have an associated skeleton object.
To re-construct the original file format you need both the extracted resources and the skeleton parts passed through the events. The framework provides the IFilterWriter interface to do all this transparently.
First, you must create the filter, just like before, except this time we will use the HTML filter:

// Create a filter object
IFilter filter = new HtmlFilter();
Next, you need to create an IFilterWriter object. You do this by calling a method of the filter itself (IFilter.createFilterWriter()) that provides you with the proper implementation of IFilterWriter for the format the filter supports.

// Create the filter writer
IFilterWriter writer = filter.createFilterWriter();
Once the IFilterWriter object is created, you need to set its options. This is done with the IFilterWriter.setOptions() method. We need to set the output language; in this case we will use French. We also need to indicate which encoding to use for the output; in our example, we will choose Latin-1.

// Set the filter writer's options
writer.setOptions(LocaleId.fromString("fr"), "iso-8859-1");
We also need to set where the output will be generated. The type of object used for output can be different from the one used for the input. For example, here we will use a string as the input document, and write the output to a physical file. There are different methods to set the output:

- An OutputStream object.
- A URI object pointing to a physical location.

We are using the second method in this example:

// Set the output
writer.setOutput("myFile_fr.html");
Note that the output document is not created when you set the output, but only when the filter starts sending events.
Question: Can the output file be the same as the input file?
Answer: Yes, you should be able to overwrite the input document. However, to ensure this works, you should always close the input document before closing the output document.
The next step is to open the input document with the filter. This time we will use an HTML string:
// Open the input from a CharSequence
filter.open(new RawDocument("<html><head>\n"
    + "<meta http-equiv='Content-Language' content='en'></head>\n"
    + "<body>\n"
    + "<p>Text in <b>bold</b>.</p>"
    + "</body></html>",
    LocaleId.fromString("en")));
Now that all is set, we can process the document.
Re-writing the input document is simple: you call the IFilterWriter.handleEvent() method each time you get an event from the filter, and then close both input and output when all events have been processed. (Remember that you should always close the input document before the output document, in case you are writing to the same file.)
// Processing the input document
while ( filter.hasNext() ) {
    writer.handleEvent(filter.next());
}
// Closing the filter and the filter writer
filter.close();
writer.close();
The code above should create a new file called myFile_fr.html in your current directory, and its content should look like this:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv='Content-Language' content='fr'></head>
<body>
<p>Text in <b>bold</b>.</p></body></html>
As you can see, the filter writer makes some modifications automatically: the HTML language declaration has been updated to reflect the target language you specified ("fr"). The rest of the content is the same as the input.
Obviously, the real value of the filter writer is to save changes made to the extracted text back into the original format.
To perform changes in the extracted text you need to handle the TEXT_UNIT event, which comes with a TextUnit resource where the source text is stored.
It is always good practice to isolate the place where you code your changes, so we will create a method for it. Our changeTU() method takes one parameter: the TextUnit resource provided by the TEXT_UNIT event. The modifications are done directly in that object.
Before we make any change, we need to check whether this text unit is actually translatable. While most extracted text is translatable, there are cases where, for various reasons, the provider of the events (here a filter) decided to protect the content of the text unit. A good example of this is the XLIFF filter: it returns one text unit for each <trans-unit> of the XLIFF document, but some of those <trans-unit> elements may have their translate attribute set to no. The TextUnit.isTranslatable() method allows you to verify whether a given text unit is translatable, as shown below:
void changeTU (TextUnit tu) {
    // Check if this unit can be modified
    if ( !tu.isTranslatable() ) return; // If not, return without changes
Once we have established that we can modify the text, we need to create a copy of the source content for the target.
One important thing to keep in mind when working with filters is that some input documents can be multilingual (for example a PO file, or an XLIFF document). Because of that you may actually already have a target text in your text unit.
The TextUnit.hasTarget() method can check whether a target for a given language already exists. But there is a more convenient way to create the target conditionally: the TextUnit.createTarget() method is designed for this. It takes several parameters:

- The target language (e.g. "fr").
- A flag indicating if you want to overwrite the content of a possible existing target for that language. Set it to true to create a new entry even if one exists already. Set it to false to use the existing entry, or to create a new entry if none exists.
- A set of options indicating what content to copy from the source; use IResource.COPY_ALL to copy everything.

    TextContainer tc = tu.createTarget(LocaleId.fromString("fr"), false, IResource.COPY_ALL);
The result is a TextContainer object, which holds all the target-related data: text, as well as properties, annotations, etc.
Question: Is the language code case-sensitive?
Answer: No. When a language or locale identifier is set to a LocaleId object, it is normalized, so "fr" and "FR" are seen as identical.
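The case-folding part of that normalization can be mimicked in plain Java. Note that LocaleId's real normalization rules are richer than this one-line sketch, and normalize() here is a hypothetical helper, not part of the Okapi API:

```java
import java.util.Locale;

public class LocaleCaseDemo {
    // Hypothetical helper: case-folds a language tag the way a normalized
    // LocaleId would compare it; the real normalization is richer than this.
    static String normalize(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "fr" and "FR" normalize to the same value, so they compare as identical.
        System.out.println(normalize("fr").equals(normalize("FR"))); // true
    }
}
```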
To make any modification to the content you need to work with a string of coded text. This is a string with some special characters that mark up inline codes. A coded text string can usually be manipulated like a normal string, with some exceptions.
For this example, we want to convert the text to uppercase, and for that we can work directly with the coded text without problems. The content is accessible for each segment with the TextFragment.getCodedText() method. When the conversion is done you have to set the modified string back into the TextFragment using the TextFragment.setCodedText() method.
    ISegments segs = tc.getSegments();
    for ( Segment seg : segs ) {
        TextFragment tf = seg.getContent();
        tf.setCodedText(tf.getCodedText().toUpperCase());
    }
}
With our changeTU() method done, we can now add it to the filter's main event loop.
while ( filter.hasNext() ) {
    Event event = filter.next();
    if ( event.getEventType() == EventType.TEXT_UNIT ) {
        changeTU((TextUnit)event.getResource());
    }
    writer.handleEvent(event);
}
filter.close();
writer.close();
The output of our new program should look like this:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv='Content-Language' content='fr'></head>
<body>
<p>TEXT IN <b>BOLD</b>.</p></body></html>
One of the most important events generated by the filters is the TEXT_UNIT event. It corresponds to a logical unit of extractable text of the input document: for example, the content of a <p> element in HTML, or the value of a key/value pair in a Java properties file. A text unit corresponds more or less to a <trans-unit> element in XLIFF.
The text unit holds source and target data for the given extracted text, as well as properties (for the whole unit, for the source, and for each target) and annotations (also for the whole unit, for the source, and for each target). It also holds its corresponding skeleton object (if there is one).
The bottom line is that you can access the source text from the text unit, as well as create new translation entries or access existing ones (if the input document is multilingual).
Each language has a corresponding TextContainer object that holds the text as well as its associated properties and annotations. The text itself is in a TextFragment object. Those parts are easily accessible from the text unit:
TextUnit tu = new TextUnit("id1");
tu.setSourceContent(new TextFragment("My text"));
TextContainer tc = tu.getSource();
TextFragment tf1 = tc.getContent();
// Or
TextFragment tf2 = tu.getSourceContent();
In the example above both tf1 and tf2 point to the same object: the source text content of the text unit.
Once you have a TextFragment you can manipulate it almost like a classic string:
tf1.append(' ');
tf1.append("is this.");
// Prints "My text is this."
System.out.println(tf1.toString());
tf1.insert(3, new TextFragment("first "));
// Prints "My first text is this."
System.out.println(tf1.toString());
tf1.remove(13, 21);
// Prints "My first text."
System.out.println(tf1.toString());
There is, however, one major difference between a TextFragment and a string: the inline codes. Inline codes are spans of the extracted content that are not real text, but codes/markup embedded in the text. They often represent formatting information. For example, in the HTML content "Text in <b>bold</b>.", the two tags "<b>" and "</b>" are inline codes.
A TextFragment object can contain many inline codes:
TextUnit tu = new TextUnit("id1");
TextFragment tf = tu.setSourceContent(new TextFragment("Text in "));
tf.append(TagType.OPENING, "bold", "<b>");
tf.append("bold");
tf.append(TagType.CLOSING, "bold", "</b>");
tf.append(".");
// Prints "Text in <b>bold</b>."
System.out.println(tf.toString());
Separating text from codes allows translation tools to work in a more abstract way. For example, the HTML text "Text in <b>bold</b>." and the equivalent text in another format can be represented the same way in a TextFragment. This allows better handling of the content: improved translation memory leveraging; comparing codes between source and target; working with the text (e.g. spell-checking) without the codes being in the way; and much more.
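For instance, the "comparing codes between source and target" idea can be sketched in plain Java. Here a code is reduced to just its data string, and sameCodes() is a hypothetical helper written for this illustration, not an Okapi API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CodeCompareDemo {
    // Hypothetical helper: true if the target uses exactly the same inline
    // codes as the source, regardless of where they appear in the text.
    static boolean sameCodes(List<String> srcCodes, List<String> trgCodes) {
        List<String> a = new ArrayList<>(srcCodes);
        List<String> b = new ArrayList<>(trgCodes);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList("<b>", "</b>");
        System.out.println(sameCodes(src, Arrays.asList("</b>", "<b>"))); // true: order may differ
        System.out.println(sameCodes(src, Arrays.asList("<b>")));         // false: "</b>" is missing
    }
}
```

Because the codes are stored separately from the text, a check like this never has to parse the translated text itself.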
The content is separated into two parts: a coded text string where you have the real text and special markers for each code, and the list of the codes themselves. You can access the coded text with the TextFragment.getCodedText() method, and the list of codes with the TextFragment.getCodes() method. Most of the time, simple utilities only need to access the coded text.
String text = tf.getCodedText();
List<Code> codes = tf.getCodes();
The coded text part contains placeholders to represent the inline codes. Each one is composed of two special Unicode characters: a marker character that indicates the kind of code, followed by a character that encodes the index of the code in the list of codes. All these special characters are in the Private Use Area of Unicode.
Normal: "Text in <b>bold</b>."
Coded:  "Text in \uE101\uE110bold\uE102\uE111."
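Using the marker characters shown above, the coded-text layout can be reproduced with plain Java strings. The exact character values (U+E101, U+E102, U+E110, U+E111) are taken from the examples in this guide, and stripMarkers() is a hypothetical helper, not part of the framework:

```java
public class CodedTextDemo {
    // Marker characters as they appear in the examples above (assumed values).
    static final char MARKER_OPENING  = '\uE101';
    static final char MARKER_ISOLATED = '\uE103';

    // Hypothetical helper: drop each two-character marker pair to get plain text.
    static String stripMarkers(String coded) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            char ch = coded.charAt(i);
            if (ch >= MARKER_OPENING && ch <= MARKER_ISOLATED) {
                i++; // skip the index character that follows every marker
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String coded = "Text in \uE101\uE110bold\uE102\uE111.";
        System.out.println(coded.length());      // 17: each inline code takes two chars
        System.out.println(stripMarkers(coded)); // Text in bold.
    }
}
```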
The following method takes a TextFragment and counts the number of characters in the real text part of the coded text. You can use the TextFragment.isMarker() helper method to check whether a given character is an inline code marker. If it is, you need to skip the next character, as it represents the index of the inline code in the list of codes.
private static int countChars (TextFragment tf) {
    String text = tf.getCodedText();
    int count = 0;
    for ( int i=0; i<text.length(); i++ ) {
        if ( TextFragment.isMarker(text.charAt(i)) ) i++;
        else count++;
    }
    return count;
}
If you apply the method above to our TextFragment and compare it to the other length counts, you get:

- tf.getString().length() = 20
- tf.getCodedText().length() = 17
- countChars(tf) = 13

If you modify a coded text string, you need to set the modified string back into the TextFragment object. This is done with one of the TextFragment.setCodedText() methods.
The first method sets the coded text and re-uses the codes that are currently in the TextFragment. This implies that the inline code markers in the coded text you have modified must be unchanged. Extra or missing codes will trigger an error.
// Prints "Text in <b>bold</b>."
System.out.println(tf.toString());
String text = tf.getCodedText();
text = text.toUpperCase();
tf.setCodedText(text);
// Prints "TEXT IN <b>BOLD</b>."
System.out.println(tf.toString());
The second method is to set the new coded text and indicate that missing inline code markers in your new text mean the corresponding codes in the TextFragment should be deleted. Only extra codes will trigger an error.
// Prints "TEXT IN <b>BOLD</b>."
System.out.println(tf.toString());
text = tf.getCodedText();
text = text.substring(0, 14); // Allows the deletion of "</b>"
tf.setCodedText(text, true);
// Prints "TEXT IN <b>BOLD"
System.out.println(tf.toString());
Question: When the "<b>" code was originally added to the text it was set with a TagType.OPENING flag. Now that it does not have a corresponding closing tag, don't we have to change its type to something else?
Answer: No. The TagType flag remains the same ("<b>" is still a start tag). But the marker in the coded text for this inline code should now be MARKER_ISOLATED instead of MARKER_OPENING. This change was done automatically for you when we called setCodedText(). We will see more about how tag types and markers relate to each other later.
The third method is to specify the list of codes along with the modified coded text. This gives you complete control over the inline codes. If the list of codes you provide does not match the inline codes in the coded text string, an error will be triggered.
// Prints "TEXT IN <b>BOLD"
System.out.println(tf.toString());
text = tf.getCodedText();
// Create a new set of codes
List<Code> codes = new ArrayList<Code>();
codes.add(new Code(TagType.OPENING, "italic", "<i>"));
codes.add(new Code(TagType.CLOSING, "italic", "</i>"));
// Replace the text "BOLD" by "ITALIC"
text = text.replace("BOLD", "ITALIC");
// Add the marker for the new second inline code
text += (char)TextFragment.MARKER_CLOSING;
text += TextFragment.toChar(1);
tf.setCodedText(text, codes);
// Prints "TEXT IN <i>ITALIC</i>"
System.out.println(tf.toString());

In the code above, note the use of the TextFragment.toChar() helper method to add the index of the new inline code just after the marker. It allows you to convert a code index into its special character representation. The reverse method, TextFragment.toIndex(), converts a given character into a code index value. (Also note that, because the new codes are italic tags, the output now shows "<i>" and "</i>", and the final period is gone: it was removed by the substring() call in the previous example.)

Lastly, you can specify the list of codes along with the modified coded text, as well as a flag indicating whether missing codes can be removed from the provided list of codes. For example, the code below removes all the codes and replaces the text with a new one.

// Prints "TEXT IN <i>ITALIC</i>"
System.out.println(tf.toString());
// Remove all inline codes
tf.setCodedText("Normal text.", null, true);
// Prints "Normal text."
System.out.println(tf.toString());
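The toChar()/toIndex() helpers used above can be sketched as simple offset arithmetic. The base value 0xE110 is an inference from the examples in this guide (where \uE110 stands for code index 0), not a documented constant, and these methods are stand-ins for the real TextFragment helpers:

```java
public class MarkerIndexDemo {
    // Assumed base: in the examples above, '\uE110' stands for code index 0.
    static final int CHARBASE = 0xE110;

    // Sketches of TextFragment.toChar() / toIndex(): plain offset arithmetic.
    static char toChar(int index) { return (char)(CHARBASE + index); }
    static int  toIndex(char ch)  { return ch - CHARBASE; }

    public static void main(String[] args) {
        System.out.println(toChar(1) == '\uE111');     // true: index 1 -> U+E111
        System.out.println(toIndex('\uE111'));         // 1
        System.out.println(toIndex(toChar(42)) == 42); // true: the round-trip is lossless
    }
}
```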
Each inline code is associated with a TagType value. It can be OPENING, CLOSING, or PLACEHOLDER. (It can also be SEGMENTHOLDER in some cases of segmented entries, but we will ignore this for now.) You specify this information when adding the code to the fragment:
tf.append(TagType.OPENING, "bold", "<b>");
tf.append(TagType.CLOSING, "bold", "</b>");
tf.append(TagType.PLACEHOLDER, "lb", "<br/>");
You can retrieve it later:
assert(tf.getCode(0).getTagType() == TagType.OPENING);
assert(tf.getCode(1).getTagType() == TagType.CLOSING);
assert(tf.getCode(2).getTagType() == TagType.PLACEHOLDER);
This information normally remains unchanged: the code "<b>" is always a start tag, regardless of where it is and whether or not it has a corresponding closing tag.
There is a difference, however, between what the tag is and how it should be represented and manipulated from the viewpoint of an extracted segment. That information is related to the position of the inline code in the text, and is denoted through the kind of marker used to hold the spot of the code in the coded text. There are several markers: MARKER_OPENING, MARKER_CLOSING, and MARKER_ISOLATED. (There is also a MARKER_SEGMENT used in segmented entries, but we will ignore this for now.)
When a code with TagType.OPENING or TagType.CLOSING is alone in a fragment, or otherwise separated from its corresponding closing or opening counterpart, the marker is not set to MARKER_OPENING or MARKER_CLOSING but to MARKER_ISOLATED; its TagType, however, remains unchanged.
For example, in the code below, the closing "</b>", originally set with a MARKER_CLOSING, is changed to a MARKER_ISOLATED when the text is broken into two sentences in different fragments:

Normal: "First <b>bold. Second one</b>."
Coded:  "First \uE101\uE110bold. Second one\uE102\uE111."
Codes:  0={"<b>",TagType.OPENING}, 1={"</b>",TagType.CLOSING}

Normal f1: "First <b>bold. "
Coded f1:  "First \uE103\uE110bold. " (\uE101 becomes \uE103)
Codes f1:  0={"<b>",TagType.OPENING}

Normal f2: "Second one</b>."
Coded f2:  "Second one\uE103\uE110." (\uE102 becomes \uE103)
Codes f2:  0={"</b>",TagType.CLOSING}
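The marker remapping shown above can be sketched as follows. This toy handles only a single opening/closing pair and assumes the marker characters \uE101 (opening), \uE102 (closing), and \uE103 (isolated) from the examples in this guide; isolateUnpaired() is a hypothetical helper, not an Okapi API:

```java
public class IsolatedMarkerDemo {
    // Assumed marker characters, taken from the examples in this guide.
    static final char MARKER_OPENING  = '\uE101';
    static final char MARKER_CLOSING  = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';

    // Hypothetical helper: if a fragment contains an opening marker without its
    // closing counterpart (or vice versa), rewrite that marker as isolated.
    // The index character following the marker is left as-is.
    static String isolateUnpaired(String coded) {
        boolean hasOpen  = coded.indexOf(MARKER_OPENING) >= 0;
        boolean hasClose = coded.indexOf(MARKER_CLOSING) >= 0;
        if (hasOpen && !hasClose) return coded.replace(MARKER_OPENING, MARKER_ISOLATED);
        if (hasClose && !hasOpen) return coded.replace(MARKER_CLOSING, MARKER_ISOLATED);
        return coded; // both present: the pair is intact
    }

    public static void main(String[] args) {
        // The two halves of the split fragment from the example above
        // (the closing code is re-indexed to 0 in its new fragment).
        String f1 = isolateUnpaired("First \uE101\uE110bold. ");
        String f2 = isolateUnpaired("Second one\uE102\uE110.");
        System.out.println(f1.charAt(6) == MARKER_ISOLATED);  // true
        System.out.println(f2.charAt(10) == MARKER_ISOLATED); // true
    }
}
```

The TagType stored with each code never changes here; only the marker character in the coded text is rewritten, matching the behavior described above.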