Okapi Framework - Developer's Guide
An annotation, in a resource generated from a
filter, is a piece of information associated with a
specific class. An example of an annotation is the information provided in the
<alt-trans> element of an XLIFF document. This data is mapped to an
annotation implemented with a class called
Note that annotations can also be added after extraction and used to attach various process-specific information to an object.
A coded text is text content in which inline codes have been pre-parsed and replaced by inline code markers.
The code type of a
Code object is an indicator of what this code represents.
Several code types are pre-defined and should be used whenever possible. for
The code type of a code is used extensively to link opening and closing
codes: both opening and closing codes of the same pair must have the same code
type. You can use
null for the code type, but you then lose any pairing
An event is a unit of information that is carried through the pipeline. Events can be associated with physical data (the resources) and, in the case of filter events, with a skeleton.
The RAW_DOCUMENT event is used to pass along the information about an input document.
END_DOCUMENT) are used to pass along a document broken down into translatable and non-translatable parts.
The CANCELED event is used when the process has been canceled.
The FINISHED event is used to mark the end of a set of input documents.
The NO_OP event is used in some cases to send events that do nothing.
The CUSTOM event is used for carrying custom resources through the pipeline.
As a general rule, when a component gets an event it does not understand, it should just pass it down the pipeline, without modifying it.
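The pass-through rule above can be sketched in a few lines of Java. This is only an illustration: `PassThroughSketch`, `Event`, and `DocumentCounter` are hypothetical stand-ins written for this example, not classes of the framework, and the enum lists only the event names mentioned in this section.

```java
class PassThroughSketch {
    // Hypothetical subset of event types, limited to names used in this section.
    enum EventType { RAW_DOCUMENT, CANCELED, FINISHED, NO_OP, CUSTOM }

    // Minimal stand-in for an event: a type plus an optional resource.
    static final class Event {
        final EventType type;
        final Object resource;

        Event(EventType type, Object resource) {
            this.type = type;
            this.resource = resource;
        }
    }

    // A component that only understands RAW_DOCUMENT events.
    static final class DocumentCounter {
        int count = 0;

        Event handleEvent(Event event) {
            switch (event.type) {
                case RAW_DOCUMENT:
                    count++;          // the one event this component acts on
                    return event;
                default:
                    return event;     // not understood: pass it along unmodified
            }
        }
    }

    public static void main(String[] args) {
        DocumentCounter counter = new DocumentCounter();
        Event custom = new Event(EventType.CUSTOM, "some custom resource");
        Event out = counter.handleEvent(custom);
        System.out.println(out == custom);   // same object goes down the pipeline
    }
}
```

The component returns the very same event object for anything it does not handle, so later steps see it exactly as it was sent.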
A filter class is an implementation of the
interface. Its purpose is to parse an input document and break it down into
events. A filter separates the input document into
A filter writer class is an implementation of the
IFilterWriter interface. Its purpose is to output the document processed
by a filter. The format of the output can be different for
each filter writer. There is normally one filter writer that is capable of
reconstructing the original format of the input document. A filter provides an
instance of such a filter writer if you call the
An inline code is some type of markup inside a run of text. For example, "<b>" and "</b>" are two inline codes in the text "<b>bold</b>". Inline codes are often used to apply formatting, but they
can be used for other things. What is an inline code and what is not depends on
each filter and sometimes on its parameters.
An inline code marker is a pair of special Unicode characters placed inside a coded text to hold the place of an inline code. The two special characters are:
The first character indicates the type of marker: opening (TextFragment.MARKER_OPENING), closing (TextFragment.MARKER_CLOSING), or isolated (TextFragment.MARKER_ISOLATED) code. These values are U+E101, U+E102, and U+E103 (part of the Private Use Area of Unicode).
The second character is the index of the corresponding Code object in the list of codes for the fragment where this code occurs. The value is the zero-based index of the code + 57616 (0xE110) converted to a character: that is, 0 is U+E110, 1 is U+E111, 2 is U+E112, etc. Those characters are in the Private Use Area of Unicode and allow for several thousand index values.
In the example below, the inline code markers for "<b>" and "</b>" are highlighted. (Note: for display purposes, the special characters are written \uHHHH, where HHHH is their Unicode value, but they are raw characters in the coded text string.)
Normal text: "Text in <b>bold</b>"
Coded text:  "Text in \uE101\uE110bold\uE102\uE111"

"..in \uE101\uE110bold..."
      |     |
      |     +--- Index of the code
      +--- Code marker
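The marker scheme described above can be reproduced with a small, self-contained Java sketch. `CodedTextSketch` is a hypothetical class written only for this illustration; it is not the framework's TextFragment, and it stores each code as a plain string rather than a Code object. The marker values and the 57616 offset are the ones given in the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the framework's TextFragment.
class CodedTextSketch {
    // Marker values as given in the text (U+E101, U+E102, U+E103).
    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';
    // Offset added to the zero-based code index (57616 is 0xE110).
    static final int INDEX_BASE = 57616;

    private final StringBuilder text = new StringBuilder();
    private final List<String> codes = new ArrayList<>();

    CodedTextSketch append(String plain) {
        text.append(plain);
        return this;
    }

    // Stores the code data and writes the two-character marker in its place.
    CodedTextSketch appendCode(char marker, String data) {
        codes.add(data);
        text.append(marker).append(toChar(codes.size() - 1));
        return this;
    }

    static char toChar(int index) {
        return (char) (INDEX_BASE + index);
    }

    static int toIndex(char c) {
        return c - INDEX_BASE;
    }

    String getCodedText() {
        return text.toString();
    }

    // Rebuilds the original text by replacing each marker pair with its code data.
    String toOriginal() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == MARKER_OPENING || c == MARKER_CLOSING || c == MARKER_ISOLATED) {
                out.append(codes.get(toIndex(text.charAt(++i))));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CodedTextSketch fragment = new CodedTextSketch()
            .append("Text in ")
            .appendCode(MARKER_OPENING, "<b>")
            .append("bold")
            .appendCode(MARKER_CLOSING, "</b>");
        System.out.println(fragment.toOriginal());
    }
}
```

Because each inline code occupies exactly two characters in the coded text, any code that walks the string must skip the index character after a marker, as `toOriginal` does.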
Inline code markers are different from, but related to, tag
types. Normally they match (i.e. a
TagType.OPENING will be
represented by a
TextFragment.MARKER_OPENING). But there are cases
when the opening and closing codes get split (for example in a segmented text) and
the inline code marker is changed to
while the underlying code is still of the same tag type. See the section
Tag Type and Marker for more details.
A pipeline is a set of steps that carry out a specific process for a given list of input documents. The most common pipelines start with a step that uses a filter to parse the input document and end with a step that uses a filter writer to create the output.
See the Pipelines section for more details on pipelines.
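As a rough mental model, a pipeline can be pictured as an ordered list of steps applied to each input document in turn. The sketch below is a deliberate simplification: `PipelineSketch` is a hypothetical class, and its steps transform plain strings, whereas the framework's pipelines carry events and resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical simplification: steps transform strings instead of events.
class PipelineSketch {
    private final List<UnaryOperator<String>> steps = new ArrayList<>();

    PipelineSketch addStep(UnaryOperator<String> step) {
        steps.add(step);
        return this;
    }

    // Runs one document through every step, in the order the steps were added.
    String process(String document) {
        String current = document;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    // Runs the same steps over a list of input documents.
    List<String> processAll(List<String> documents) {
        List<String> results = new ArrayList<>();
        for (String document : documents) {
            results.add(process(document));
        }
        return results;
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch()
            .addStep(String::trim)              // stands in for the parsing step
            .addStep(String::toUpperCase)       // stands in for a processing step
            .addStep(s -> "[" + s + "]");       // stands in for the output step
        System.out.println(pipeline.process("  hello "));
    }
}
```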
A property, in a resource generated from a filter, is a piece of information associated with a specific name. Properties are used to give access to simple data that is not text content. There are two types of properties:
Read-only properties: Their values are extracted and accessible, but cannot be modified when the document is re-generated with the filter writer.
Modifiable properties: Their values can be changed and it is
the modified values that are output by the filter
writer. For example, the HTML Filter extracts
attributes values to link modifiable properties.
Note that more complex information, or information that is provided after extraction, may also be associated with resources: the annotations.
A resource is an object associated with an event. It contains all the pre-parsed information the event comes with. Most events have a dedicated type of resource, but a few use the same type of resource, and some may have no corresponding resource at all.
See the Events and Corresponding Resources table for details.
All resources share a same
Some resources also implement the
A segment, in the context of the framework, is the result of a
segmentation process applied to extracted content, generally a
text unit. For example, it can be a single sentence, in
a text unit that is the content of an HTML
Segmentation services are provided through the
The skeleton is the non-textual part of a document. The function of a filter is to separate text content from skeleton parts. The skeleton parts are sent along with the filter events and can be used by the filter writers to construct an output in the format of the original input document.
Skeleton parts are usually left alone and simply carried through the
pipeline, but they are accessible with the
The tag type of a
object indicates if the inline code is a starting (
or placeholder (
code. For example a "
<b>" in HTML is a
TagType.OPENING, a "
</b>" is a
TagType.CLOSING, and a "
<br/>" is a
As opposed to inline code markers, tag types remain unchanged when you perform some splitting or merging operations on the coded text. This allows you to always know the real type of the inline code, regardless of its representation in the context of the segment.
See the section Tag Type and Marker for more details.
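The difference between tag type and marker can be illustrated with a short Java sketch. Everything here is an assumption made for the example: `TagTypeSketch`, its `Code` class, the `PLACEHOLDER` enum name, and the pairing logic are hypothetical and are not taken from the framework. The sketch assigns an opening or closing marker only when both codes of a pair (same non-null code type) are present in the same piece of text, and an isolated marker otherwise, while the tag type of each code is never touched.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration; names and pairing logic are assumptions.
class TagTypeSketch {
    enum TagType { OPENING, CLOSING, PLACEHOLDER }

    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';

    static final class Code {
        final TagType tagType;   // the real nature of the code, never changed
        final String type;       // code type used for pairing, for example "bold"

        Code(TagType tagType, String type) {
            this.tagType = tagType;
            this.type = type;
        }
    }

    // Picks a marker for each code of one segment, in order of appearance.
    static char[] markersFor(List<Code> codes) {
        char[] markers = new char[codes.size()];
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < codes.size(); i++) {
            Code code = codes.get(i);
            markers[i] = MARKER_ISOLATED;   // default until a partner is found
            if (code.tagType == TagType.OPENING) {
                open.push(i);
            } else if (code.tagType == TagType.CLOSING && !open.isEmpty()
                    && code.type != null
                    && code.type.equals(codes.get(open.peek()).type)) {
                markers[open.pop()] = MARKER_OPENING;
                markers[i] = MARKER_CLOSING;
            }
        }
        return markers;
    }

    public static void main(String[] args) {
        Code start = new Code(TagType.OPENING, "bold");
        // Segment that contains only the opening code of the pair.
        char[] markers = markersFor(List.of(start));
        System.out.println(markers[0] == MARKER_ISOLATED);   // marker changed
        System.out.println(start.tagType);                   // tag type unchanged
    }
}
```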
The text unit is the basic item used by a filter
to store extracted text and its associated information. It is implemented by the
which is the resource associated with the
The text unit is at the center of the text extraction mechanism. Usually it
corresponds to something like a paragraph. For example: the content of a
<p> element in HTML, or the text of the value of a key/value pair in a
Often, the content of a text unit needs to be broken down into smaller parts, for example sentences. This is the segmentation process, and each resulting part is a segment.
See the section Working with Text Units for more details on text units.
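To make the relation between a text unit and its segments concrete, here is a naive Java sketch that breaks paragraph-like content into sentences. `SegmentationSketch` is a hypothetical helper, and the regular-expression split is only a stand-in for real segmentation rules; it is not the framework's segmentation service and will mis-split abbreviations such as "e.g.".

```java
import java.util.Arrays;
import java.util.List;

// Naive stand-in for a segmenter; real segmentation rules are far richer.
class SegmentationSketch {
    // Splits after '.', '!' or '?' when followed by whitespace.
    static List<String> segment(String textUnitContent) {
        return Arrays.asList(textUnitContent.trim().split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String content = "First sentence. Second one! Third?";
        for (String segment : segment(content)) {
            System.out.println(segment);
        }
    }
}
```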