Okapi Framework - Developer's Guide

Glossary

Annotation

An annotation, in a resource generated from a filter, is a piece of information associated with a specific class. An example of an annotation is the information provided in the <alt-trans> element of an XLIFF document. This data is mapped to an annotation implemented by a class called AltTransAnnotation.

Note that annotations can also be added after extraction and used to attach various process-specific information to an object.
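
The idea of looking up an annotation by its class can be sketched as follows. This is a minimal, self-contained illustration and not the framework's own code: AnnotatedObject, NoteAnnotation, ScoreAnnotation, and the setAnnotation/getAnnotation methods shown here are hypothetical names used only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: annotations stored and retrieved by their class.
class AnnotationSketch {

    // Two made-up annotation types carrying process-specific data.
    static final class NoteAnnotation {
        final String text;
        NoteAnnotation(String text) { this.text = text; }
    }

    static final class ScoreAnnotation {
        final int score;
        ScoreAnnotation(int score) { this.score = score; }
    }

    // A stand-in for any object that can carry annotations.
    static final class AnnotatedObject {
        private final Map<Class<?>, Object> annotations = new HashMap<>();

        // Stores the annotation under its own class, replacing any previous one.
        void setAnnotation(Object annotation) {
            annotations.put(annotation.getClass(), annotation);
        }

        // Returns the annotation of the given class, or null if there is none.
        <A> A getAnnotation(Class<A> type) {
            return type.cast(annotations.get(type));
        }
    }

    public static void main(String[] args) {
        AnnotatedObject obj = new AnnotatedObject();
        obj.setAnnotation(new NoteAnnotation("added after extraction"));
        System.out.println(obj.getAnnotation(NoteAnnotation.class).text);
        System.out.println(obj.getAnnotation(ScoreAnnotation.class));
    }
}
```

Keying on the class means each object holds at most one annotation of a given type in this sketch, and callers get a typed result without casting.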

Coded text

A coded text is text content in which inline codes have been pre-parsed and replaced by inline code markers.

Code type

The code type of a Code object is an indicator of what this code represents. Several code types are pre-defined and should be used whenever possible. For example: Code.TYPE_BOLD, Code.TYPE_ITALIC, Code.TYPE_UNDERLINED, Code.TYPE_IMAGE, Code.TYPE_LINK, etc.

The code type of a code is used extensively to link opening and closing codes: both the opening and the closing code of the same pair must have the same code type. You can use null for the code type, but you then lose any pairing mechanism.
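
The pairing rule can be illustrated with a small sketch. It is not the framework's implementation: SimpleCode, the Tag enum, the pair method, and the type strings "bold" and "italic" are hypothetical stand-ins, and the stack-based matching is just one plausible way to link codes of the same type. Codes with a null type are left unpaired, as described above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: linking opening and closing codes by code type.
class CodePairingSketch {

    enum Tag { OPENING, CLOSING, PLACEHOLDER }

    // Simplified stand-in for an inline code; type may be null.
    static final class SimpleCode {
        final Tag tag;
        final String type;
        SimpleCode(Tag tag, String type) { this.tag = tag; this.type = type; }
    }

    // Returns, for each code, the index of its partner, or -1 if it has none.
    static int[] pair(List<SimpleCode> codes) {
        int[] partner = new int[codes.size()];
        Arrays.fill(partner, -1);
        // One stack of pending opening codes per code type.
        Map<String, Deque<Integer>> pending = new HashMap<>();
        for (int i = 0; i < codes.size(); i++) {
            SimpleCode code = codes.get(i);
            if (code.type == null || code.tag == Tag.PLACEHOLDER) {
                continue; // no type (or a placeholder): nothing to pair
            }
            if (code.tag == Tag.OPENING) {
                pending.computeIfAbsent(code.type, k -> new ArrayDeque<>()).push(i);
            } else {
                Deque<Integer> stack = pending.get(code.type);
                if (stack != null && !stack.isEmpty()) {
                    int opening = stack.pop();
                    partner[i] = opening;
                    partner[opening] = i;
                }
            }
        }
        return partner;
    }

    public static void main(String[] args) {
        List<SimpleCode> codes = Arrays.asList(
            new SimpleCode(Tag.OPENING, "bold"),
            new SimpleCode(Tag.OPENING, "italic"),
            new SimpleCode(Tag.CLOSING, "italic"),
            new SimpleCode(Tag.CLOSING, "bold"),
            new SimpleCode(Tag.OPENING, null),
            new SimpleCode(Tag.CLOSING, null));
        System.out.println(Arrays.toString(pair(codes)));
    }
}
```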

Event

An event is a unit of information that is carried through the pipeline. Events can be associated with physical data: the resources and, in the case of filter events, the skeleton.

As a general rule, when a component gets an event it does not understand, it should just pass it down the pipeline, without modifying it.
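
The pass-through rule can be sketched as follows. The types here are hypothetical and deliberately simplified (SimpleEvent, Kind, and handle are not part of the framework); the point is only that a component transforms the events it understands and returns every other event untouched.

```java
import java.util.Locale;

// Hypothetical sketch: a component that only understands one kind of event.
class PassThroughSketch {

    // Made-up event kinds for the illustration.
    enum Kind { START_DOCUMENT, TEXT_UNIT, END_DOCUMENT }

    static final class SimpleEvent {
        final Kind kind;
        final String payload;
        SimpleEvent(Kind kind, String payload) { this.kind = kind; this.payload = payload; }
    }

    // Upper-cases the payload of TEXT_UNIT events; passes all others along unchanged.
    static SimpleEvent handle(SimpleEvent event) {
        if (event.kind == Kind.TEXT_UNIT) {
            return new SimpleEvent(Kind.TEXT_UNIT, event.payload.toUpperCase(Locale.ROOT));
        }
        return event; // not understood: same instance, no modification
    }

    public static void main(String[] args) {
        SimpleEvent start = new SimpleEvent(Kind.START_DOCUMENT, "doc1");
        SimpleEvent text = new SimpleEvent(Kind.TEXT_UNIT, "hello");
        System.out.println(handle(start) == start);
        System.out.println(handle(text).payload);
    }
}
```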

Filter

A filter class is an implementation of the IFilter interface. Its purpose is to parse an input document and break it down into events. A filter separates the input document into different parts:

Filter writer

A filter writer class is an implementation of the IFilterWriter interface. Its purpose is to output the document processed by a filter. The format of the output can be different for each filter writer. There is normally one filter writer that is capable of re-constructing the original format of the input document. A filter provides an instance of such a filter writer if you call the IFilter.createFilterWriter() method.

Inline code

An inline code is some type of markup inside a run of text. For example, "<b>" and "</b>" are two inline codes in the text "This is <b>bold</b>." Inline codes are often used to apply formatting, but they can be used for other things. What is an inline code and what is not depends on each filter and sometimes on its parameters.

Inline code marker

An inline code marker is a pair of special Unicode characters placed inside a coded text to hold the place of an inline code. The two special characters are the code marker itself, followed by the index of the code.

The example below shows the inline code markers for "<b>" and "</b>". (Note: for display purposes, the special characters are written here as \uHHHH, where HHHH is their Unicode value, but they are raw characters in the coded text string.)

Normal text: "Text in <b>bold</b>"
 Coded text: "Text in \uE101\uE110bold\uE102\uE111"
"..in \uE101\uE110bold..."
        |     |
        |     +--- Index of the code
        |
        +--- Code marker
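
The layout shown above can be read back with a short sketch. The constants are taken from the example only: U+E101 for an opening marker, U+E102 for a closing marker, and U+E110 as the base added to the code index. The class and method names (CodedTextSketch, describe, plainText) are hypothetical; production code would use the constants and methods of the framework's own classes rather than hard-coded values.

```java
// Hypothetical sketch: reading inline code markers out of a coded text string.
class CodedTextSketch {

    // Values as they appear in the example; treat them as assumptions.
    static final char OPENING = '\uE101';
    static final char CLOSING = '\uE102';
    static final int INDEX_BASE = 0xE110;

    static boolean isMarker(char ch) {
        return ch == OPENING || ch == CLOSING;
    }

    // Replaces each marker pair with a readable label such as [open#0].
    static String describe(String codedText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char ch = codedText.charAt(i);
            if (isMarker(ch) && i + 1 < codedText.length()) {
                int index = codedText.charAt(i + 1) - INDEX_BASE;
                out.append(ch == OPENING ? "[open#" : "[close#").append(index).append(']');
                i++; // skip the index character
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    // Drops every marker pair and keeps only the text.
    static String plainText(String codedText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char ch = codedText.charAt(i);
            if (isMarker(ch) && i + 1 < codedText.length()) {
                i++; // skip the index character
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String coded = "Text in \uE101\uE110bold\uE102\uE111";
        System.out.println(describe(coded));  // Text in [open#0]bold[close#1]
        System.out.println(plainText(coded)); // Text in bold
    }
}
```

Because each marker occupies two characters, any code that walks a coded text character by character has to skip the index character after a marker, as both methods do here.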

Inline code markers are different from, but related to, tag types. Normally they match (i.e. a TagType.OPENING will be represented by a TextFragment.MARKER_OPENING). But there are cases when the opening and closing codes get split (for example in a segmented text) and the inline code marker is changed to TextFragment.MARKER_ISOLATED while the underlying code is still of the same tag type. See the section Tag Type and Marker for more details.

Pipeline

A pipeline is a set of steps that carry out a specific process for a given list of input documents. The most common pipelines start with a step that uses a filter to parse the input document and end with a step that uses a filter writer to create the output.

See the Pipelines section for more details on pipelines.

Properties

A property, in a resource generated from a filter, is a piece of information associated with a specific name. Properties are used to give access to simple data that is not text content. There are two types of properties:

Note that more complex information, or information that is provided after extraction, may also be associated with resources by using annotations.

Resource

A resource is an object associated with an event. It contains all the pre-parsed information the event comes with. Most events have a dedicated type of resource, but a few use the same type of resource, and some may have no corresponding resource at all.

See the Events and Corresponding Resources table for details.

All resources share the same minimal interface: IResource. Some resources also implement the INameable or IReferenceable interfaces.

Segment

A segment, in the context of the framework, is the result of a segmentation process applied to extracted content, generally a text unit. For example, it can be a single sentence in a text unit that is the content of an HTML <p> element. Segmentation services are provided through the ISegmenter interface.

Skeleton

The skeleton is the non-textual part of a document. The function of a filter is to separate text content from skeleton parts. The skeleton parts are sent along with the filter events and can be used by the filter writers to construct an output in the format of the original input document.

Skeleton parts are usually left alone and simply carried through the pipeline, but they are accessible with the IResource.getSkeleton() method.

Tag type

The tag type of a Code object indicates whether the inline code is a starting (TagType.OPENING), ending (TagType.CLOSING), or placeholder (TagType.PLACEHOLDER) code. For example, a "<b>" in HTML is a TagType.OPENING, a "</b>" is a TagType.CLOSING, and a "<br/>" is a TagType.PLACEHOLDER.

As opposed to inline code markers, tag types remain unchanged when you perform some splitting or merging operations on the coded text. This allows you to always know the real type of the inline code, regardless of its representation in the context of the segment.
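
The difference between the two notions can be sketched as follows. This is an illustration under stated assumptions, not the framework's logic: the Marker enum and the markerFor method are hypothetical, and the treatment of placeholder codes as isolated markers is an assumption made for this example. The tag type is an input that never changes, while the marker depends on whether the partner code is in the same piece of text.

```java
// Hypothetical sketch: the marker depends on context, the tag type does not.
class MarkerVsTagTypeSketch {

    enum TagType { OPENING, CLOSING, PLACEHOLDER }

    // Made-up stand-ins for the kinds of inline code markers.
    enum Marker { OPENING, CLOSING, ISOLATED }

    // Chooses the marker for a code, given whether its partner code
    // is located in the same segment.
    static Marker markerFor(TagType tagType, boolean partnerInSameSegment) {
        if (tagType == TagType.PLACEHOLDER || !partnerInSameSegment) {
            return Marker.ISOLATED; // assumption: placeholders are also isolated
        }
        return tagType == TagType.OPENING ? Marker.OPENING : Marker.CLOSING;
    }

    public static void main(String[] args) {
        // Before splitting: "<b>" and "</b>" are in the same text.
        System.out.println(markerFor(TagType.OPENING, true));
        // After splitting: "</b>" ended up in another segment.
        System.out.println(markerFor(TagType.OPENING, false));
    }
}
```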

See the section Tag Type and Marker for more details.

Text unit

The text unit is the basic item used by a filter to store extracted text and its associated information. It is implemented by the TextUnit class, which is the resource associated with the TEXT_UNIT event. The text unit is at the center of the text extraction mechanism. Usually it corresponds to something like a paragraph. For example: the content of a <p> element in HTML, or the text of the value of a key/value pair in a properties file.

Often, the content of a text unit needs to be broken down into smaller parts, for example sentences. This is the segmentation process, and each resulting part is a segment.

See the section Working with Text Units for more details on text units.