This page defines the main terms and concepts used across the documentation and help of the Okapi Framework.
Coded text is text content in which the inline codes have been pre-parsed, separated from the text, and replaced with placeholders.
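The idea can be illustrated with a minimal sketch. It assumes that anything shaped like an HTML tag is an inline code and that placeholders are written as {0}, {1}, and so on. Both choices, as well as the class name CodedTextSketch and its methods, are made up for this illustration and are not the framework's own representation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: CodedTextSketch and its members are hypothetical
// names, not classes from the Okapi Framework.
class CodedTextSketch {
    // Assumption: anything that looks like an opening or closing tag is an inline code.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    final String text;        // text with placeholders such as {0}, {1}
    final List<String> codes; // the original inline codes, in order of appearance

    CodedTextSketch(String text, List<String> codes) {
        this.text = text;
        this.codes = codes;
    }

    // Separate the inline codes from the text and leave numbered placeholders.
    static CodedTextSketch parse(String raw) {
        List<String> codes = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(raw);
        int last = 0;
        while (m.find()) {
            out.append(raw, last, m.start());                 // plain text before the code
            out.append('{').append(codes.size()).append('}'); // placeholder
            codes.add(m.group());                             // remember the code itself
            last = m.end();
        }
        out.append(raw, last, raw.length());
        return new CodedTextSketch(out.toString(), codes);
    }

    // Put the stored codes back in place of their placeholders.
    // Note: this naive version would also replace a literal "{0}" in the text.
    String restore() {
        String result = text;
        for (int i = 0; i < codes.size(); i++) {
            result = result.replace("{" + i + "}", codes.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        CodedTextSketch ct = parse("This is <b>bold</b>.");
        System.out.println(ct.text);      // This is {0}bold{1}.
        System.out.println(ct.codes);     // [<b>, </b>]
        System.out.println(ct.restore()); // This is <b>bold</b>.
    }
}
```

Keeping the codes in a separate list means that a component working on the text only needs to preserve the placeholders, not understand the markup.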
An event is a unit of information that is carried through the pipeline. Many events are associated with corresponding physical data: the resources. For example, the TEXT_UNIT event carries a text unit resource.
A filter is a component that separates an input document into different parts, some of which are translatable text or other localizable data. The filter generates a set of events that can be processed by other components.
See the "Filters" page for a list of filters available in the framework, and the formats supported.
An inline code is some type of markup inside a run of text. For example, "<b>" and "</b>" are two inline codes in the text "This is <b>bold</b>." Inline codes are often used to apply formatting, but they can be used for other purposes. What counts as an inline code depends on each filter and sometimes on its parameters.
A pipeline is a set of steps that carry out a specific process for a given list of input documents. The most common pipelines start with a step that uses a filter to parse the input document and end with a step that uses a filter writer to create the output.
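The flow described above can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not the framework's API: the types PipelineSketch, Event, and EventType and the methods filter, write, run, and upperCaseStep are all hypothetical, the "filter" treats each line of the input as one text unit, and a resource is reduced to a plain string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative sketch only: all types below are hypothetical and far simpler
// than the framework's own pipeline, step, filter, and event types.
class PipelineSketch {
    enum EventType { START_DOCUMENT, TEXT_UNIT, END_DOCUMENT }

    // An event has a type and, for some types, an associated resource.
    static final class Event {
        final EventType type;
        final String resource; // null when the event has no resource

        Event(EventType type, String resource) {
            this.type = type;
            this.resource = resource;
        }
    }

    // Stand-in for a filter: each line of the input becomes a TEXT_UNIT event.
    static List<Event> filter(String document) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(EventType.START_DOCUMENT, null));
        for (String line : document.split("\n")) {
            events.add(new Event(EventType.TEXT_UNIT, line));
        }
        events.add(new Event(EventType.END_DOCUMENT, null));
        return events;
    }

    // Stand-in for a filter writer: rebuilds the document from TEXT_UNIT events.
    static String write(List<Event> events) {
        StringBuilder out = new StringBuilder();
        for (Event e : events) {
            if (e.type == EventType.TEXT_UNIT) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(e.resource);
            }
        }
        return out.toString();
    }

    // Example step: upper-cases text units and passes other events through.
    static UnaryOperator<Event> upperCaseStep() {
        return e -> e.type == EventType.TEXT_UNIT
                ? new Event(e.type, e.resource.toUpperCase())
                : e;
    }

    // Send every event produced by the filter through each step in order,
    // then hand the resulting events to the writer.
    static String run(String document, List<UnaryOperator<Event>> steps) {
        List<Event> processed = new ArrayList<>();
        for (Event event : filter(document)) {
            Event current = event;
            for (UnaryOperator<Event> step : steps) {
                current = step.apply(current);
            }
            processed.add(current);
        }
        return write(processed);
    }

    public static void main(String[] args) {
        System.out.println(run("first line\nsecond line", List.of(upperCaseStep())));
    }
}
```

Because each step only sees events, the same step can be reused with any input format for which a filter exists.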
In this documentation, pipelines are usually represented like this:
A property, in a resource generated from a filter, is a piece of information associated with a specific name. Properties are used to give access to simple data that is not text content. There are two types of properties:
- Read-only properties: Their values are extracted and accessible, but cannot be modified when the document is re-generated with the filter writer.
- Modifiable properties: Their values can be changed, and it is the modified values that are output by the filter writer. For example, the HTML Filter extracts the values of href attributes to modifiable link properties.
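One possible way to model the two types is sketched below. The class PropertySketch is hypothetical, the property names "link" and "encoding" are only examples, and rejecting a change to a read-only property with an exception is an assumption of the sketch rather than documented behavior of the framework.

```java
// Illustrative sketch only: PropertySketch is a hypothetical class, not the
// framework's own property type.
class PropertySketch {
    private final String name;
    private String value;
    private final boolean readOnly;

    PropertySketch(String name, String value, boolean readOnly) {
        this.name = name;
        this.value = value;
        this.readOnly = readOnly;
    }

    String getName() {
        return name;
    }

    String getValue() {
        return value;
    }

    // Assumption of this sketch: changing a read-only property is rejected outright.
    void setValue(String newValue) {
        if (readOnly) {
            throw new UnsupportedOperationException(
                    "Property '" + name + "' is read-only");
        }
        value = newValue;
    }

    public static void main(String[] args) {
        // A modifiable property: the new value is what a writer would output.
        PropertySketch link = new PropertySketch("link", "page.html", false);
        link.setValue("page_fr.html");
        System.out.println(link.getName() + " = " + link.getValue());

        // A read-only property: the value can be read but not changed.
        PropertySketch encoding = new PropertySketch("encoding", "UTF-8", true);
        System.out.println(encoding.getName() + " = " + encoding.getValue());
    }
}
```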
A resource is an object associated with an event. It contains all the pre-parsed information the event comes with. Most events have a dedicated type of resource, but a few use the same type of resource, and some may have no corresponding resource at all.
All resources share the same minimal interface: IResource. Some resources also implement the
A segment, in the context of the framework, is the result of a segmentation process applied to extracted content, generally a text unit. For example, it can be a single sentence in a text unit that is the content of an HTML <p> element. Segmentation services are provided through the
The skeleton is the non-textual part of a document. The function of a filter is to separate text content from skeleton parts. The skeleton parts are sent along with the filter events and can be used by the filter writers to construct an output in the format of the original input document.
Skeleton parts are usually left alone and simply carried through the pipeline, but they are accessible with the
The text unit is the basic item used by a filter to store extracted text and its associated information. It is at the center of the text extraction mechanism. Usually it corresponds to something like a paragraph: for example, the content of a <p> element in HTML, or the text of the value of a key/value pair in a properties file. It corresponds roughly to a <trans-unit> in XLIFF 1.2.
Often, the content of a text unit needs to be broken down into smaller parts, for example sentences. This is the segmentation process, and each resulting part is a segment.
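As a rough illustration of segmentation, the sketch below splits the content of a text unit into sentences. It uses java.text.BreakIterator from the Java standard library as a stand-in for a segmenter. The framework's own segmentation services are not shown, and the class SegmentationSketch and its segment method are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: uses the JDK sentence BreakIterator as a stand-in
// segmenter; it is not the framework's segmentation mechanism.
class SegmentationSketch {
    // Split the content of a text unit into sentence-like segments.
    static List<String> segment(String textUnit, Locale locale) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(locale);
        boundaries.setText(textUnit);

        List<String> segments = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
                end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            // Drop the whitespace between sentences for readability.
            String segment = textUnit.substring(start, end).trim();
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String textUnit = "This is the first sentence. Here is the second one.";
        for (String segment : segment(textUnit, Locale.ENGLISH)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

Trimming the segments is a simplification: a real segmenter has to keep track of the whitespace between segments so that the original text unit can be reconstructed.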