Okapi Framework - Developer's Guide: Glossary
An annotation, in a resource generated from a filter, is a piece of information associated with a specific class. An example of an annotation is the information provided in the <alt-trans> element of an XLIFF document. This data is mapped to an annotation implemented with a class called AltTransAnnotation.
Note that annotations can also be added after extraction and used to attach various process-specific information to an object.
Coded text is text content in which the inline codes have been pre-parsed and replaced by inline code markers.
The code type of a Code object is an indicator of what this code represents. Several code types are pre-defined and should be used whenever possible, for example: Code.TYPE_BOLD, Code.TYPE_ITALIC, Code.TYPE_UNDERLINED, Code.TYPE_IMAGE, Code.TYPE_LINK, etc.
The code type of a code is used extensively to link opening and closing codes: both the opening and the closing code of the same pair must have the same code type. You can use null for the code type, but you then lose any pairing mechanism.
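The role of the code type in pairing can be illustrated with a small stand-alone sketch. It does not use the Okapi classes: SimpleCode, countPairs, and the "bold"/"italic" type strings are hypothetical stand-ins, and the stack-based matching is a simplification chosen for illustration, not the framework's actual logic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: none of these names belong to the Okapi API.
final class CodePairingSketch {
    enum TagType { OPENING, CLOSING, PLACEHOLDER }

    /** Hypothetical stand-in for a code: only the tag type and code type matter here. */
    static final class SimpleCode {
        final TagType tagType;
        final String type; // may be null

        SimpleCode(TagType tagType, String type) {
            this.tagType = tagType;
            this.type = type;
        }
    }

    /** Counts the opening/closing pairs that can be linked through their code type. */
    static int countPairs(List<SimpleCode> codes) {
        Deque<SimpleCode> open = new ArrayDeque<>();
        int pairs = 0;
        for (SimpleCode code : codes) {
            if (code.type == null) {
                continue; // no code type: nothing to pair on
            }
            if (code.tagType == TagType.OPENING) {
                open.push(code);
            } else if (code.tagType == TagType.CLOSING
                    && !open.isEmpty()
                    && open.peek().type.equals(code.type)) {
                open.pop(); // same code type as the innermost opening code
                pairs++;
            }
        }
        return pairs;
    }
}
```

In this sketch, codes created with a null type are skipped entirely, which mirrors the point above: without a code type there is nothing to link an opening code to its closing code.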
An event is a unit of information that is carried through the pipeline. Events can be associated with physical data: the resources and, in the case of filter events, the skeleton.
The RAW_DOCUMENT event is used to pass along the information of an input document.
The filter events (START_DOCUMENT, START_SUBDOCUMENT, START_GROUP, DOCUMENT_PART, TEXT_UNIT, END_GROUP, END_SUBDOCUMENT, and END_DOCUMENT) are used to pass along a document broken down into translatable and non-translatable parts.
The CANCELED event is used when the process has been canceled.
The FINISHED event is used to mark the end of a set of input documents.
The NO_OP event is used in some cases to send events that do nothing.
The CUSTOM event is used for carrying custom resources through the pipeline.
As a general rule, when a component gets an event it does not understand, it should just pass it down the pipeline without modifying it.
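The pass-through rule can be sketched as follows. This is a stand-alone illustration, not Okapi code: the Event class, its payload field, and the handle method are hypothetical, and only the event names are taken from the descriptions above.

```java
import java.util.Locale;

// Illustrative sketch only: a hypothetical component that understands a single event type.
final class PassThroughSketch {
    enum EventType {
        RAW_DOCUMENT, START_DOCUMENT, START_SUBDOCUMENT, START_GROUP, DOCUMENT_PART,
        TEXT_UNIT, END_GROUP, END_SUBDOCUMENT, END_DOCUMENT,
        CANCELED, FINISHED, NO_OP, CUSTOM
    }

    /** Hypothetical event: a type plus a string standing in for the resource. */
    static final class Event {
        final EventType type;
        final String payload;

        Event(EventType type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Upper-cases TEXT_UNIT payloads and forwards every other event untouched. */
    static Event handle(Event event) {
        switch (event.type) {
            case TEXT_UNIT:
                return new Event(event.type, event.payload.toUpperCase(Locale.ROOT));
            default:
                return event; // not understood by this component: pass it along as-is
        }
    }
}
```

The default branch returns the very same object, so a downstream component receives events this component does not understand exactly as they were sent.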
A filter class is an implementation of the IFilter interface. Its purpose is to parse an input document and break it down into events. A filter separates the input document into different parts.
A filter writer class is an implementation of the IFilterWriter interface. Its purpose is to output the document processed by a filter. The format of the output can be different for each filter writer. There is normally one filter writer that is capable of re-constructing the original format of the input document. A filter provides an instance of such a filter writer if you call the IFilter.createFilterWriter() method.
An inline code is some type of markup inside a run of text. For example, "<b>" and "</b>" are two inline codes in the text "This is <b>bold</b>." Inline codes are often used to apply formatting, but they can be used for other things. What is an inline code and what is not depends on each filter and sometimes on its parameters.
An inline code marker is a pair of two special Unicode characters that are inside a coded text to hold the place of an inline code. The two special characters are:
The first character is the marker itself, which indicates whether the inline code is an opening (TextFragment.MARKER_OPENING), closing (TextFragment.MARKER_CLOSING), or isolated (TextFragment.MARKER_ISOLATED) code. These values are U+E101, U+E102, and U+E103 (part of the Private Use Area of Unicode).
The second character is the index of the corresponding Code object in the list of codes for the fragment where this code occurs. The value is the zero-based index of the code + 57616 converted to a character: that is, 0 is U+E110, 1 is U+E111, 2 is U+E112, etc. Those characters are in the Private Use Area of Unicode and allow for several thousand index values.
In the example below, the inline code markers for "<b>" and "</b>" are shown. (Note that, for display purposes, the special characters are written here as \uHHHH where HHHH is their Unicode value, but they are raw characters in the coded text string.)
Normal text: "Text in <b>bold</b>"
Coded text: "Text in \uE101\uE110bold\uE102\uE111"
"..in \uE101\uE110bold..."
      |     |
      |     +--- Index of the code
      |
      +--- Code marker
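The arithmetic behind the two characters can be checked with a short stand-alone sketch. The constants repeat the values given above (U+E101 to U+E103 and the 57616 offset, which is 0xE110); the class and method names are hypothetical and are not presented as part of the Okapi API.

```java
// Illustrative sketch only: constants mirror the values stated in the text,
// while the class and method names are hypothetical.
final class MarkerSketch {
    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';
    static final int INDEX_BASE = 57616; // 0xE110

    /** Converts a zero-based code index to its index character. */
    static char toIndexChar(int index) {
        return (char) (index + INDEX_BASE);
    }

    /** Converts an index character back to the zero-based code index. */
    static int toIndex(char indexChar) {
        return indexChar - INDEX_BASE;
    }

    static boolean isMarker(char c) {
        return c == MARKER_OPENING || c == MARKER_CLOSING || c == MARKER_ISOLATED;
    }

    /** Removes every marker and the index character after it, leaving plain text. */
    static String stripMarkers(String codedText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char c = codedText.charAt(i);
            if (isMarker(c)) {
                i++; // skip the index character that follows the marker
                continue;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String coded = "Text in " + MARKER_OPENING + toIndexChar(0)
                + "bold" + MARKER_CLOSING + toIndexChar(1);
        System.out.println(stripMarkers(coded));      // prints "Text in bold"
        System.out.println(toIndex(coded.charAt(9))); // prints 0
    }
}
```

Because each marker is always followed by exactly one index character, a scanner only has to recognize the three marker values and skip one extra character.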
Inline code markers are different from, but related to, tag types. Normally they match (i.e. a TagType.OPENING will be represented by a TextFragment.MARKER_OPENING). But there are cases when the opening and closing codes get split (for example in a segmented text) and the inline code marker is changed to TextFragment.MARKER_ISOLATED while the underlying code is still of the same tag type. See the section Tag Type and Marker for more details.
A pipeline is a set of steps that carry out a specific process for a given list of input documents. The most common pipelines start with a step that uses a filter to parse the input document and end with a step that uses a filter writer to create the output.
See the Pipelines section for more details on pipelines.
A property, in a resource generated from a filter, is a piece of information associated with a specific name. Properties are used to give access to simple data that is not text content. There are two types of properties:
Read-only properties: Their values are extracted and accessible, but cannot be modified when the document is re-generated with the filter writer.
Modifiable properties: Their values can be changed, and it is the modified values that are output by the filter writer. For example, the HTML Filter extracts href attribute values to link modifiable properties.
Note that more complex information, or information that is provided after extraction, may also be associated with resources: the annotations.
A resource is an object associated with an event. It contains all the pre-parsed information the event comes with. Most events have a dedicated type of resource, but a few use the same type of resource, and some may have no corresponding resource at all.
See the Events and Corresponding Resources table for details.
All resources share the same minimal interface: IResource. Some resources also implement the INameable or IReferenceable interfaces.
A segment, in the context of the framework, is the result of a segmentation process applied to extracted content, generally a text unit. For example, it can be a single sentence in a text unit that is the content of an HTML <p> element.
Segmentation services are provided through the ISegmenter interface.
The skeleton is the non-textual part of a document. The function of a filter is to separate text content from skeleton parts. The skeleton parts are sent along with the filter events and can be used by the filter writers to construct an output in the format of the original input document.
Skeleton parts are usually left alone and simply carried through the pipeline, but they are accessible with the IResource.getSkeleton() method.
The tag type of a Code object indicates if the inline code is a starting (TagType.OPENING), ending (TagType.CLOSING), or placeholder (TagType.PLACEHOLDER) code. For example, a "<b>" in HTML is a TagType.OPENING, a "</b>" is a TagType.CLOSING, and a "<br/>" is a TagType.PLACEHOLDER.
As opposed to inline code markers, tag types remain unchanged when you perform some splitting or merging operations on the coded text. This allows you to always know the real type of the inline code, regardless of its representation in the context of the segment.
See the section Tag Type and Marker for more details.
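The difference between a stable tag type and a context-dependent marker can be sketched as below. This is only an illustration under stated assumptions: SimpleCode, its id field (used here to tie an opening code to its closing code), and markerFor are hypothetical, and mapping a placeholder to the isolated marker is an assumption inferred from the "normally they match" rule rather than something the text states.

```java
import java.util.List;

// Illustrative sketch only: hypothetical names, not the Okapi implementation.
final class TagTypeMarkerSketch {
    enum TagType { OPENING, CLOSING, PLACEHOLDER }

    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';

    /** Hypothetical code: both codes of a pair share the same id (an assumption). */
    static final class SimpleCode {
        final TagType tagType;
        final int id;

        SimpleCode(TagType tagType, int id) {
            this.tagType = tagType;
            this.id = id;
        }
    }

    /** Picks the marker for one code, given the codes present in the same piece of text. */
    static char markerFor(SimpleCode code, List<SimpleCode> codesInSameText) {
        if (code.tagType == TagType.PLACEHOLDER) {
            return MARKER_ISOLATED; // assumption: placeholders use the isolated marker
        }
        TagType partner = (code.tagType == TagType.OPENING) ? TagType.CLOSING : TagType.OPENING;
        for (SimpleCode other : codesInSameText) {
            if (other.id == code.id && other.tagType == partner) {
                // The partner is in the same text: marker and tag type match.
                return (code.tagType == TagType.OPENING) ? MARKER_OPENING : MARKER_CLOSING;
            }
        }
        return MARKER_ISOLATED; // partner was split away: marker changes, tag type does not
    }
}
```

In this sketch, an opening code whose closing partner ends up in another segment gets the isolated marker, while its tagType field still reads OPENING.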
The text unit is the basic item used by a filter to store extracted text and its associated information. It is implemented by the TextUnit class, which is the resource associated with the TEXT_UNIT event.
The text unit is at the center of the text extraction mechanism. Usually it corresponds to something like a paragraph. For example: the content of a <p> element in HTML, or the text of the value of a key/value pair in a properties file.
Often, the content of a text unit needs to be broken down into smaller parts, for example sentences. This is the segmentation process, and each resulting part is a segment.
See the section Working with Text Units for more details on text units.