Okapi Framework - Developer's Guide

Filters

- Filter Events
- Languages
- Encodings
- Line-Breaks
- Filter Parameters

Filter Events

A filter sends at least two events: START_DOCUMENT and END_DOCUMENT. All other filter events may or may not be sent, depending on the filter and the input document. The possible sequence of filter events can be expressed as follows:

         FilterEvents ::= START_DOCUMENT, DocumentContentEvents, END_DOCUMENT
DocumentContentEvents ::= (SubDocumentEvents | SimpleEvents)*
    SubDocumentEvents ::= START_SUBDOCUMENT, SimpleEvents, END_SUBDOCUMENT
         SimpleEvents ::= (GroupEvents | TEXT_UNIT | DOCUMENT_PART)*
          GroupEvents ::= START_GROUP, SimpleEvents, END_GROUP
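
The grammar above can be checked mechanically. The following is a minimal sketch, not part of the framework: the FilterEventType enum and the EventSequenceChecker class are hypothetical stand-ins written for this illustration, and a real pipeline would inspect the event type of each event it receives instead. It uses a stack to verify that groups nest properly and that sub-documents appear only at the top level of the document.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the framework's event types (illustration only). */
enum FilterEventType {
    START_DOCUMENT, END_DOCUMENT,
    START_SUBDOCUMENT, END_SUBDOCUMENT,
    START_GROUP, END_GROUP,
    TEXT_UNIT, DOCUMENT_PART
}

/** Hypothetical helper that checks a sequence of events against the grammar. */
final class EventSequenceChecker {

    private EventSequenceChecker() {}

    /** Returns true if the events form a complete, well-nested sequence. */
    static boolean isValid(List<FilterEventType> events) {
        // A valid sequence starts with START_DOCUMENT and ends with END_DOCUMENT.
        if (events.size() < 2
                || events.get(0) != FilterEventType.START_DOCUMENT
                || events.get(events.size() - 1) != FilterEventType.END_DOCUMENT) {
            return false;
        }
        // Stack of the START_SUBDOCUMENT and START_GROUP events still open.
        Deque<FilterEventType> open = new ArrayDeque<>();
        for (int i = 1; i < events.size() - 1; i++) {
            FilterEventType type = events.get(i);
            switch (type) {
                case START_SUBDOCUMENT:
                    // Sub-documents cannot be nested or placed inside a group.
                    if (!open.isEmpty()) return false;
                    open.push(type);
                    break;
                case START_GROUP:
                    open.push(type);
                    break;
                case END_SUBDOCUMENT:
                    if (open.isEmpty() || open.pop() != FilterEventType.START_SUBDOCUMENT) return false;
                    break;
                case END_GROUP:
                    if (open.isEmpty() || open.pop() != FilterEventType.START_GROUP) return false;
                    break;
                case TEXT_UNIT:
                case DOCUMENT_PART:
                    break;
                default:
                    // START_DOCUMENT or END_DOCUMENT in the middle of the sequence.
                    return false;
            }
        }
        // Every opened sub-document and group must have been closed.
        return open.isEmpty();
    }
}
```

For example, EventSequenceChecker.isValid() accepts START_DOCUMENT, START_GROUP, TEXT_UNIT, END_GROUP, END_DOCUMENT, and rejects a sequence in which END_DOCUMENT arrives while a group is still open.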

START_DOCUMENT

The START_DOCUMENT event is sent as the first event for the document. It is associated with a StartDocument resource that contains general information about the document. Each StartDocument resource is expected to have at least the following information:

Each START_DOCUMENT must have a corresponding END_DOCUMENT event sent as the last event for this document.

END_DOCUMENT

The END_DOCUMENT event is sent to close a previous START_DOCUMENT event. It is associated with an Ending resource.

START_SUBDOCUMENT

The START_SUBDOCUMENT event may be sent by a filter when the input document is composed of several separate logical parts. For example, an IDML document (InDesign file) is really a ZIP file that may contain dozens of different XML documents (stories) that may have translatable text: each one is a sub-document. Another example is an XLIFF document: it may be composed of several <file> elements, each corresponding to a separate sub-document.

END_SUBDOCUMENT

The END_SUBDOCUMENT event is sent to close a previous START_SUBDOCUMENT event. It is associated with an Ending resource.

START_GROUP

The START_GROUP event may be sent by a filter to indicate the start of some logical grouping of events, for example to indicate the start of a <table>, a <script>, or a <style> element in an HTML document, or a dialog box in a Windows RC file.

A group may contain other groups. Each START_GROUP event must have a corresponding END_GROUP event, and groups must not overlap. The START_GROUP event is associated with a StartGroup resource.

END_GROUP

The END_GROUP event is sent to close a previous START_GROUP event. It is associated with an Ending resource.

DOCUMENT_PART

The DOCUMENT_PART event may be sent by a filter to carry parts of the original document that do not contain directly translatable text. It is associated with a DocumentPart resource. Note that a DocumentPart may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties. All translatable text is always passed through using the TEXT_UNIT event.

TEXT_UNIT

The TEXT_UNIT event may be sent by a filter to carry parts of the original document that have extractable text. Note that a TextUnit may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties.

A TextUnit resource provides various information related to the extracted text:

Filters should make sure that the text units they create can have text added or removed anywhere within the text unit, including before the first inline code and after the last inline code.

Languages

The documents a filter processes can be monolingual or multilingual. The filter is responsible for knowing if it processes a monolingual or a multilingual document.

Note: Locale and Languages. Nowadays there is not much difference between a language code and a locale code, because the BCP-47 language tags include sub-tags that represent various regional or special variants, as well as script differences. For example, es-419 stands for Spanish for Latin America and the Caribbean, and zh-Hant-TW for Traditional Chinese as used in Taiwan. For more information about BCP-47 see http://www.w3.org/International/articles/bcp47. The terms locale and language are sometimes used interchangeably in this documentation.
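
To make the structure of such tags concrete, here is a small sketch that uses only the JDK class java.util.Locale, not the framework's own locale handling. The LanguageTagSketch class and its method names are hypothetical, chosen for this illustration. It shows that a BCP-47 tag carries language, script, and region sub-tags, and that letter case is not significant when a tag is parsed.

```java
import java.util.Locale;

/** Hypothetical helper showing how BCP-47 tags break down into sub-tags. */
final class LanguageTagSketch {

    private LanguageTagSketch() {}

    /** Returns the tag with its conventional casing, as produced by the JDK. */
    static String canonicalTag(String tag) {
        return Locale.forLanguageTag(tag).toLanguageTag();
    }

    /** Returns the sub-tags as "language|script|region"; missing parts are empty. */
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        return locale.getLanguage() + "|" + locale.getScript() + "|" + locale.getCountry();
    }
}
```

With this sketch, describe("zh-Hant-tw") separates the tag into the language zh, the script Hant, and the region TW, while describe("es-419") yields a language and a numeric region with no script.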

Monolingual Documents

Monolingual documents have their content in a single main language/locale. In such documents any target data replaces the source data. Examples of monolingual documents are Java properties files and OpenDocument files. Note that such documents may contain text in different languages (like citations), but their structure is designed to have a single main language.

Before starting to send events:

When sending the START_DOCUMENT event:

At any time when sending an event:

Multilingual Documents

Multilingual documents have their content in several languages. They have a structure that is designed to hold the same content in different languages. The actual input document may contain only the source language content, but any target data is placed alongside the source data rather than over it, so the source data is never overwritten. Examples of multilingual documents are PO files and XLIFF files.

Before starting to send events:

When the filter sends the START_DOCUMENT event:

At any time when sending an event:

Encodings

Input

The filter is responsible for detecting the encoding of the input document when possible. If the encoding cannot be detected, the filter should use the default encoding the caller has provided when opening the document.

If the input encoding is UTF-8, the filter must handle the possible presence of a Byte-Order-Mark and, if one exists, not include it in the content of the document.
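
The Byte-Order-Mark handling can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the Utf8BomSketch class and its methods are hypothetical, and the input is assumed to be already available as a byte array. The UTF-8 Byte-Order-Mark is the three-byte sequence EF BB BF.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that decodes UTF-8 input without keeping the Byte-Order-Mark. */
final class Utf8BomSketch {

    // The UTF-8 encoding of U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8BomSketch() {}

    /** Returns true if the data starts with the UTF-8 Byte-Order-Mark. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Decodes the data as UTF-8, skipping the Byte-Order-Mark if one is present. */
    static String decodeUtf8(byte[] data) {
        int offset = hasUtf8Bom(data) ? UTF8_BOM.length : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

A filter written along these lines could remember the result of hasUtf8Bom() so that the information is not lost once the mark has been removed from the content.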

When sending the START_DOCUMENT event:

Output

The encoding of the output generated by a writer (IFilterWriter object) is not necessarily the same as the input encoding. The encoding to use is set by the caller through the IFilterWriter.setOption() method. If the output encoding specified by that method is null, the filter should use the same encoding as the input document.
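
The fallback rule for the output encoding can be sketched in a few lines. The OutputEncodingSketch class and the resolveOutputEncoding() method are hypothetical names used only for this illustration; they are not part of the IFilterWriter interface.

```java
import java.nio.charset.Charset;

/** Hypothetical helper that picks the encoding to use for the output. */
final class OutputEncodingSketch {

    private OutputEncodingSketch() {}

    /**
     * Returns the encoding requested by the caller, or the encoding of the
     * input document when the caller did not specify one (null).
     */
    static Charset resolveOutputEncoding(String requestedEncoding, Charset inputEncoding) {
        return requestedEncoding == null ? inputEncoding : Charset.forName(requestedEncoding);
    }
}
```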

The writer should create the output document only when or after receiving the START_DOCUMENT event. The StartDocument resource contains information such as:

Line-Breaks

Input

The filter is responsible for detecting the type of line-break of the input document when possible. If the type of line-break cannot be detected (for example the input document has no line-breaks), the filter should assume the type of line-break to use is the one of the current platform.
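
As an illustration of this rule, the sketch below returns the first line-break found in the input and falls back to the platform line-break when there is none. The LineBreakDetectionSketch class is a hypothetical name, and looking only at the first line-break is a simplifying assumption: a real filter may need a different strategy for documents that mix several types.

```java
/** Hypothetical helper that detects the type of line-break used in a text. */
final class LineBreakDetectionSketch {

    private LineBreakDetectionSketch() {}

    /** Returns "\r\n", "\n" or "\r", or the platform line-break if none is found. */
    static String detectLineBreak(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                // A carriage return followed by a line feed is a single Windows line-break.
                boolean followedByLf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return followedByLf ? "\r\n" : "\r";
            }
        }
        // No line-break in the input: assume the one of the current platform.
        return System.lineSeparator();
    }
}
```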

Output

The writer must not change the type of line-break of a document. The original type of line-break is available in the StartDocument resource (StartDocument.getLineBreak()).

The reason for always using the original line-breaks in the output when using the filters is that the writer cannot change the line-breaks in the skeleton parts. If different types of line-breaks were used in the skeleton and in the extracted content, the output would end up with a mix of line-break types.
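
One way a writer could honor this rule is sketched below. It assumes that the extracted content may contain line-breaks of any type by the time it reaches the writer, and it takes the original line-break as a plain string such as the one returned by StartDocument.getLineBreak(). The LineBreakOutputSketch class and its method are hypothetical names for this illustration.

```java
/** Hypothetical helper that rewrites extracted content with the original line-break type. */
final class LineBreakOutputSketch {

    private LineBreakOutputSketch() {}

    /** Converts every line-break in the content to the original type of the document. */
    static String toOriginalLineBreaks(String content, String originalLineBreak) {
        // First reduce all line-break types to "\n" so nothing is converted twice.
        String normalized = content.replace("\r\n", "\n").replace('\r', '\n');
        if (originalLineBreak.equals("\n")) {
            return normalized;
        }
        return normalized.replace("\n", originalLineBreak);
    }
}
```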

The TextFragment class offers methods to manipulate the text and codes.

Filter Parameters

A filter may have specific parameters associated with it. These parameters are specific to each filter and indicate various specialized processing options.

A filter that has parameters must always generate a default set of parameters upon creation. The IFilter.getParameters() method may be called at any time after the object has been created.

The filter parameters must be accessible as an object that implements the IParameters interface. The IFilter.getParameters() and IFilter.setParameters() methods allow you to get and set the parameters of each filter.

Some filters may have a way to edit the parameters in a dialog box, through the IParametersEditor interface. If no such editor exists for a given filter, the file where the parameters are stored must be editable with a simple text editor. That is why the format for storing filter parameters must be some kind of text-based format (e.g. a properties file).

The code below shows how to use the filter parameters:

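import net.sf.okapi.common.IParameters;
import net.sf.okapi.common.IParametersEditor;
import net.sf.okapi.common.filters.IFilter;
import net.sf.okapi.filters.properties.PropertiesFilter;

// Create the filter we will use
IFilter filter = new PropertiesFilter();
// Get the default parameters (null if the filter has no parameters)
IParameters params = filter.getParameters();
if ( params != null ) {
   // Save the default parameters to a text file
   params.save("defaultParameters.txt");
   // Create the editor class
   IParametersEditor editor = new net.sf.okapi.filters.ui.properties.Editor();
   // Save the edited parameters only if edit() reports success
   if ( editor.edit(params, null, null, null) ) {
      params.save("editedParameters.txt");
   }
}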

Note that the editor classes may be in a different package than the filter itself, as the UI may be platform-dependent while the filter is not.
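
The requirement that parameters be stored in a text-based format can be illustrated without the framework. The sketch below is not an implementation of IParameters: the TextParametersSketch class, its methods, and the parameter names extractComments and escapeExtendedChars are all hypothetical, and java.util.Properties is used only as one possible text format. It shows a default set of values created with the object and a round trip through plain text that a user could also edit by hand.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical parameters holder that stores its values as plain text. */
final class TextParametersSketch {

    private final Properties values = new Properties();

    TextParametersSketch() {
        // Defaults are available as soon as the object is created.
        // Both parameter names are made up for this example.
        values.setProperty("extractComments", "false");
        values.setProperty("escapeExtendedChars", "true");
    }

    String get(String name) {
        return values.getProperty(name);
    }

    void set(String name, String value) {
        values.setProperty(name, value);
    }

    /** Serializes the parameters as key=value lines. */
    String toText() {
        try {
            StringWriter out = new StringWriter();
            values.store(out, null);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replaces the current parameters with the ones read from the text. */
    void fromText(String text) {
        try {
            values.clear();
            values.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the stored form is a list of key=value lines, a user without a dedicated editor can still change a value with any text editor and have the modified text loaded again with fromText().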