Okapi Framework - Developer's Guide

Filters

- Filter Events
- Languages
- Encodings
- Line-Breaks
- Filter Parameters

Filter Events

A filter sends at least two events: START_DOCUMENT and END_DOCUMENT. All other filter events may or may not be sent, depending on the filter and the input document. The possible sequences of filter events can be expressed as follows:

         FilterEvents ::= START_DOCUMENT, DocumentContentEvents, END_DOCUMENT
DocumentContentEvents ::= (SubDocumentEvents | SimpleEvents)*
    SubDocumentEvents ::= START_SUBDOCUMENT, SimpleEvents, END_SUBDOCUMENT
         SimpleEvents ::= (GroupEvents | TEXT_UNIT | DOCUMENT_PART)*
          GroupEvents ::= START_GROUP, SimpleEvents, END_GROUP
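The grammar above can be checked mechanically with a small stack of currently open events. The following is a minimal sketch, not part of the framework: the EventType enum is a hypothetical stand-in that only mirrors the event names used in this guide, and EventSequenceChecker is an illustrative name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the filter event types named in this guide.
enum EventType {
    START_DOCUMENT, END_DOCUMENT,
    START_SUBDOCUMENT, END_SUBDOCUMENT,
    START_GROUP, END_GROUP,
    TEXT_UNIT, DOCUMENT_PART
}

// Illustrative checker: returns true if the events follow the grammar above.
final class EventSequenceChecker {
    static boolean isValid(List<EventType> events) {
        Deque<EventType> open = new ArrayDeque<>(); // currently open START_* events
        boolean closed = false;                     // true once END_DOCUMENT was seen
        for (EventType e : events) {
            if (closed) return false;               // nothing may follow END_DOCUMENT
            switch (e) {
                case START_DOCUMENT:
                    if (!open.isEmpty()) return false; // only allowed as first event
                    open.push(e);
                    break;
                case START_SUBDOCUMENT:
                    // Sub-documents sit directly under the document, never in a group
                    if (open.peek() != EventType.START_DOCUMENT) return false;
                    open.push(e);
                    break;
                case START_GROUP:
                    if (open.isEmpty()) return false;  // groups may nest anywhere inside
                    open.push(e);
                    break;
                case END_GROUP:
                    if (open.peek() != EventType.START_GROUP) return false;
                    open.pop();
                    break;
                case END_SUBDOCUMENT:
                    if (open.peek() != EventType.START_SUBDOCUMENT) return false;
                    open.pop();
                    break;
                case END_DOCUMENT:
                    if (open.peek() != EventType.START_DOCUMENT) return false;
                    open.pop();
                    closed = true;
                    break;
                default: // TEXT_UNIT and DOCUMENT_PART
                    if (open.isEmpty()) return false;
            }
        }
        return closed;
    }
}
```

A check like this may be useful in unit tests for a new filter: collect the event types the filter produces for a sample document and pass them to isValid().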

START_DOCUMENT

The START_DOCUMENT event is sent as the first event for the document. It is associated with a StartDocument resource that contains general information about the document. Each StartDocument resource is expected to provide at least the following information:

getName() gives the full path or URI of the document, if possible.
getEncoding() gives the encoding that is being used to read the document. It may differ from the default encoding provided when the document is opened, for example when the filter can automatically detect the actual encoding of the document.
hasUTF8BOM() indicates if the input document is encoded in UTF-8 and has a Byte-Order-Mark.
getLocale() gives the code of the source locale of the document (the same as the one specified when opening the document).
isMultilingual() indicates if the document is multilingual (See more on monolingual and multilingual in the Languages section).
getLineBreak() gives the type of line-break used in the document.
getFilterParameters() gives the parameters used for processing this document (including if they are the default parameters). It may return null if the filter does not use parameters at all.
getFilter() gives the filter that is being used to parse this document.

Each START_DOCUMENT must have a corresponding END_DOCUMENT event sent as the last event for this document.

END_DOCUMENT

The END_DOCUMENT event is sent to close a previous START_DOCUMENT event. It is associated with an Ending resource.

START_SUBDOCUMENT

The START_SUBDOCUMENT event may be sent by a filter when the input document is composed of several separate logical parts. For example, an IDML document (InDesign file) is really a ZIP file that may contain dozens of different XML documents (stories) that may have translatable text: each one is a sub-document. Another example is an XLIFF document: it may be composed of several <file> elements, each corresponding to a separate sub-document.

END_SUBDOCUMENT

The END_SUBDOCUMENT event is sent to close a previous START_SUBDOCUMENT event. It is associated with an Ending resource.

START_GROUP

The START_GROUP event may be sent by a filter to indicate the start of some logical grouping of events, for example to indicate the start of a <table>, a <script>, or a <style> element in an HTML document, or a dialog box in a Windows RC file.

The START_GROUP event is associated with a StartGroup resource. A group may contain other groups. Each START_GROUP must have a corresponding END_GROUP event, and groups must not overlap.

END_GROUP

The END_GROUP event is sent to close a previous START_GROUP event. It is associated with an Ending resource.

DOCUMENT_PART

The DOCUMENT_PART event may be sent by a filter to carry parts of the original document that do not contain directly translatable text. It is associated with a DocumentPart resource. Note that a DocumentPart may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties. All translatable text is always passed through using the TEXT_UNIT event.

TEXT_UNIT

The TEXT_UNIT event may be sent by a filter to carry parts of the original document that have extractable text. It is associated with a TextUnit resource. Note that a TextUnit may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties.

A TextUnit resource provides various information related to the extracted text:

getName() gives the original identifier of the resource, for example the key of an entry in a Java properties file. If the text unit has no such identifier, this method should return null.
getId() gives the unique extraction ID for this text unit. This value is filter-specific and is only meaningful to the filter. It can be sequential or not, contiguous or not, made of numbers or names: basically anything. Some filters may return the same value for getId() and getName(), but the two values represent different things.
isTranslatable() indicates if this text unit is translatable.
getMimeType() gives the type of content the text unit contains.
preserveWhitespaces() indicates if the white spaces inside the content of the text unit must be preserved (for example, as in the content of an HTML <pre> element).

Filters should make sure that the text units they create can have text added or removed anywhere within the text unit, including before the first inline code and after the last inline code.

Languages

The documents a filter processes can be monolingual or multilingual. The filter is responsible for knowing if it processes a monolingual or a multilingual document.

Note: Locales and Languages. Nowadays there is not much difference between a language code and a locale code, as BCP 47 language tags include sub-tags that represent various regional or special variants, as well as script differences. For example, es-419 stands for Spanish for Latin America and the Caribbean, zh-Hant-TW for Traditional Chinese as used in Taiwan, etc. For more information about BCP 47 see http://www.w3.org/International/articles/bcp47. The terms locale and language are sometimes used interchangeably in this documentation.
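To see how such tags break down into sub-tags, the standard java.util.Locale class can parse them. This is only an illustration using the JDK; it is not how the framework itself represents locales, and the LanguageTagDemo class name is made up for this sketch.

```java
import java.util.Locale;

// Illustrative helper built on the JDK's BCP 47 support.
final class LanguageTagDemo {
    // Parses a tag and returns it with conventional casing
    // (language lower case, script title case, region upper case).
    static String canonical(String tag) {
        return Locale.forLanguageTag(tag).toLanguageTag();
    }

    // Lists the language, script and region sub-tags of a tag.
    static String describe(String tag) {
        Locale loc = Locale.forLanguageTag(tag);
        return "language=" + loc.getLanguage()
            + " script=" + loc.getScript()
            + " region=" + loc.getCountry();
    }
}
```

For the two example tags above, describe() reports the language "es" with region "419", and the language "zh" with script "Hant" and region "TW". Tags are case-insensitive, so canonical() returns the same result whatever casing is passed in.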

Monolingual Documents

Monolingual documents have their content in a single main language/locale. In such documents any target data replaces the source data. Examples of monolingual documents are Java properties files and OpenDocument files. Note that such documents may contain text in different languages (like citations), but their structure is designed to have a single main language.

Before starting to send events:

The caller of the filter must set the source language when opening the document.

When sending the START_DOCUMENT event:

The filter must indicate the source locale of the document in the StartDocument resource, using StartDocument.setLocale(). The caller of the filter can retrieve the source locale by calling StartDocument.getLocale().

At any time when sending an event:

The filter should try to capture any information where the source language is defined and create a modifiable property for it. This allows the writer to update the language settings of the output to the target language. The filter should create read-only properties for language information that does not refer to the source language.

Multilingual Documents

Multilingual documents have their content in several languages. Their structure is designed to hold the same content in different languages. The actual input document may contain only the source-language content, but any target data is placed alongside the source data rather than over it: the source data is never overwritten by the target data. Examples of multilingual documents are PO files and XLIFF files.

Before starting to send events:

The caller of the filter must set the source and the target languages when opening the document.

When the filter sends the START_DOCUMENT event:

The filter must indicate that the document is multilingual in the StartDocument resource using StartDocument.setMultilingual(). The caller of the filter can retrieve that information using StartDocument.isMultilingual().
The filter must indicate the source locale in the StartDocument resource using StartDocument.setLocale(). The caller of the filter can retrieve the source locale by calling StartDocument.getLocale().

At any time when sending an event:

The filter should try to capture any information where the source locale is defined and create a modifiable property for it. This allows the writer to update the locale settings of the output to the target language. The filter should create read-only properties for locale information that does not refer to the source locale.

Encodings

Input

The filter is responsible for detecting the encoding of the input document when possible. If the encoding cannot be detected, the filter should use the default encoding the caller has provided when opening the document.

If the input encoding is UTF-8, the filter must handle the possible presence of a Byte-Order-Mark and, if one exists, must not include it in the content of the document.

When sending the START_DOCUMENT event:

The filter must set the encoding used for the input in the StartDocument resource. If the encoding is UTF-8, the filter must also indicate whether the original document has a BOM or not. If the encoding is not UTF-8, that indicator must be set to false (including for other UTF encodings). Both values are set using the StartDocument.setEncoding() method.
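The input rules above could be sketched as follows. This is a simplified illustration using only the JDK: the DecodedInput and InputDecoder names are hypothetical, and the only detection performed is the UTF-8 BOM check (a real filter may detect encodings in other ways, such as an XML encoding declaration).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for what a filter would report in START_DOCUMENT.
final class DecodedInput {
    final String encoding;     // encoding actually used to read the input
    final boolean hasUtf8Bom;  // true only for UTF-8 input that started with a BOM
    final String content;      // decoded text, never including the BOM

    DecodedInput(String encoding, boolean hasUtf8Bom, String content) {
        this.encoding = encoding;
        this.hasUtf8Bom = hasUtf8Bom;
        this.content = content;
    }
}

final class InputDecoder {
    // Decodes raw bytes, falling back to the caller's default encoding
    // when no UTF-8 BOM is found.
    static DecodedInput decode(byte[] raw, String defaultEncoding) {
        boolean bom = raw.length >= 3
            && raw[0] == (byte) 0xEF
            && raw[1] == (byte) 0xBB
            && raw[2] == (byte) 0xBF;
        if (bom) {
            // Detected UTF-8: skip the three BOM bytes so they are not in the content
            String text = new String(raw, 3, raw.length - 3, StandardCharsets.UTF_8);
            return new DecodedInput("UTF-8", true, text);
        }
        // Nothing detected: use the default; the BOM flag stays false
        Charset cs = Charset.forName(defaultEncoding);
        return new DecodedInput(cs.name(), false, new String(raw, cs));
    }
}
```

Note that the flag is false both for UTF-8 input without a BOM and for any non-UTF-8 input, which matches the rule stated above.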

Output

The encoding of an output generated by a writer (IFilterWriter object) is not necessarily the same as the input encoding. The encoding to use is set by the caller through the IFilterWriter.setOption() method. If the output encoding specified by that method is null, the writer should use the same encoding as the input document.

The writer should create the output document only when or after receiving the START_DOCUMENT event. The StartDocument resource contains information such as:

The encoding of the input document.
A flag set to true if the input encoding was UTF-8 and was using a Byte-Order-Mark, set to false in all other cases.
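A writer could combine these two pieces of information with the caller's requested encoding roughly as shown below. This is a sketch under assumptions: the OutputEncodingSketch class is hypothetical, and the policy of re-emitting a BOM only when the output is UTF-8 and the input had one is an assumption made for illustration, not a rule stated in this guide.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class OutputEncodingSketch {
    // Falls back to the input encoding when the caller requested none (null).
    static Charset resolve(String requested, String inputEncoding) {
        return Charset.forName(requested != null ? requested : inputEncoding);
    }

    // Encodes the content, writing a UTF-8 BOM first only if the output is UTF-8
    // and the input document had one (assumed policy, see lead-in).
    static byte[] encode(String content, String requested,
                         String inputEncoding, boolean inputHadUtf8Bom) {
        Charset cs = resolve(requested, inputEncoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (inputHadUtf8Bom && StandardCharsets.UTF_8.equals(cs)) {
            out.write(0xEF);
            out.write(0xBB);
            out.write(0xBF);
        }
        byte[] body = content.getBytes(cs);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }
}
```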

Line-Breaks

Input

The filter is responsible for detecting the type of line-break of the input document when possible. If the type of line-break cannot be detected (for example, because the input document has no line-breaks), the filter should assume the type of line-break used by the current platform.

All line-breaks passed in resource objects, for example the content of a TextUnit object, must be normalized to a single line-feed character ("\n").
All line-breaks inside skeleton objects must be of the same type as in the input (i.e. remain unchanged).

Output

The writer must not change the type of line-break of a document. The original type of line-break is available in the StartDocument resource (StartDocument.getLineBreak()).

The reason for always using the original line-breaks in the output when using the filters is that the writer cannot change the line-breaks in the skeleton parts, so if different types of line-breaks are used in the skeleton and in the extracted content the output would end up with a mix of line-break types.
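The line-break handling described in this section can be sketched in three small steps: detect the input type, normalize extracted content to "\n", and restore the original type on output. The LineBreaks class below is a hypothetical helper written for this illustration; it treats the first line-break found as the type of the whole document, which is a simplification.

```java
final class LineBreaks {
    // Returns "\r\n", "\r" or "\n" based on the first line-break found,
    // or the platform line separator if the text has no line-break.
    static String detect(String text) {
        int lf = text.indexOf('\n');
        int cr = text.indexOf('\r');
        if (cr >= 0 && (lf < 0 || cr < lf)) {
            // The first break starts with CR: either CR+LF or a lone CR
            boolean crlf = cr + 1 < text.length() && text.charAt(cr + 1) == '\n';
            return crlf ? "\r\n" : "\r";
        }
        if (lf >= 0) {
            return "\n";
        }
        return System.lineSeparator();
    }

    // Normalizes every line-break to a single line-feed, as required for
    // content passed in resource objects.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Converts normalized content back to the original line-break type
    // when writing the output.
    static String restore(String text, String lineBreak) {
        return "\n".equals(lineBreak) ? text : text.replace("\n", lineBreak);
    }
}
```

In this sketch the filter would call detect() once on the input and normalize() on each extracted text, while the writer would call restore() with the detected value so that extracted content and skeleton end up with the same line-break type.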

The TextFragment class offers methods to manipulate the text and codes.

Filter Parameters

A filter may have specific parameters associated with it. These parameters are specific to each filter and indicate various specialized processing options.

A filter that has parameters must always generate a default set of parameters upon creation. The IFilter.getParameters() method may be called at any time after the object has been created.

The filter parameters must be accessible as an object that implements the IParameters interface. The IFilter.getParameters() and IFilter.setParameters() methods allow you to get and set the parameters of each filter.

Some filters may provide a way to edit the parameters in a dialog box, through the IParametersEditor interface. If no such editor exists for a given filter, the file where the parameters are stored must be editable with a simple text editor. That is why the format for storing filter parameters must be some kind of text-based format (e.g. a properties file).

The code below shows how to use the filter parameters:

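// Imports assume the usual Okapi package layout
import net.sf.okapi.common.IParameters;
import net.sf.okapi.common.IParametersEditor;
import net.sf.okapi.common.filters.IFilter;
import net.sf.okapi.filters.properties.PropertiesFilter;

// Create the filter we will use
IFilter filter = new PropertiesFilter();
// Get the default parameters (null if the filter has no parameters)
IParameters params = filter.getParameters();
if ( params != null ) {
   // Save the defaults before any change
   params.save("defaultParameters.txt");
   // Create the editor class
   IParametersEditor editor = new net.sf.okapi.filters.ui.properties.Editor();
   // Save the parameters again only if the user accepted the changes
   if ( editor.edit(params, null, null, null) ) {
      params.save("editedParameters.txt");
   }
}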

Note that the editor classes may be in a different package than the filter itself, as the UI may be platform-dependent while the filter is not.
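To illustrate why a text-based storage format matters, here is a minimal sketch of a parameters object that starts with defaults and round-trips through a properties-style text. It uses only java.util.Properties and does not implement IParameters; the SimpleFilterParameters class and its two options (extractComments, keyPattern) are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical parameters object; the field initializers are the defaults
// available as soon as the object is created.
final class SimpleFilterParameters {
    boolean extractComments = false;
    String keyPattern = ".*";

    // Serializes the parameters to a text that any plain text editor can change.
    String toText() {
        Properties p = new Properties();
        p.setProperty("extractComments", Boolean.toString(extractComments));
        p.setProperty("keyPattern", keyPattern);
        StringWriter out = new StringWriter();
        try {
            p.store(out, "Filter parameters");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Reads the parameters back; missing entries keep their default values.
    void fromText(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        extractComments = Boolean.parseBoolean(p.getProperty("extractComments", "false"));
        keyPattern = p.getProperty("keyPattern", ".*");
    }
}
```

Because the stored form is plain key=value text, a user can adjust an option by hand even when no dedicated editor is available for the filter.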