Okapi Framework - Developer's Guide: Filters
Filter Events
A filter sends at least two events: START_DOCUMENT and END_DOCUMENT. All other filter events may or may not be sent, depending on the filter and the input document. The possible sequence of filter events can be expressed as follows:
FilterEvents ::= START_DOCUMENT, DocumentContentEvents, END_DOCUMENT
DocumentContentEvents ::= (SubDocumentEvents | SimpleEvents)*
SubDocumentEvents ::= START_SUBDOCUMENT, SimpleEvents, END_SUBDOCUMENT
SimpleEvents ::= (GroupEvents | TEXT_UNIT | DOCUMENT_PART)*
GroupEvents ::= START_GROUP, SimpleEvents, END_GROUP
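The grammar above can be checked mechanically with a stack of open containers. The following is a minimal sketch, not framework code: FilterEventType and EventSequenceChecker are hypothetical stand-ins introduced for illustration, and only the event names are taken from the grammar.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the event types; only the names come from the grammar. */
enum FilterEventType {
    START_DOCUMENT, END_DOCUMENT,
    START_SUBDOCUMENT, END_SUBDOCUMENT,
    START_GROUP, END_GROUP,
    TEXT_UNIT, DOCUMENT_PART
}

final class EventSequenceChecker {

    /** Returns true if the events form a sequence allowed by the FilterEvents grammar. */
    static boolean isValid(List<FilterEventType> events) {
        // A document needs at least START_DOCUMENT first and END_DOCUMENT last.
        if (events.size() < 2
                || events.get(0) != FilterEventType.START_DOCUMENT
                || events.get(events.size() - 1) != FilterEventType.END_DOCUMENT) {
            return false;
        }
        // Stack of containers (sub-document or group) that are currently open.
        Deque<FilterEventType> open = new ArrayDeque<>();
        for (FilterEventType type : events.subList(1, events.size() - 1)) {
            switch (type) {
                case START_SUBDOCUMENT:
                    // Sub-documents sit directly under the document: no nesting, never inside a group.
                    if (!open.isEmpty()) return false;
                    open.push(type);
                    break;
                case START_GROUP:
                    // Groups may appear at any level and may contain other groups.
                    open.push(type);
                    break;
                case END_SUBDOCUMENT:
                    if (open.isEmpty() || open.pop() != FilterEventType.START_SUBDOCUMENT) return false;
                    break;
                case END_GROUP:
                    if (open.isEmpty() || open.pop() != FilterEventType.START_GROUP) return false;
                    break;
                case TEXT_UNIT:
                case DOCUMENT_PART:
                    break;
                default:
                    // START_DOCUMENT or END_DOCUMENT in the middle of the sequence.
                    return false;
            }
        }
        // Every opened container must have been closed.
        return open.isEmpty();
    }

    public static void main(String[] args) {
        List<FilterEventType> nested = Arrays.asList(
                FilterEventType.START_DOCUMENT,
                FilterEventType.START_GROUP, FilterEventType.TEXT_UNIT, FilterEventType.END_GROUP,
                FilterEventType.END_DOCUMENT);
        System.out.println(isValid(nested));
    }
}
```

A check of this kind can be useful in unit tests for a new filter: collect the event types the filter emits for a sample document and verify that the sequence is well-formed before looking at the content of the resources.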
The START_DOCUMENT event is sent as the first event for the document. It is associated with a StartDocument resource that contains general information about the document. Each StartDocument resource is expected to have at least the following information:
- getName() gives the full path or URI of the document, if possible.
- getEncoding() gives the encoding that is being used to read the document. It may differ from the default encoding provided when the document is opened, for example when the filter can automatically detect the real encoding of the document.
- hasUTF8BOM() indicates whether the input document is encoded in UTF-8 and has a Byte-Order-Mark.
- getLocale() gives the code of the source locale of the document (the same as the one specified when opening the document).
- isMultilingual() indicates whether the document is multilingual (see more on monolingual and multilingual documents in the Languages section).
- getLineBreak() gives the type of line-break used in the document.
- getFilterParameters() gives the parameters used for processing this document (including when they are the default parameters). It may return null if the filter does not use parameters at all.
- getFilter() gives the filter that is being used to parse this document.
Each START_DOCUMENT event must have a corresponding END_DOCUMENT event sent as the last event for this document.
The END_DOCUMENT event is sent to close a previous START_DOCUMENT event. It is associated with an Ending resource.
The START_SUBDOCUMENT event may be sent by a filter when the input document is composed of several separate logical parts. For example, an IDML document (InDesign file) is really a ZIP file that may contain dozens of different XML documents (stories) that may have translatable text: each one is a sub-document. Another example is an XLIFF document: it may be composed of several <file> elements, each corresponding to a separate sub-document.
The END_SUBDOCUMENT event is sent to close a previous START_SUBDOCUMENT event. It is associated with an Ending resource.
The START_GROUP event may be sent by a filter to indicate the start of some logical grouping of events, for example to indicate the start of a <table>, a <script>, or a <style> element in an HTML document, or a dialog box in a Windows RC file. A group may contain other groups. Each START_GROUP event must have a corresponding END_GROUP event, and groups must not overlap. The START_GROUP event is associated with a StartGroup resource.
The END_GROUP event is sent to close a previous START_GROUP event. It is associated with an Ending resource.
The DOCUMENT_PART event may be sent by a filter to carry parts of the original document that do not contain directly translatable text. It is associated with a DocumentPart resource. Note that a DocumentPart may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties. All translatable text is always passed through using the TEXT_UNIT event.
The TEXT_UNIT event may be sent by a filter to carry parts of the original document that have extractable text. Note that a TextUnit may have read-only or modifiable properties, and may have references to previous events that have translatable text or other read-only or modifiable properties. A TextUnit resource provides various information related to the extracted text:
- getName() gives the original identifier of the resource, for example the key of an entry in a Java properties file. If the text unit has no such identifier, this method should return null.
- getId() gives the unique extraction ID for this text unit. This value is filter-specific and is only meaningful for the filter. It can be sequential or not, contiguous or not, made of numbers or names: basically anything. Some filters may return the same values for getId() and getName(), but the two values represent different things.
- isTranslatable() indicates whether this text unit is translatable.
- getMimeType() gives the type of content the text unit contains.
- preserveWhitespaces() indicates whether the white spaces inside the content of the text unit must be preserved (for example, as in the content of an HTML <pre> element).
Filters should make sure that the text units they create can have text added or removed anywhere within the text unit, including before the first inline code and after the last inline code.
The documents a filter processes can be monolingual or multilingual. The filter is responsible for knowing if it processes a monolingual or a multilingual document.
Note: Locales and Languages. Nowadays there is not much difference between a language code and a locale code, as the language tags of BCP-47 include sub-tags that represent various regional or special variants, as well as script differences. For example, ES-419 stands for Spanish for Latin America and the Caribbean, zh-Hant-tw for Traditional Chinese used in Taiwan, etc. For more information about BCP-47 see http://www.w3.org/International/articles/bcp47. The terms locale and language are sometimes used interchangeably in this documentation.
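To make the sub-tag structure concrete, the sketch below parses the two example tags with the standard java.util.Locale class. It is only an illustration of BCP-47 parsing in plain Java and does not use the framework's own locale type; the LanguageTagDemo class and its describe() helper are hypothetical names.

```java
import java.util.Locale;

final class LanguageTagDemo {

    /** Parses a BCP-47 tag and returns "language|script|region" for inspection. */
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        return locale.getLanguage() + "|" + locale.getScript() + "|" + locale.getCountry();
    }

    public static void main(String[] args) {
        // Tags are case-insensitive; Java normalizes the case of each sub-tag.
        System.out.println(describe("ES-419"));     // prints es||419
        System.out.println(describe("zh-Hant-tw")); // prints zh|Hant|TW
        System.out.println(Locale.forLanguageTag("zh-Hant-tw").toLanguageTag()); // prints zh-Hant-TW
    }
}
```

The region sub-tag 419 is a numeric area code rather than a country code, which is why the same tag can serve as both a language code and a locale code.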
Monolingual documents have their content in a single main language/locale. In such documents, any target data replaces the source data. Examples of monolingual documents are Java properties files and OpenDocument files. Note that such documents may contain text in different languages (like citations), but their structure is designed to have a single main language.
Before starting to send events:
When sending the START_DOCUMENT event:
- The filter sets the source locale in the StartDocument resource, using StartDocument.setLocale(). The caller of the filter can retrieve the source locale by calling StartDocument.getLocale().
At any time when sending an event:
Multilingual documents have their content in several languages. They have a structure that is designed to hold the same content in different languages. The actual input document may contain only the source language content, but any target data will be placed along with the source data rather than over it. In such documents the source data is not overwritten by the target data. Examples of multilingual documents are PO files and XLIFF files.
Before starting to send events:
When the filter sends the START_DOCUMENT event:
- The filter indicates that the document is multilingual in the StartDocument resource using StartDocument.setMultilingual(). The caller of the filter can retrieve that information using StartDocument.isMultilingual().
- The filter sets the source locale in the StartDocument resource using StartDocument.setLocale(). The caller of the filter can retrieve the source locale by calling StartDocument.getLocale().
At any time when sending an event:
The filter is responsible for detecting the encoding of the input document when possible. If the encoding cannot be detected, the filter should use the default encoding the caller has provided when opening the document.
If the input encoding is UTF-8, the filter must handle the possible presence of a Byte-Order-Mark and, if one exists, not include it in the content of the document.
When sending the START_DOCUMENT event: The filter must set the encoding used for the input in the StartDocument resource. If the encoding is UTF-8, the filter must also indicate whether the original document has a BOM or not. If the encoding is not UTF-8, that indicator must be set to false (including for other UTF encodings). Both parameters are set using the StartDocument.setEncoding() method.
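The BOM rules above can be illustrated in plain Java. This is a minimal sketch that works on a byte array already read from the input; Utf8BomSketch and its methods are hypothetical helpers, not part of the framework. Note that Java's built-in UTF-8 decoder does not drop the BOM by itself, so the offset has to be applied explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8BomSketch {

    // The UTF-8 encoding of U+FEFF, the byte-order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the raw bytes start with the UTF-8 byte-order mark. */
    static boolean hasUtf8Bom(byte[] raw) {
        return raw.length >= UTF8_BOM.length
                && raw[0] == UTF8_BOM[0] && raw[1] == UTF8_BOM[1] && raw[2] == UTF8_BOM[2];
    }

    /** BOM indicator as described above: only ever true when the encoding is UTF-8. */
    static boolean bomIndicator(Charset encoding, byte[] raw) {
        return StandardCharsets.UTF_8.equals(encoding) && hasUtf8Bom(raw);
    }

    /** Decodes UTF-8 bytes, leaving the BOM (if any) out of the returned content. */
    static String decodeWithoutBom(byte[] raw) {
        int offset = hasUtf8Bom(raw) ? UTF8_BOM.length : 0;
        return new String(raw, offset, raw.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x6B, 0x65, 0x79};
        System.out.println(bomIndicator(StandardCharsets.UTF_8, raw)); // prints true
        System.out.println(decodeWithoutBom(raw));                     // prints key
    }
}
```

In this sketch the value returned by bomIndicator() is what a filter would report alongside the encoding name, and decodeWithoutBom() shows the content the filter would actually parse.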
The encoding of an output generated by a writer (IFilterWriter object) is not necessarily the same as the input encoding. The encoding to use is set by the caller through the IFilterWriter.setOption() method. If the output encoding specified by that method is null, the filter should use the same encoding as the input document.
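The fallback rule for the output encoding amounts to a single null check. The sketch below shows it with the standard java.nio.charset.Charset class; OutputEncodingSketch and resolve() are hypothetical names used for illustration and do not reflect how a particular writer stores its options.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class OutputEncodingSketch {

    /**
     * Picks the encoding for the output: the one requested by the caller,
     * or the input encoding when no output encoding was specified.
     */
    static Charset resolve(String requestedOutputEncoding, Charset inputEncoding) {
        return requestedOutputEncoding == null
                ? inputEncoding
                : Charset.forName(requestedOutputEncoding);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, StandardCharsets.UTF_8));         // prints UTF-8
        System.out.println(resolve("ISO-8859-1", StandardCharsets.UTF_8)); // prints ISO-8859-1
    }
}
```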
The writer should create the output document only when or after receiving the START_DOCUMENT event. The StartDocument resource contains information such as:
The filter is responsible for detecting the type of line-break of the input document when possible. If the type of line-break cannot be detected (for example the input document has no line-breaks), the filter should assume the type of line-break to use is the one of the current platform.
Line-breaks in the extracted content, that is, in the content of a TextUnit object, must be standardized to a single line-feed character ("\n").
The writer must not change the type of line-break of a document. The original type of line-break is available in the StartDocument resource (StartDocument.getLineBreak()).
The reason for always using the original line-breaks in the output when using the filters is that the writer cannot change the line-breaks in the skeleton parts, so if different types of line-breaks are used in the skeleton and in the extracted content the output would end up with a mix of line-break types.
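The line-break handling described in this section has three steps: detect the original type, standardize the extracted content to "\n", and restore the original type on output. The following is a minimal sketch in plain Java; LineBreakSketch and its methods are hypothetical helpers, and detection is simplified to looking at the first line-break found.

```java
final class LineBreakSketch {

    /** Detects the first line-break type in the text; falls back to the platform's if none is found. */
    static String detect(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
            if (c == '\n') {
                return "\n";
            }
        }
        // No line-break in the input: assume the type used by the current platform.
        return System.lineSeparator();
    }

    /** Standardizes every line-break in extracted content to a single line-feed. */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Restores the original line-break type when writing the content back out. */
    static String restore(String normalized, String lineBreak) {
        return "\n".equals(lineBreak) ? normalized : normalized.replace("\n", lineBreak);
    }

    public static void main(String[] args) {
        String input = "key1=One\r\nkey2=Two\r\n";
        String lineBreak = detect(input);          // "\r\n"
        String extracted = normalize(input);       // content uses "\n" only
        System.out.println(restore(extracted, lineBreak).equals(input)); // prints true
    }
}
```

Keeping the detected type next to the normalized content is what lets the output match the skeleton parts, which still carry their original line-breaks.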
The TextFragment class offers methods to manipulate the text and codes.
A filter may have specific parameters associated to it. These parameters are specific to each filter and indicate various specialized processing options.
A filter that has parameters must always generate a default set of parameters upon creation. The IFilter.getParameters() method may be called at any time after the object has been created.
The filter parameters must be accessible as an object that implements the IParameters interface. The IFilter.getParameters() and IFilter.setParameters() methods allow you to get and set the parameters of each filter.
Some filters may have a way to edit the parameters in a dialog box, through the IParametersEditor interface. If no such editor exists for a given filter, the file where the parameters are stored must be editable with a simple text editor. That is why the format for storing filter parameters must be some kind of text-based format (e.g. a properties file).
The code below shows how to use the filter parameters:
// Create the filter we will use
IFilter filter = new PropertiesFilter();
// Get the default parameters
IParameters params = filter.getParameters();
if ( params != null ) {
    params.save("defaultParameters.txt");
    // Create the editor class
    IParametersEditor editor = new net.sf.okapi.filters.ui.properties.Editor();
    if ( editor.edit(params, null, null, null) ) {
        params.save("editedParameters.txt");
    }
}
Note that the editor classes may be in a different package than the filter itself, as the UI may be platform-dependent while the filter is not.