TXML Filter

From Okapi Framework

Overview

The TXML Filter is an Okapi component that implements the IFilter interface for Wordfast Pro TXML documents. TXML is a proprietary XML-based bilingual format used by Wordfast Pro and supported by several other tools. There are no official public specifications available.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, it is used.
  • Otherwise, UTF-8 is used as the default encoding, regardless of the default encoding that was specified when opening the document.
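The decision above can be sketched in a few lines of Python. This is only an illustration of the rule, not the filter's actual Java code: the function name, the caller_default argument and the regular expression are assumptions made for the example, and the sketch only handles ASCII-compatible encodings.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECLARATION = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the declared encoding if there is one, otherwise UTF-8.

    caller_default stands for the default encoding given when the document
    was opened (hypothetical parameter). It is deliberately never used.
    """
    # Skip a UTF-8 byte-order mark so the declaration is found at offset 0.
    data = head[3:] if head.startswith(b"\xef\xbb\xbf") else head
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"
```

For example, choose_input_encoding(b'<txml/>', caller_default="windows-1252") falls through to the UTF-8 default because the document carries no declaration.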

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; if it did not, one is added automatically.
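As a rough illustration of the output rules, the following Python sketch separates the Byte-Order-Mark decision from the declaration update. The function names are hypothetical, and the exact form of an added declaration (version 1.0 followed by a line-break) is an assumption; the filter itself is written in Java and may differ in detail.

```python
import codecs
import re

_XML_DECL = re.compile(r'^<\?xml[^>]*\?>')
_ENCODING = re.compile(r'encoding\s*=\s*["\'][^"\']*["\']')
_VERSION = re.compile(r'version\s*=\s*["\'][^"\']*["\']')


def utf8_output_needs_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """BOM rule for UTF-8 output: keep a BOM only if a UTF-8 input had one."""
    input_is_utf8 = codecs.lookup(input_encoding).name == "utf-8"
    return input_is_utf8 and input_had_bom


def with_encoding_declaration(text: str, output_encoding: str) -> str:
    """Update the XML encoding declaration, or add one if it is missing."""
    attr = f'encoding="{output_encoding}"'
    decl = _XML_DECL.match(text)
    if decl is None:
        # No declaration at all: add one (assumed form).
        return f'<?xml version="1.0" {attr}?>\n' + text
    old = decl.group(0)
    if _ENCODING.search(old):
        new = _ENCODING.sub(lambda m: attr, old, count=1)
    else:
        # Declaration without encoding: insert it right after the version.
        new = _VERSION.sub(lambda m: m.group(0) + " " + attr, old, count=1)
    return new + text[len(old):]
```

The sketch only covers UTF-8 output for the Byte-Order-Mark decision, since that is the only case described here.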

Line-Breaks

The output uses the same type of line-break as the original input.
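One way to picture this behavior is the Python sketch below, which detects the line-break style of the input and reapplies it to the output. Detecting the style from the first line-break encountered is an assumption of this example, and both function names are hypothetical.

```python
def detect_line_break(text: str) -> str:
    # Return the first line-break style found: CR+LF, CR alone, or LF.
    # Falls back to LF when the text has no line-break at all.
    for i, ch in enumerate(text):
        if ch == "\r":
            return "\r\n" if text[i + 1 : i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"


def restore_line_breaks(text: str, line_break: str) -> str:
    # Normalize every line-break to LF first, then apply the detected style.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", line_break)
```

A round trip would call detect_line_break on the input document and pass the result to restore_line_breaks when writing the output.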

Segmentation

TXML files are organized into blocks of one or more segments. They are therefore always segmented, in the sense that each block has at least one segment.

The content of each block is extracted as a single text unit with as many segments as there are in the block.
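The mapping from blocks to text units can be sketched as follows. Because TXML has no public specification, the element and attribute names in the sample (translatable, blockId, segment, source, target) are assumptions made for illustration, and the TextUnit and Segment classes are simplified stand-ins rather than Okapi's own resource classes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical TXML-like sample: two blocks, the first with two segments.
SAMPLE = """<txml>
  <translatable blockId="1">
    <segment><source>First sentence.</source><target>Première phrase.</target></segment>
    <segment><source>Second sentence.</source></segment>
  </translatable>
  <translatable blockId="2">
    <segment><source>Only one.</source></segment>
  </translatable>
</txml>"""


@dataclass
class Segment:
    source: str
    target: Optional[str] = None  # None when the segment has no translation


@dataclass
class TextUnit:
    block_id: Optional[str]
    segments: List[Segment] = field(default_factory=list)


def extract_text_units(xml_text: str) -> List[TextUnit]:
    """Build one text unit per block, with one segment per <segment>."""
    units = []
    for block in ET.fromstring(xml_text).iter("translatable"):
        unit = TextUnit(block.get("blockId"))
        for seg in block.iter("segment"):
            target = seg.find("target")
            unit.segments.append(Segment(
                source=seg.findtext("source", default=""),
                target=None if target is None else (target.text or ""),
            ))
        units.append(unit)
    return units
```

Calling extract_text_units(SAMPLE) yields two text units, holding two segments and one segment respectively.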

Existing Translations

Segmented entries may have translations. In this case the text of the target is extracted along with the source.

If the segment is labeled with gtmt="true" and not with modified="true", the source and target are also set as an AltTranslationsAnnotation.
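The condition on the two attributes can be written as a small predicate. This Python sketch assumes the segment's attributes are available as a dictionary of strings; the function name is hypothetical, and the sketch shows only the test, not how the annotation itself is built.

```python
def should_add_alt_translation(segment_attributes: dict) -> bool:
    """True when the segment has gtmt="true" and does not have modified="true"."""
    gtmt = segment_attributes.get("gtmt") == "true"
    modified = segment_attributes.get("modified") == "true"
    return gtmt and not modified
```

A segment carrying both gtmt="true" and modified="true" therefore gets no alternate translation in this sketch.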

Revisions

If a segment has revisions, only the latest translation is extracted. The translations in the <revisions> elements are ignored.

The filter does not create revision entries. If the original TXML document contained a translation, the updated translation overwrites it, and no revision holding the pre-merge translation is added to the TXML document.

Parameters

Allow empty target segments in output — Set this option to allow empty translations to be written out as such in output. If this option is not set, a copy of the source is used in place of the empty translation.
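The effect of this option on a single segment can be sketched as below. The function and the allow_empty_targets argument are hypothetical names chosen for the example; they are not the identifiers used in the filter's parameters.

```python
def target_text_for_output(source: str, target: str, allow_empty_targets: bool) -> str:
    """Pick the text written to the target of a segment on output."""
    if target == "" and not allow_empty_targets:
        # Option not set: an empty translation is replaced by a copy of the source.
        return source
    # Option set, or the translation is not empty: write the target as it is.
    return target
```

With the option set, an empty translation stays empty in the output; without it, the source text is written in its place.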

Limitations

  • The TXML format cannot hold different source and target content for the parts between segments. If any of those parts changes after extraction, the difference cannot be represented when merging, and only the source representation is preserved.