TTX Filter

From Okapi Framework
Revision as of 17:31, 13 October 2011 by Ysavourel

Overview

The TTX Filter is an Okapi component that implements the IFilter interface for Trados TTX documents. TTX is an XML-based bilingual format used by some versions of the Trados tools and supported by several other tools. No official public specification is available.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, that encoding is used.
  • Otherwise, UTF-8 is used as the default encoding, regardless of the default encoding that was specified when opening the document.
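
The rule above can be sketched in a few lines of Python. This is an illustration only, not the filter's actual code: the function name `choose_input_encoding`, the `caller_default` parameter and the NUL-stripping heuristic for UTF-16 input are all assumptions made for the example.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration, for example
# <?xml version="1.0" encoding="UTF-16"?>
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Return the encoding to use for a document whose first bytes are `head`.

    Hypothetical helper: `caller_default` stands for the default encoding
    given when opening the document. It is deliberately ignored, as the
    rule above describes.
    """
    # Rough heuristic: drop NUL bytes so a UTF-16 declaration can be matched,
    # then keep only the ASCII characters (this also discards a BOM).
    text = head.replace(b"\x00", b"").decode("ascii", errors="ignore")
    match = _DECL.search(text)
    if match:
        return match.group(1)  # the declared encoding wins
    return "UTF-8"  # no declaration: always fall back to UTF-8
```

With this sketch, a document starting with `<?xml version="1.0" encoding="ISO-8859-1"?>` is read as ISO-8859-1, while a document without a declaration is read as UTF-8 even if the caller asked for another default.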

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
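
As a minimal sketch of the two rules above, the decision can be written as a small function. The name `output_has_bom` and its parameters are hypothetical, and the behavior for non-UTF-8 output is not described in this section, so the function returns `None` in that case.

```python
import codecs


def _is_utf8(name: str) -> bool:
    # codecs.lookup() normalizes aliases such as "UTF8" or "utf_8" to "utf-8".
    return codecs.lookup(name).name == "utf-8"


def output_has_bom(output_encoding: str, input_encoding: str, input_had_bom: bool):
    """Return True or False for UTF-8 output, or None when the rule does not apply."""
    if not _is_utf8(output_encoding):
        return None  # not covered by the rules above
    if _is_utf8(input_encoding):
        return input_had_bom  # mirror what was detected in the input
    return False  # input was not UTF-8: never write a BOM
```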

Line-Breaks

The output uses the same type of line-break as the original input.
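
One way to picture this behavior is the sketch below. It assumes the document uses a single type of line-break throughout, and both function names are made up for the illustration; how the filter itself detects the line-break type is not specified here.

```python
def detect_line_break(text: str) -> str:
    """Guess the line-break type of `text` (assumes one type is used throughout)."""
    if "\r\n" in text:
        return "\r\n"  # Windows
    if "\r" in text:
        return "\r"  # old Mac style
    return "\n"  # Unix, also the fallback when no line-break is found


def apply_line_break(text: str, line_break: str) -> str:
    """Rewrite every line-break in `text` to `line_break`."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", line_break)
```

A writer would call `detect_line_break` once on the input and pass the result to `apply_line_break` when producing the output.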

Segmentation

TTX cannot represent target text unless the text is segmented. The output of the filter therefore includes any new <Tu> and <Tuv> elements needed.

In auto-detection mode, the filter tries to detect whether the file has at least one existing segment. If an existing segment is detected, only the text in the existing segments is extracted (mode 1). If no segment is detected, all text is extracted (mode 2).

You can also force the filter to extract only the text in existing segments, or to extract all text, whether it is segmented or not.

For example, in mode 2 (extract all), the following content:

...<Raw>
Part 1 <Tu MatchPercent="0">
<Tuv Lang="EN">Part 2</Tuv></Tu> Part 3.
</Raw>...

will be extracted as a single text unit with three segments:

[Part 1 ][Part 2][ Part 3.]

But in mode 1 (extract existing segments only), only the segmented parts are extracted:

[Part 2]

Note: Segmentation is often a cause of interoperability issues. For better compatibility with the tool that created the TTX files, it is recommended to work with pre-segmented documents.
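
The three extraction modes can be reproduced on the example above with a short Python sketch. It is a simplification made under several assumptions: it works on a single <Raw> fragment rather than a whole file, it ignores inline codes and whitespace handling, it treats the first <Tuv> of each <Tu> as the source, and the `extract` function with its mode names (`auto`, `existing`, `all`) is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<Raw>Part 1 <Tu MatchPercent="0">'
    '<Tuv Lang="EN">Part 2</Tuv></Tu> Part 3.</Raw>'
)


def extract(raw_xml: str, mode: str = "auto") -> list:
    """Return the extracted segments of a <Raw> fragment as a list of strings."""
    raw = ET.fromstring(raw_xml)
    tus = raw.findall("Tu")  # existing segments (direct children only)
    if mode == "auto":
        # Auto-detection: one existing segment is enough to select mode 1.
        mode = "existing" if tus else "all"
    segments = []
    if mode == "all" and raw.text:
        segments.append(raw.text)  # unsegmented text before the first <Tu>
    for tu in tus:
        source = tu.find("Tuv")  # assumption: the first <Tuv> is the source
        segments.append("".join(source.itertext()))
        if mode == "all" and tu.tail:
            segments.append(tu.tail)  # unsegmented text after this <Tu>
    return segments
```

Calling `extract(SAMPLE, "all")` yields the three parts, including the unsegmented text around the <Tu> element, while `extract(SAMPLE, "existing")` and `extract(SAMPLE, "auto")` yield only the content of the existing segment.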

Existing Translations

Segmented entries may have translations. In this case the text of the target is extracted along with the source.

  • If the MatchPercent attribute of the TTX segment is above 100 and the Origin attribute is set to xtranslate, the translation is also added as an alternate translation annotation for the extracted segment, and its match type is set to EXACT_LOCAL_CONTEXT. Those entries correspond to the XU matches in TagEditor.
  • Otherwise, if the MatchPercent attribute of the TTX segment is above 99, the translation is also added as an alternate translation annotation for the extracted segment, and its match type is set to EXACT.
  • Otherwise, if the MatchPercent attribute of the TTX segment is above 1 and below 100, the translation is also added as an alternate translation annotation for the extracted segment, and its match type is set to FUZZY.

In all cases, if the translated segment has an Origin attribute, its value is carried over to the annotation.
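
The cascade of conditions above can be summarized as a small decision function. This is a sketch in Python rather than the filter's own code: the function name `classify_match` is hypothetical, and the match types are shown as plain strings using the names given above.

```python
from typing import Optional


def classify_match(match_percent: int, origin: Optional[str] = None) -> Optional[str]:
    """Return the match type for a translated segment, or None if no
    alternate translation annotation would be added."""
    if match_percent > 100 and origin == "xtranslate":
        return "EXACT_LOCAL_CONTEXT"  # XU matches in TagEditor
    if match_percent > 99:
        return "EXACT"
    if 1 < match_percent < 100:
        return "FUZZY"
    return None  # for example MatchPercent="0": no annotation
```

The order of the tests matters: a segment with a MatchPercent above 100 but an Origin other than xtranslate falls through to the second test and is treated as EXACT.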

Parameters

Extraction mode — Select the type of extraction to perform, based on possible existing segments.

  • Auto-detect existing segments: If at least one segment is detected, only existing segments are extracted. If no segment is detected, all text is extracted.
  • Extract only existing segments: Only existing segments are extracted. If the file is not pre-segmented no text is extracted.
  • Extract all: Extract all text, whether it is in existing segments or not.

Escape the greater-than characters in output — Set this option to have all greater-than characters ('>') escaped as &gt; in the output.
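
The effect of this option can be illustrated with the following sketch. The function `escape_text` and its `escape_gt` flag are made-up names for the example, and the sketch assumes that ampersands and less-than characters are always escaped, as XML requires in text content.

```python
def escape_text(text: str, escape_gt: bool = False) -> str:
    """Escape XML text content; '>' is escaped only when `escape_gt` is set."""
    # '&' must be handled first so the entities added below are not re-escaped.
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        out = out.replace(">", "&gt;")
    return out
```

For the input `a > b & c < d`, the option left unset gives `a > b &amp; c &lt; d`, and the option set gives `a &gt; b &amp; c &lt; d`.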

Limitations

The TTX element <df> may cause problems in some cases, for example when it spans an external tag. For instance, when the TTX file contains the extraction of the following HTML code:

<p>text in <b>bold</p>
<p>Bold</b> text</p>

The <df> element for bold crosses the paragraph boundaries, which are external tags in TTX. Such cases should be rare; please report any that you encounter.