TMX Filter



The TMX Filter is an Okapi component that implements the IFilter interface for TMX (Translation Memory eXchange) documents. The filter is implemented in the class net.sf.okapi.filters.tmx.TmxFilter of the Okapi library.

TMX is a LISA Standard that defines a file format for transporting translation memory data from one translation tool to another. The TMX 1.4b specification is at

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration it is used.
  • Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
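
The two rules above can be sketched in plain Java. This is only an illustration under stated assumptions, not the filter's actual code: the class and method names are hypothetical, the declaration is read from text that has already been decoded well enough to see its ASCII characters, and an unsupported declared encoding falls back to UTF-8 (the page does not describe that case).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Okapi.
final class InputEncodingSketch {

    // Captures the value of encoding="..." in an XML declaration at the start of the text.
    private static final Pattern ENCODING_DECL = Pattern.compile(
            "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z0-9][A-Za-z0-9._-]*)[\"']");

    // head: the first characters of the document.
    // callerDefault: the default encoding given when opening the document.
    static Charset chooseInputEncoding(String head, Charset callerDefault) {
        Matcher m = ENCODING_DECL.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            // Rule 1: an encoding declaration wins.
            return Charset.forName(m.group(1));
        }
        // Rule 2: no declaration, so UTF-8 is used; callerDefault is deliberately ignored.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(chooseInputEncoding(
                "<?xml version=\"1.0\" encoding=\"UTF-16\"?><tmx/>", latin1));
        System.out.println(chooseInputEncoding("<tmx version=\"1.4\"/>", latin1));
    }
}
```

The second call shows the point made in the parenthesis above: the caller asked for ISO-8859-1, but without a declaration the sketch still returns UTF-8.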

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
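
The Byte-Order-Mark rule for UTF-8 output reduces to a single boolean expression. The sketch below uses hypothetical names and covers only the UTF-8 case described above; behavior for other output encodings is not stated on this page, so the method rejects them rather than guessing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; it is not part of Okapi.
final class BomRuleSketch {

    // Returns true if a BOM should be written at the start of a UTF-8 output document.
    static boolean writeUtf8Bom(Charset input, boolean inputHadBom, Charset output) {
        if (!StandardCharsets.UTF_8.equals(output)) {
            // Assumption: other output encodings are outside the documented rule.
            throw new IllegalArgumentException("This sketch only covers UTF-8 output");
        }
        // A BOM is written only when the input was UTF-8 and carried a BOM itself.
        return StandardCharsets.UTF_8.equals(input) && inputHadBom;
    }
}
```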


Line-Breaks

The output uses the same type of line break as the original input.
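
One way to picture this behavior is to detect the first line break in the input and reuse it when writing. The sketch below is a hypothetical illustration, not the filter's implementation; falling back to LF when the input has no line break at all is an assumption.

```java
import java.util.regex.Matcher;

// Hypothetical helper for illustration only; it is not part of Okapi.
final class LineBreakSketch {

    // Returns the first line break found in the input: "\r\n", "\r" or "\n".
    static String detectLineBreak(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < input.length() && input.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return "\n"; // Assumption: default to LF when no line break is present.
    }

    // Rewrites every line break in the text to the given type.
    static String applyLineBreak(String text, String lineBreak) {
        return text.replaceAll("\\r\\n|\\r|\\n", Matcher.quoteReplacement(lineBreak));
    }
}
```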


Parameters

Read all target entries — Set this option to read all target <tuv> elements into the text unit. Otherwise, only the selected target is read and the remaining ones become part of the skeleton. Default is True. The effect of this setting depends on whether the subsequent pipeline steps can process multiple targets.

Group all document parts skeleton into one — Set this option to consolidate the skeleton parts and send fewer events through the pipeline. Default is True. This is sufficient in most cases, but a pipeline developer may sometimes need access to more fine-grained resources in the pipeline.

Exit when encountering invalid <tu>s — By default, invalid <tu> elements are skipped and one or more warning messages are issued. If you keep this default or ignore the warnings, the processed file may not match the input file. Set this option if you want to be notified immediately of invalid content so that you can correct the file before running it again.
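
The difference between the two behaviors is the usual choice between skip-and-warn and fail-fast. The following sketch uses hypothetical names and takes the validity check as a parameter, because this page does not define what makes a <tu> invalid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; it is not part of Okapi.
final class InvalidTuSketch {

    // Keeps the valid units. An invalid unit is either skipped with a warning
    // (exitOnInvalid == false) or stops processing at once (exitOnInvalid == true).
    static <T> List<T> keepValid(List<T> units, Predicate<T> isValid,
                                 boolean exitOnInvalid, List<String> warnings) {
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < units.size(); i++) {
            T unit = units.get(i);
            if (isValid.test(unit)) {
                kept.add(unit);
                continue;
            }
            String message = "Invalid <tu> at index " + i;
            if (exitOnInvalid) {
                throw new IllegalStateException(message);
            }
            warnings.add(message + " (skipped)");
        }
        return kept;
    }
}
```

With the default behavior the returned list is shorter than the input, which is why the output may no longer match the input file.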

Creates or not a segment for the extracted <tu> — Use this option to specify whether a segment is created for each extracted <tu> entry. The following options are available:

  • Always creates the segment - Creates the segment regardless of the value of the segtype attribute.
  • Never creates the segment - Never creates the segment, even if the segtype attribute is set to "sentence".
  • Creates the segment if segtype is 'sentence' or is undefined - Creates the segment when the segtype attribute is set to "sentence" or when it is not defined.
  • Creates the segment only if segtype is 'sentence' - Creates the segment only if the segtype attribute is set to "sentence".
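
The four choices above amount to a small decision table over the segtype value. The sketch below encodes it with hypothetical enum names (these are not Okapi identifiers) and represents an undefined segtype as null.

```java
// Hypothetical helper for illustration only; it is not part of Okapi.
final class SegmentRuleSketch {

    // One constant per option listed above; the names are invented for this sketch.
    enum Mode { ALWAYS, NEVER, SENTENCE_OR_UNDEFINED, SENTENCE_ONLY }

    // segtype: value of the segtype attribute, or null when it is not defined.
    static boolean createSegment(Mode mode, String segtype) {
        boolean isSentence = "sentence".equals(segtype);
        switch (mode) {
            case ALWAYS:
                return true;
            case NEVER:
                return false;
            case SENTENCE_OR_UNDEFINED:
                return isSentence || segtype == null;
            case SENTENCE_ONLY:
                return isSentence;
            default:
                throw new AssertionError("Unknown mode: " + mode);
        }
    }
}
```

For example, a <tu> with segtype "paragraph" yields a segment only under ALWAYS, while a <tu> with no segtype yields one under ALWAYS and SENTENCE_OR_UNDEFINED.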

Escape the greater-than characters — Set this option to have all greater-than characters ('>') escaped as "&gt;" in the output.
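
As a rough illustration of what this option changes, the sketch below escapes text for XML content. The class is hypothetical and the escaping of '&' and '<' is included only because well-formed XML always requires it; the option itself affects only '>'.

```java
// Hypothetical helper for illustration only; it is not part of Okapi.
final class GtEscapeSketch {

    // Escapes text for XML element content. '>' is escaped only when escapeGt is true.
    static String escape(String text, boolean escapeGt) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append(escapeGt ? "&gt;" : ">");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a > b", true));  // prints a &gt; b
        System.out.println(escape("a > b", false)); // prints a > b
    }
}
```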

Duplicate property value separator string — This string is used to separate duplicate property values. The default is ", " (a comma followed by a space).
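
The sketch below shows one plausible reading of this option: when the same property name occurs more than once in an entry, the values are joined into one string with the separator. That interpretation, and all names in the code, are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of Okapi.
final class DuplicatePropSketch {

    // props: {name, value} pairs in document order.
    // Values that share a name are concatenated with the separator.
    static Map<String, String> merge(String[][] props, String separator) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String[] prop : props) {
            merged.merge(prop[0], prop[1], (oldValue, newValue) -> oldValue + separator + newValue);
        }
        return merged;
    }

    public static void main(String[] args) {
        String[][] props = { {"client", "A"}, {"domain", "legal"}, {"client", "B"} };
        System.out.println(merge(props, ", ").get("client")); // prints A, B
    }
}
```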


Limitations

  • The <sub> element is not supported. When such an element is found, a warning is issued and its content is merged into the content of its parent element.
  • The filter cannot reconstruct DTD declarations.