XLIFF Filter

Overview

The XLIFF Filter is an Okapi component that implements the IFilter interface for XLIFF 1.2 (XML Localisation Interchange File Format) documents. It is implemented in the class net.sf.okapi.filters.xliff.XLIFFFilter of the Okapi library.

XLIFF is an OASIS Standard that defines a file format for transporting translatable text and localization-related information across a chain of translation and localization tools.

The XLIFF 1.2 specification is at http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, that encoding is used.
  • Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
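The two rules above can be illustrated with a minimal sketch that uses only the standard StAX API of the JDK. It is not the filter's actual implementation: the class InputEncodingSketch and the method resolveEncoding are hypothetical names introduced for this example, and it assumes the parser reports the declared encoding through XMLStreamReader.getCharacterEncodingScheme().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of the Okapi API.
final class InputEncodingSketch {

    /** Returns the declared encoding, or "UTF-8" when the document declares none. */
    static String resolveEncoding(InputStream xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        try {
            // Null when there is no XML declaration or no encoding pseudo-attribute.
            String declared = reader.getCharacterEncodingScheme();
            return declared != null ? declared : "UTF-8";
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String withDecl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><xliff version=\"1.2\"/>";
        String withoutDecl = "<xliff version=\"1.2\"/>";
        System.out.println(resolveEncoding(
            new ByteArrayInputStream(withDecl.getBytes(StandardCharsets.ISO_8859_1))));
        System.out.println(resolveEncoding(
            new ByteArrayInputStream(withoutDecl.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Note that the fallback is always UTF-8 in this sketch, mirroring the rule that any default encoding supplied by the caller is not taken into account.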

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
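As a rough illustration of the Byte-Order-Mark rule for UTF-8 output, the following sketch decides whether to prepend the three BOM bytes. The class and method names are hypothetical and only the JDK is used; how the filter itself tracks the detected BOM is not described here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Okapi API.
final class Utf8BomSketch {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** A BOM is written only when the input was UTF-8 and carried a BOM itself. */
    static boolean shouldWriteBom(Charset inputEncoding, boolean inputHadBom) {
        return StandardCharsets.UTF_8.equals(inputEncoding) && inputHadBom;
    }

    /** Encodes text as UTF-8 output, prepending the BOM when the rule above allows it. */
    static byte[] encodeUtf8(String text, Charset inputEncoding, boolean inputHadBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (shouldWriteBom(inputEncoding, inputHadBom)) {
            out.write(UTF8_BOM, 0, UTF8_BOM.length);
        }
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // UTF-8 input with a BOM: 3 BOM bytes + 1 byte for "a".
        System.out.println(encodeUtf8("a", StandardCharsets.UTF_8, true).length);   // prints 4
        // UTF-16 input: no BOM in the UTF-8 output, even if the input had one.
        System.out.println(encodeUtf8("a", StandardCharsets.UTF_16, true).length);  // prints 1
    }
}
```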

Line-Breaks

The output uses the same type of line-break as the original input.

White Spaces

If a <trans-unit> element has an xml:space="preserve" attribute, the white spaces inside the content of its source and target are left as is. If xml:space is not present, or has a value other than "preserve", the content of the source and target is unwrapped.
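The sketch below shows one plausible reading of this rule. It assumes that "unwrapped" means each run of spaces, tabs, and line-breaks is collapsed into a single space and that leading and trailing white space is removed; the exact set of characters the filter treats as white space may differ. The names WhitespaceSketch, unwrap, and normalize are hypothetical.

```java
// Hypothetical helper for illustration only; not part of the Okapi API.
final class WhitespaceSketch {

    /** Collapses runs of space, tab, CR and LF into one space, then drops edge spaces. */
    static String unwrap(String text) {
        return text.replaceAll("[ \\t\\r\\n]+", " ").replaceAll("^ | $", "");
    }

    /** Leaves content untouched only when xml:space is exactly "preserve". */
    static String normalize(String content, String xmlSpace) {
        return "preserve".equals(xmlSpace) ? content : unwrap(content);
    }

    public static void main(String[] args) {
        String content = "  Hello\n   world \t";
        System.out.println("[" + normalize(content, null) + "]");      // prints [Hello world]
        System.out.println("[" + normalize(content, "default") + "]"); // prints [Hello world]
        System.out.println(normalize(content, "preserve").equals(content)); // prints true
    }
}
```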

Mapping

The entries of the document are mapped as follows:

XLIFF Document → Resource
The approved attribute in <trans-unit> → The approved property of the target in the text unit.
The <note> elements →
  • The note property of the source if the annotates attribute is "source".
  • The note property of the target if the annotates attribute is "target".
  • The note property of the text unit in all other cases.
The <alt-trans> element that has its alttranstype attribute set to "proposal" (or has no alttranstype defined) → The AltTranslationsAnnotation annotation. If the element has a mid attribute, the annotation is assigned to the corresponding target segment; otherwise it is assigned to the target container. Once mapped to Okapi annotations, the list of entries is sorted by match type and score.
  • The value of the match-quality attribute is used as the annotation's score if it is an integer or a percentage; otherwise it is ignored and the score is set to 0.
  • The value of the origin attribute is used as the annotation's origin. If there is no origin attribute, the annotation's origin is set to "SourceDoc".
  • The Okapi extension attribute matchType is used as the annotation's match type if present. Otherwise, if there is a score, the match type is set to EXACT for a score above 99, or to FUZZY for a score above 0.

On output, new <alt-trans> elements can be added if the option Allow addition of new <alt-trans> elements is set.

The <source> element → The source text of the text unit.
The <target> element → The target text of the text unit.
The resname attribute (or the id attribute, if the corresponding option is set) → The name of the text unit.
The restype attribute → The type of the text unit.
The coord attribute → The coordinates property of the text unit.
The target-language attribute → The targetLanguage property of the sub-document for the given <file>.
The maxbytes attribute → The its-storageSize property of the text unit. If the its:storageSizeEncoding attribute is present, its value is used for the encoding; otherwise UTF-8 is the default encoding used to compute the byte length. Note that its:storageSize is not recognized: the size must be declared using maxbytes.
The <seg-source> element → Segmentation of the text unit.
  • If the content of <seg-source> without the segment markers does not match the content of the <source> element, the segmentation is not carried over into the resource.
  • If the content of <seg-source> has no explicit <mrk mtype='seg'> element, it is treated as if the whole content were a single segment.
ITS and ITSXLF annotations → See the ITS Components page for more details.
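The score and match-type rules given for <alt-trans> in the mapping can be sketched as follows. This is an illustration under stated assumptions, not the filter's code: the class and method names are hypothetical, match types are represented as plain strings, a percentage is assumed to be an integer followed by "%", and the case of a score of 0 with no matchType attribute is left undetermined because the mapping does not specify it.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the Okapi API.
final class AltTransScoreSketch {

    /** Parses match-quality as an integer or a percentage; anything else yields 0. */
    static int parseScore(String matchQuality) {
        if (matchQuality == null) {
            return 0;
        }
        String value = matchQuality.trim();
        if (value.endsWith("%")) {
            value = value.substring(0, value.length() - 1).trim();
        }
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return 0; // neither an integer nor a percentage: the value is ignored
        }
    }

    /** Uses the explicit matchType if present, else derives one from the score. */
    static Optional<String> matchType(String explicitMatchType, int score) {
        if (explicitMatchType != null) {
            return Optional.of(explicitMatchType);
        }
        if (score > 99) {
            return Optional.of("EXACT");
        }
        if (score > 0) {
            return Optional.of("FUZZY");
        }
        return Optional.empty(); // not specified in the mapping; left undetermined here
    }

    public static void main(String[] args) {
        System.out.println(parseScore("85%"));                          // prints 85
        System.out.println(parseScore("high"));                         // prints 0
        System.out.println(matchType(null, parseScore("100")).get());   // prints EXACT
        System.out.println(matchType(null, parseScore("85%")).get());   // prints FUZZY
    }
}
```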

Parameters

Use the trans-unit id attribute for the text unit name if there is no resname — Select this option to use the value of the id attribute of the <trans-unit> element as a fall-back value if resname is not present. This may be useful for XLIFF documents that use resname-like values for id but do not bother providing resname (as they should).

Ignore the segmentation information in the input — Set this option to ignore any segmentation information contained in the input XLIFF. When this option is set, all segmented content is merged into unsegmented content when extracted. Note that any <alt-trans> data attached to a given segment is also lost.

Escape the greater-than characters — Set this option to have all greater-than characters ('>') escaped as "&gt;" in the output.

Add the target-language attribute if not present — Set this option to add the target-language attribute in <file> if it is not present.

Override the target language of the XLIFF document — Set this option to override the language of the target set in the input XLIFF. When this option is set, the value of the target-language attribute and the value of xml:lang in all the <target> elements are set to the target language specified by the user, regardless of their original values. This is useful when using an XLIFF document as a template for several outputs in different target languages.

Note that, depending on what the original XLIFF document contains, this option may produce outputs where, for example, existing <alt-trans> elements no longer correspond to the target language of a given <trans-unit>.

Allow empty <target> elements in XLIFF document — Set this option to prevent copying of source text into an empty target.

Type of output segmentation — Select one of the following types of segmentation representation to use for the output:

  • Segment only if the input text unit is segmented: Each text unit in the output is represented with a <seg-source> element only if the original text unit was already represented this way in the input file.
  • Always segment (even if the input text unit is not segmented): Each text unit in the output is represented with a <seg-source> element, even if the original text unit was not segmented, and even if the whole content of the text unit is made of a single segment.
  • Never segment (even if the input text unit is segmented): None of the text units in the output is segmented, even if they were in the input file. All <seg-source> elements are removed.
  • Segment only if the entry is segmented and regardless how the input was: Only text units made of more than one part are segmented, regardless of whether they were segmented in the input.
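The four choices above amount to a small decision rule on whether a <seg-source> element is written for a text unit. The sketch below restates it in code. The enum constants and the method name are hypothetical labels chosen for this example, not the filter's actual parameter values, and "more than one part" is modeled with a simple part count.

```java
// Hypothetical helper for illustration only; not part of the Okapi API.
final class OutputSegmentationSketch {

    /** Hypothetical labels for the four options, in the order listed above. */
    enum Mode { AS_INPUT, ALWAYS, NEVER, IF_MULTIPLE_PARTS }

    /** Returns true when the text unit should be written with a <seg-source> element. */
    static boolean writeSegSource(Mode mode, boolean inputWasSegmented, int partCount) {
        switch (mode) {
            case AS_INPUT:
                return inputWasSegmented;   // mirror the input representation
            case ALWAYS:
                return true;                // even for a single segment
            case NEVER:
                return false;               // all <seg-source> elements are dropped
            case IF_MULTIPLE_PARTS:
                return partCount > 1;       // independent of the input representation
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // A text unit that was segmented in the input but holds a single part.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + writeSegSource(mode, true, 1));
        }
    }
}
```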

Allow addition of new <alt-trans> elements — Set this option to allow the addition of new <alt-trans> elements in the output. For example, when this option is set and the Leveraging Step is applied to a given XLIFF input document, its output includes the translation matches possibly found during leveraging.

Include extra information — Set this option to include non-standard information (e.g. match types) in the added <alt-trans> elements.

Use the <g> notation in new <alt-trans> elements — Set this option to use the <g>/<x> notation for inline codes (instead of the <bpt>/<ept>/<ph> notation) in the added <alt-trans> elements.

Allow modification of existing <alt-trans> elements — Set this option to allow <alt-trans> elements that exist in the input to be modified in the output. The existing entries will be treated like added ones.

Use a custom XML stream parser — Set this option to use an XML stream parser different from the default one. The default parser for this filter is the Woodstox XML parser. It is the default because the Java parser that comes with most VMs collapses whitespace characters using a recursive method, which can cause a stack overflow error in XLIFF documents with large chunks of element content (e.g. in SDLXLIFF files).

Factory class for the custom XML stream parser — Enter the name of the factory class that will instantiate the custom XML stream parser you want to use. For example: com.ctc.wstx.stax.WstxInputFactory.

Limitations

  • The content of the <sub> element is currently not supported as text. Any element found inside a <bpt>, <ept>, <ph>, or <it> element (including <sub>) is included in the code of the parent inline element. A warning is generated when a <sub> element is detected.
  • The special marker <mrk mtype='protected'> is supported by converting the content into an inline code. Note that a marker <mrk mtype='x-its-translate-yes'> used within such a marker is not supported: its content is placed into the original data part of the inline code.
  • The pre-defined configuration okf_xliff-sdl provides additional support for SDLXLIFF files. One notable limitation currently is that the SDL properties locked, conf and origin are stored in the TextContainer of the target rather than in each segment, and if there are several segments, the values are those of the last segment. From M35 on, the three properties are also stored at the segment level with the correct values, but those segment-level properties are read-only.