The XLIFF Filter is an Okapi component that implements the IFilter interface for XLIFF (XML Localisation Interchange File Format) 1.2 documents. The filter is implemented in the class net.sf.okapi.filters.xliff.XLIFFFilter of the library.
XLIFF is an OASIS Standard that defines a file format for transporting translatable text and localization-related information across a chain of translation and localization tools.
The XLIFF 1.2 specification is at http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html.
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration it is used.
- Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
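The two rules above can be sketched in plain Java. This is only an illustration of the decision, not the filter's actual code: the class and method names are hypothetical, and the sketch assumes an ASCII-compatible input so that the XML declaration can be read byte by byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Okapi.
final class EncodingSketch {
    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern DECLARATION = Pattern.compile(
        "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the declared encoding if there is one, otherwise UTF-8.
     * The caller-supplied default is deliberately ignored.
     */
    static Charset chooseEncoding(byte[] head, Charset callerDefault) {
        // ISO-8859-1 maps every byte to exactly one char, so the
        // declaration can be inspected before the real charset is known.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(prolog);
        if (m.find()) {
            // Throws if the declared name is not supported by the JVM.
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }
}
```

Passing the first few hundred bytes of the document as `head` is enough, since the declaration must come at the very start of the file.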
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
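As a minimal sketch, the Byte-Order-Mark rule for UTF-8 output reduces to a single boolean expression. The names below are hypothetical. The text above only covers UTF-8 output, so the sketch refuses any other output encoding rather than guessing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Okapi.
final class BomSketch {
    /** Decides whether a UTF-8 output document should start with a BOM. */
    static boolean writeBom(Charset input, boolean inputHadBom, Charset output) {
        if (!StandardCharsets.UTF_8.equals(output)) {
            // Behaviour for other output encodings is not described here.
            throw new IllegalArgumentException("rule only covers UTF-8 output");
        }
        // A BOM is written only when the input was UTF-8 and had one itself.
        return StandardCharsets.UTF_8.equals(input) && inputHadBom;
    }
}
```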
The output uses the same type of line-break as the original input.
If the <trans-unit> element has an xml:space="preserve" attribute, the white space inside the content of its source and target is left as is. If xml:space is not present, or has a value different from "preserve", the content of the source and target is unwrapped.
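The following sketch illustrates the white space rule. It assumes that "unwrapped" means collapsing each run of XML white space into a single space and trimming both ends. The filter's exact unwrapping behaviour may differ, and the names are hypothetical.

```java
// Hypothetical helper, not part of Okapi.
final class WhitespaceSketch {
    /**
     * Returns the content as extracted for a given xml:space value.
     * xmlSpace is null when the attribute is absent.
     */
    static String extractContent(String raw, String xmlSpace) {
        if ("preserve".equals(xmlSpace)) {
            return raw; // left untouched
        }
        // Assumed meaning of "unwrapped": collapse white space runs and trim.
        return raw.replaceAll("[ \\t\\r\\n]+", " ").trim();
    }
}
```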
The entries of the document are mapped as follows:
||The approved property of the target in the text unit.|
|| The |
On output, new
||The source text of the text unit.|
||The target text of the text unit.|
||The name of the text unit.|
||The type of the text unit.|
||The coordinates property of the text unit.|
|| The targetLanguage property of the sub-document for the given |
|| The its-storageSize property of the text unit. If the |
|| Segmentation of the text unit.
|ITS and ITSXLF annotations.||See the ITS Components page for more details.|
Use the trans-unit id attribute for the text unit name if there is no resname — Select this option to use the value of the
id attribute of the <trans-unit> element as a fallback value if resname is not present. This may be useful for XLIFF documents that use resname-like values for id but do not provide resname (as they should).
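As a rough sketch, the fallback can be written as a small function. The names are hypothetical and the attribute values are passed as plain strings, with null standing for an absent attribute.

```java
// Hypothetical helper, not part of Okapi.
final class TextUnitNameSketch {
    /**
     * Picks the text unit name from the trans-unit attributes.
     * Returns null when no name can be assigned.
     */
    static String textUnitName(String resname, String id, boolean fallbackToId) {
        if (resname != null) {
            return resname; // resname always wins when present
        }
        // id is used only when the option is selected.
        return fallbackToId ? id : null;
    }
}
```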
Ignore the segmentation information in the input — Set this option to ignore any segmentation information contained in the input XLIFF. When this option is set, all segmented content is reduced to unsegmented content when extracted. Note that any <alt-trans> data attached to a given segment is also lost.
Escape the greater-than characters — Set this option to have all greater-than characters ('>') escaped as "&gt;" in the output.
Add the target-language attribute if not present — Set this option to add the
target-language attribute to the <file> element if it is not present.
Override the target language of the XLIFF document — Set this option to override the language of the target set in the input XLIFF. When this option is set, the value of the
target-language attribute and the value of xml:lang in all the <target> elements are set to the target language specified by the user, regardless of their original values.
This is useful when using an XLIFF document as a template for several outputs in different target languages.
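The effect of this option can be approximated with the JDK's DOM API. This is a sketch, not the filter's implementation: the class and method names are hypothetical, and it assumes that only attributes already present are overwritten.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Okapi.
final class OverrideTargetLanguageSketch {
    static final String XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2";

    /** Parses an XLIFF 1.2 string and overwrites its target language values. */
    static Document override(String xliff, String lang) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xliff)));

            // target-language on every <file> element that declares one.
            NodeList files = doc.getElementsByTagNameNS(XLIFF_NS, "file");
            for (int i = 0; i < files.getLength(); i++) {
                Element file = (Element) files.item(i);
                if (file.hasAttribute("target-language")) {
                    file.setAttribute("target-language", lang);
                }
            }
            // xml:lang on every <target> element that declares one.
            NodeList targets = doc.getElementsByTagNameNS(XLIFF_NS, "target");
            for (int i = 0; i < targets.getLength(); i++) {
                Element target = (Element) targets.item(i);
                if (target.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    target.setAttributeNS(XMLConstants.XML_NS_URI, "xml:lang", lang);
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not process XLIFF input", e);
        }
    }
}
```

Because every <target> element in the namespace is visited, targets nested in other elements are rewritten as well, while everything else in the document is left as it was.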
Note that depending on what the original XLIFF document contains, this option may result in outputs where for example existing
<alt-trans> elements do not correspond to the target language of a given
Allow empty <target> elements in XLIFF document — Set this option to prevent copying of source text into an empty target.
Type of output segmentation — Select one of the following types of segmentation representation to use for the output:
- Segment only if the input text unit is segmented: Each text unit in the output is represented with a <seg-source> element only if the original text unit was already represented like this in the input file.
- Always segment (even if the input text unit is not segmented): Each text unit in the output is represented with a <seg-source> element, even if the original text unit was not segmented, and even if the whole content of the text unit is made of a single segment.
- Never segment (even if the input text unit is segmented): None of the text units in the output are segmented, even if they were in the input file. All <seg-source> elements are removed.
- Segment only if the entry is segmented and regardless how the input was: Only text units made of more than one part are segmented, regardless of whether they were segmented in the input.
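The four choices above amount to a small decision table, sketched here in Java. The enum constants and method name are hypothetical labels for the four options, not identifiers from the filter.

```java
// Hypothetical helper, not part of Okapi.
final class OutputSegmentationSketch {
    /** Hypothetical names for the four options, in the order listed above. */
    enum Mode { AS_INPUT, ALWAYS, NEVER, AS_NEEDED }

    /**
     * Decides whether a text unit is written with a seg-source element.
     *
     * @param inputWasSegmented whether the input text unit had a seg-source
     * @param partCount number of parts the text unit is made of on output
     */
    static boolean writeSegSource(Mode mode, boolean inputWasSegmented, int partCount) {
        switch (mode) {
            case AS_INPUT:  return inputWasSegmented;
            case ALWAYS:    return true;
            case NEVER:     return false;
            case AS_NEEDED: return partCount > 1;
            default: throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

The last mode is the only one that looks at the current state of the text unit instead of the input file, which is why it takes the part count into account.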
Allow addition of new <alt-trans> elements — Set this option to allow the addition of new
<alt-trans> elements in the output. For example, when this option is set and the Leveraging Step is applied to a given XLIFF input document, its output includes the translation matches possibly found during leveraging.
Include extra information — Set this option to include non-standard information (e.g. match types) in the added
Use the <g> notation in new <alt-trans> elements — Set this option to use the
<g>/<x> notation for inline codes (instead of the
<bpt>/<ept>/<ph> notation) in the added
Allow modification of existing <alt-trans> elements — Set this option to allow
<alt-trans> elements that exist in the input to be modified in the output. The existing entries will be treated like added ones.
Use a custom XML stream parser — Set this option to use an XML stream parser different from the default one. The default parser for this filter is the Woodstox XML parser. The reason for this is that the Java parser that comes with most VMs implements the collapsing of whitespace characters using a recursive method that can cause a stack overflow error in XLIFF documents with large chunks of element content (e.g. in SDLXLIFF files).
Factory class for the custom XML stream parser — Enter the name of the factory class that will instantiate the custom XML stream parser you want to use. For example:
- The content of the <sub> element is currently not supported as text. Any element found inside an inline element (including <sub>) is included in the code of the parent inline element. A warning is generated when a <sub> element is detected.
- The special marker <mrk mtype='protected'> is supported by converting the content into an inline code. Note that if a marker <mrk mtype='x-its-translate-yes'> is used within such a marker, it is not supported and its content is placed into the original data part of the inline code.
- The pre-defined configuration okf_xliff-sdl provides additional support for SDLXLIFF files. One notable limitation currently is that the SDL properties origin are stored in the TextContainer of the target rather than in each segment, and if there are several segments, the values are the ones of the last segment. From M35 on, the three properties are also stored at the segment level with the correct values, but those segment-level properties are read-only.