XLIFF Filter
Overview
The XLIFF Filter is an Okapi component that implements the IFilter interface for XLIFF 1.2 (XML Localisation Interchange File Format) documents. The filter is implemented in the class net.sf.okapi.filters.xliff.XLIFFFilter of the Okapi library.
XLIFF is an OASIS Standard that defines a file format for transporting translatable text and localization-related information across a chain of translation and localization tools.
The XLIFF 1.2 specification is at http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html.
Processing Details
Input Encoding
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration, that encoding is used.
- Otherwise, UTF-8 is used as the default encoding (regardless of the default encoding that was specified when opening the document).
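The two rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the filter's Java implementation: the function name and the `caller_default` parameter are hypothetical, and the sketch assumes the first bytes of the document are in an ASCII-compatible encoding so that the declaration can be read directly.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration, for example:
# <?xml version="1.0" encoding="ISO-8859-1"?>
_XML_DECL_ENCODING = re.compile(
    rb'^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def choose_input_encoding(head: bytes, caller_default: str = "windows-1252") -> str:
    """Pick the input encoding from the first bytes of an XLIFF document.

    caller_default stands for the default encoding given when the document
    was opened. It is accepted only to show that it is deliberately ignored.
    """
    # Skip a UTF-8 Byte-Order-Mark so the declaration is at the start.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _XML_DECL_ENCODING.match(head)
    if match:
        # Rule 1: a declared encoding wins.
        return match.group(1).decode("ascii")
    # Rule 2: no declaration, so UTF-8 is used whatever the caller passed.
    return "UTF-8"
```

For instance, a document starting with `<?xml version="1.0" encoding="ISO-8859-1"?>` yields `ISO-8859-1`, while a document with no declaration yields `UTF-8` even if another default was supplied.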
Output Encoding
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
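As a rough illustration, the Byte-Order-Mark rule for UTF-8 output reduces to a single boolean expression. The function name and the way encoding names are compared are assumptions made for this Python sketch. It covers only the case where the output encoding is UTF-8, since that is the only case described here.

```python
def write_utf8_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document starts with a Byte-Order-Mark."""
    # Compare encoding names loosely: "UTF-8", "utf8" and "utf_8" are equal.
    normalized = input_encoding.replace("-", "").replace("_", "").lower()
    input_is_utf8 = normalized == "utf8"
    # A BOM is written only when the input was UTF-8 and had one itself.
    return input_is_utf8 and input_had_bom
```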
Line-Breaks
The output uses the same type of line-break as the original input.
White Spaces
If a <trans-unit> element has an xml:space="preserve" attribute, the white spaces inside the content of its source and target are left as is. If xml:space is not present, or has a value different from "preserve", the content of the source and target is unwrapped.
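The effect of xml:space can be pictured with the following Python sketch. It assumes that "unwrapped" means each run of spaces, tabs and line-breaks is collapsed to a single space and that leading and trailing white space is trimmed. That reading is an assumption of the sketch, and the function name is hypothetical.

```python
import re
from typing import Optional


def read_content(text: str, xml_space: Optional[str]) -> str:
    """Return the content of a source or target as the filter would keep it."""
    if xml_space == "preserve":
        # White spaces are left exactly as they are in the document.
        return text
    # Attribute absent or set to another value: unwrap the content.
    return re.sub(r"[ \t\r\n]+", " ", text).strip()
```

With this sketch, the content `"  Hello\n   world "` is kept unchanged under xml:space="preserve" and becomes `"Hello world"` otherwise.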
Mapping
The entries of the document are mapped as follows:
XLIFF Document | Resource |
The approved attribute in <trans-unit>. | The approved property of the target in the text unit. |
The <note> elements. | |
The <alt-trans> element that has its alttranstype attribute set to "proposal" (or has no alttranstype defined). | The AltTranslationsAnnotation annotation. If the element has a mid attribute the annotation is assigned to the corresponding target segment, otherwise it is assigned to the target container. Once mapped to Okapi annotations, the list of entries is sorted based on match types and score. On output, new |
The <source> element. | The source text of the text unit. |
The <target> element. | The target text of the text unit. |
The resname attribute. (This may also be the id attribute if the option is set) | The name of the text unit. |
The restype attribute. | The type of the text unit. |
The coord attribute. | The coordinates property of the text unit. |
The target-language attribute. | The targetLanguage property of the sub-document for the given <file>. |
The maxbytes attribute. | The its-storageSize property of the text unit. If the its:storageSizeEncoding attribute is present, its value is used for the encoding. Otherwise UTF-8 is the default encoding used to compute the byte length. Note that its:storageSize is not recognized. The size must be declared using maxbytes. |
The <seg-source> element. | Segmentation of the text unit. |
ITS and ITSXLF annotations. | See the ITS Components page for more details. |
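To make the maxbytes row more concrete, here is a small Python sketch of how a storage size limit relates to the chosen encoding. It only illustrates the byte-length computation with UTF-8 as the default. The function name is hypothetical, and this page does not say where in a pipeline such a check is actually performed.

```python
from typing import Optional


def fits_storage_size(text: str, maxbytes: int,
                      storage_encoding: Optional[str] = None) -> bool:
    """Tell whether text fits in maxbytes once encoded.

    storage_encoding plays the role of its:storageSizeEncoding.
    When it is absent, UTF-8 is used to compute the byte length.
    """
    encoding = storage_encoding or "UTF-8"
    return len(text.encode(encoding)) <= maxbytes
```

The word "café" needs 5 bytes in UTF-8 but only 4 in ISO-8859-1, so the same maxbytes value can accept or reject a translation depending on the declared encoding.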
Parameters
Use the trans-unit id attribute for the text unit name if there is no resname — Select this option to use the value of the id attribute of the <trans-unit> element as a fall-back value if resname is not present. This may be useful for XLIFF documents that use resname-like values for id but do not bother providing resname (as they should).
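The fall-back can be sketched as follows in Python, using the standard ElementTree parser on a <trans-unit> element without namespace for brevity. The function and parameter names are hypothetical and only mirror the option described above.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def text_unit_name(trans_unit: ET.Element, use_id_fallback: bool) -> Optional[str]:
    """Return the name to give to the text unit, or None if there is none."""
    resname = trans_unit.get("resname")
    if resname is not None:
        # resname always takes precedence when it is present.
        return resname
    if use_id_fallback:
        # Option set: fall back to the id attribute.
        return trans_unit.get("id")
    return None
```

For `<trans-unit id="btn.ok">` the name is `btn.ok` only when the option is set. When resname is present, the option makes no difference.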
Ignore the segmentation information in the input — Set this option to ignore any segmentation information contained in the input XLIFF. When this option is set, all segmented content is merged into new unsegmented content when extracted. Note that any <alt-trans> data attached to a given segment is also lost.
Escape the greater-than characters — Set this option to have all greater-than characters ('>') escaped as "&gt;" in the output.
Add the target-language attribute if not present — Set this option to add the target-language attribute in <file> if it is not present.
Override the target language of the XLIFF document — Set this option to override the language of the target set in the input XLIFF. When this option is set, the value of the target-language attribute and the value of xml:lang in all the <target> elements are set to the target language specified by the user, regardless of their original values.
This is useful when using an XLIFF document as a template for several outputs in different target languages.
Note that depending on what the original XLIFF document contains, this option may result in outputs where for example existing <alt-trans> elements do not correspond to the target language of a given <trans-unit> anymore.
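The effect of the override option, including the <alt-trans> caveat, can be illustrated with the Python sketch below. It is not the filter's code. It assumes that only the <target> that is a direct child of a <trans-unit> is updated, that xml:lang is set on that element even when it was absent, and that <alt-trans> content is left untouched, which is why it can end up out of sync. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def override_target_language(root: ET.Element, new_lang: str) -> None:
    """Force the target language of an XLIFF 1.2 tree, in place."""
    for file_el in root.iter(f"{{{XLIFF_NS}}}file"):
        file_el.set("target-language", new_lang)
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        # find() with a plain tag looks at direct children only, so the
        # <target> elements nested in <alt-trans> are not modified.
        target = unit.find(f"{{{XLIFF_NS}}}target")
        if target is not None:
            target.set(XML_LANG, new_lang)
```

After calling the function with "de" on a French document, each <file> and each <trans-unit> target is marked as German, while a proposal stored in <alt-trans> still carries xml:lang="fr".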
Allow empty <target> elements in XLIFF document — Set this option to prevent copying of source text into an empty target.
Type of output segmentation — Select one of the following types of segmentation representation to use for the output:
- Segment only if the input text unit is segmented: Each text unit in the output is represented with a <seg-source> element only if the original text unit was already represented like this in the input file.
- Always segment (even if the input text unit is not segmented): Each text unit in the output is represented with a <seg-source> element, even if the original text unit was not segmented, and even if the whole content of the text unit is made of a single segment.
- Never segment (even if the input text unit is segmented): None of the text units in the output is segmented, even if they were in the input file. All <seg-source> elements are removed.
- Segment only if the entry is segmented and regardless of how the input was: Only text units made of more than one part are segmented, regardless of whether or not they were segmented in the input.
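The four choices amount to a small decision function. The Python sketch below is only an illustration: the enum, its member names and the function are hypothetical and do not claim to match the names used in the filter's code.

```python
from enum import Enum


class OutputSegmentation(Enum):
    AS_IN_INPUT = 1      # Segment only if the input text unit is segmented
    ALWAYS = 2           # Always segment
    NEVER = 3            # Never segment
    IF_MULTI_PART = 4    # Segment only if the entry has more than one part


def writes_seg_source(mode: OutputSegmentation,
                      input_was_segmented: bool,
                      part_count: int) -> bool:
    """Tell whether a text unit is written with a <seg-source> element."""
    if mode is OutputSegmentation.AS_IN_INPUT:
        return input_was_segmented
    if mode is OutputSegmentation.ALWAYS:
        return True
    if mode is OutputSegmentation.NEVER:
        return False
    # IF_MULTI_PART: the state of the input is ignored.
    return part_count > 1
```

For example, a text unit that was segmented in the input but holds a single part is written with <seg-source> under the first two choices and without it under the last two.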
Allow addition of new <alt-trans> elements — Set this option to allow the addition of new <alt-trans> elements in the output. For example, when this option is set and the Leveraging Step is applied to a given XLIFF input document, its output includes the translation matches possibly found during leveraging.
Include extra information — Set this option to include non-standard information (e.g. match types) in the added <alt-trans> elements.
Use the <g> notation in new <alt-trans> elements — Set this option to use the <g>/<x> notation for inline codes (instead of the <bpt>/<ept>/<ph> notation) in the added <alt-trans> elements.
Allow modification of existing <alt-trans> elements — Set this option to allow <alt-trans> elements that exist in the input to be modified in the output. The existing entries will be treated like added ones.
Use a custom XML stream parser — Set this option to use an XML stream parser different from the default one. The default parser for this filter is the Woodstox XML parser. The reason for this is that the Java parser that comes with most VMs implements the collapsing of whitespace characters using a recursive method that can cause a stack overflow error in XLIFF documents with large chunks of element content (e.g. in SDLXLIFF files).
Factory class for the custom XML stream parser — Enter the name of the factory class that will instantiate the custom XML stream parser you want to use. For example: com.ctc.wstx.stax.WstxInputFactory.
Limitations
- The content of the <sub> element is currently not supported as text. Any element found inside a <bpt>, <ept>, <ph>, or <it> (including <sub>) is included in the code of the parent inline element. A warning is generated when a <sub> element is detected.
- The special marker <mrk mtype='protected'> is supported by converting the content into an inline code. Note that if a marker <mrk mtype='x-its-translate-yes'> is used within such a marker, it is not supported and its content is placed into the original data part of the inline code.
- The pre-defined configuration okf_xliff-sdl provides additional support for SDLXLIFF files. One notable limitation currently is that the SDL properties locked, conf and origin are stored in the TextContainer of the target rather than in each segment, and if there are several segments, the values are those of the last segment. From M35 on, the three properties are also stored at the segment level with the correct values, but those segment-level properties are read-only.