Moses Text Filter
The following is an example of a Moses InlineText file. It contains translatable text and inline codes.
Text in the first entry.
Text of the second entry<lb/>which spans<lb/>several lines
Third entry.
Fourth entry with <g id="1">bold words</g> and some code:<x id="2"/>
The example above has four lines, read as four different text units.
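As an illustration only, the following Python sketch reads the example above line by line. It is not the filter's own code: the function name `read_entries`, the dictionary layout, and the set of tag names in the regular expression are assumptions based on what appears on this page.

```python
import re

# Matches inline-code tags such as <g id="1">, </g> and <x id="2"/>.
# The tag names (g, x, bx, ex) are taken from this page and are an assumption,
# not an exhaustive definition of the format.
INLINE_CODE = re.compile(r'</?(?:g|x|bx|ex)(?:\s+id="[^"]*")?\s*/?>')

def read_entries(text):
    """Return one entry per line, with <lb/> turned into a real line break."""
    entries = []
    for line in text.splitlines():
        entries.append({
            "text": line.replace("<lb/>", "\n"),   # restore in-entry line breaks
            "codes": INLINE_CODE.findall(line),    # inline-code tags, in order
        })
    return entries

sample = (
    'Text in the first entry.\n'
    'Text of the second entry<lb/>which spans<lb/>several lines\n'
    'Third entry.\n'
    'Fourth entry with <g id="1">bold words</g> and some code:<x id="2"/>\n'
)
entries = read_entries(sample)
```

Here `entries` holds four items, one per line of the file, and the second item's text spans three lines once `<lb/>` has been expanded.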
Inline codes are represented by elements such as <bx id="N"> and <ex id="N">, where N is the identifier of the code. The example above also uses <g id="N"> and <x id="N"/>.
Line-breaks are represented by <lb/>.
The filter decides which encoding to use for the input document using the following logic:
- If the document has a BOM, it is used to determine the encoding.
- Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
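The rule above can be sketched in Python as follows. This is a hedged illustration rather than the filter's implementation: the function name `decode_input`, its `requested_encoding` parameter, and the exact list of BOMs that are recognized are assumptions.

```python
import codecs

# BOMs to look for. The UTF-32 marks are checked before the UTF-16 ones
# because the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
# Which BOMs the real filter recognizes is an assumption.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_input(data: bytes, requested_encoding: str = "windows-1252") -> str:
    """Decode a document; requested_encoding is deliberately ignored."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # A BOM is present: it decides the encoding, and is stripped.
            return data[len(bom):].decode(encoding)
    # No BOM: fall back to UTF-8, whatever encoding the caller asked for.
    return data.decode("utf-8")
```

For instance, `decode_input("é".encode("utf-8"), "latin-1")` returns `"é"`, because the requested encoding has no effect when no BOM is found.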
The output encoding of the file is always forced to UTF-8.
The output uses the same type of line-break as the original input.
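A minimal sketch of these two output rules, assuming the input has already been decoded to a string with its line-breaks left untouched. The helper names `detect_line_break` and `encode_output` are hypothetical, and how the actual filter detects the line-break type is not described here.

```python
def detect_line_break(text: str) -> str:
    # Prefer CRLF if it occurs, then a lone CR, otherwise assume LF.
    if "\r\n" in text:
        return "\r\n"
    if "\r" in text:
        return "\r"
    return "\n"

def encode_output(entries, line_break: str) -> bytes:
    # One entry per line, terminated with the input's line-break type,
    # and always encoded as UTF-8.
    return "".join(entry + line_break for entry in entries).encode("utf-8")

source = "First entry.\r\nSecond entry.\r\n"
line_break = detect_line_break(source)
output = encode_output(source.splitlines(), line_break)
```

With this input, `output` keeps the CRLF line-breaks of `source` and is a UTF-8 byte string.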
Original Document and Moses InlineText File
It is important to understand that reading a document and reading its corresponding extracted Moses file may give you different text units. The reason is that each segment (or unsegmented text unit) of the original document is extracted as a single entry in the Moses file. When a text unit of the original document contains several segments, several entries are generated.
Because the Moses file does not have any means to mark that several entries belong to the same text unit, when you read the Moses file you will get more text units than there are in the original document.
To know exactly to which original text unit a Moses file entry corresponds, you have to process both the original file and its corresponding Moses file. This is done, for example, in the Moses InlineText Leveraging Step where the Moses file's entries are re-grouped into their original text units.
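The re-grouping idea can be sketched as follows. This is not the code of the Moses InlineText Leveraging Step: the function `regroup` and the `segment_counts` list are hypothetical, and the sketch assumes that the entries appear in the Moses file in the same order as the segments of the original document.

```python
def regroup(entries, segment_counts):
    """Group Moses entries back into their original text units.

    segment_counts[i] is the number of entries produced by the i-th text unit
    of the original document (1 for an unsegmented text unit). It has to be
    obtained by reading the original document alongside the Moses file.
    """
    if sum(segment_counts) != len(entries):
        raise ValueError("entry count does not match the original document")
    units = []
    start = 0
    for count in segment_counts:
        units.append(entries[start:start + count])  # entries of one text unit
        start += count
    return units

moses_entries = ["First sentence.", "Second sentence.", "A title"]
units = regroup(moses_entries, [2, 1])
```

In this example the original document has two text units: the first was segmented into two sentences, so it accounts for two Moses entries, while the second accounts for one.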
This filter has no parameters.