Format Conversion Step

From Okapi Framework


This step creates output in a given format using the data extracted from an input document. It lets you convert input documents from one format to another.

Takes: Filter events. Sends: Filter events.

The documents generated by this step are not meant to be merged back into their original format. For that purpose, use the Raw Document to Filter Events Step, then the Filter Events to Raw Document Step to merge the extracted data back.


Output format — Select the type of output file to generate. The following formats are available:

  • Tab-Delimited Table: Output in tab-delimited text format, encoded in UTF-8. The first column is the content of the source and the second (if any) is the content of the target. Literal tab characters are escaped as \t.
  • Parallel Corpus Files: Output is made of two plain text files (inline codes are stripped out), where each entry is written on a separate line. The first file contains the source text, the second the target text. Each file has the code of its corresponding locale appended to its name. Such files are used, for example, in systems like Moses MT.
  • Word Table: Output in a table in RTF format. Inline codes are represented in a generic format. Each entry is written in a separate row. When the input document is a TTX file, the match percentage is displayed in the Type column.
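As an illustration of the Tab-Delimited Table layout described above, the following Python sketch builds one output row. It is not the step's actual implementation: the function name is hypothetical, and only the escaping of literal tabs is taken from the description (how other characters such as line breaks are handled is not covered here).

```python
def to_tab_delimited_row(source, target=None):
    """Build one row: the source column, then the target column if any."""
    def escape(text):
        # A literal tab inside the content is written as the two characters \t,
        # so that it cannot be confused with the column separator.
        return text.replace("\t", "\\t")

    columns = [escape(source)]
    if target is not None:
        columns.append(escape(target))
    return "\t".join(columns)


row = to_tab_delimited_row("Name:\tvalue", "Nom :\tvaleur")
```

When written to a file, such rows would be encoded in UTF-8, one entry per line.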

Output only approved entries — Set this option to output only the entries that have the target property approved set to "yes". For example, "fuzzy" entries in a PO file have this property set to "no". Note that entries without this property are treated as not approved.
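The rule can be pictured with a small Python sketch. The dictionary layout of the entries is an assumption made for the example, not the framework's data model; only the "yes" test and the treatment of a missing property come from the description above.

```python
def is_approved(target_properties):
    """Return True only when the approved property is present and set to "yes"."""
    # A missing property is treated the same way as "no".
    return target_properties.get("approved") == "yes"


# Hypothetical entries: one approved, one fuzzy, one without the property.
entries = [
    {"source": "Open", "target": "Ouvrir", "properties": {"approved": "yes"}},
    {"source": "Close", "target": "Fermer", "properties": {"approved": "no"}},
    {"source": "Save", "target": "Enregistrer", "properties": {}},
]

kept = [e["source"] for e in entries if is_approved(e["properties"])]
```

With the option set, only the first entry would be written to the output.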

Output generic inline codes — Set this option to use generic inline codes in the output document. A generic inline code has the form <N>, </N>, or <N/>, where N is the numeric identifier of the code. If this option is not set, the inline codes are output in their original format (except for references). If the output format is TMX, this option is ignored and the inline codes are in TMX format.
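To make the generic notation concrete, here is a minimal Python sketch that renders a piece of content with generic codes. The representation of the content (plain strings for text, tuples for codes) and the kind names "open", "close" and "placeholder" are assumptions chosen for the example.

```python
def generic_code(code_id, kind):
    """Return the generic form of one inline code."""
    if kind == "open":
        return f"<{code_id}>"
    if kind == "close":
        return f"</{code_id}>"
    if kind == "placeholder":
        return f"<{code_id}/>"
    raise ValueError(f"unknown code kind: {kind}")


def render_generic(parts):
    """Join text parts and codes, writing every code in generic form."""
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)          # plain text is copied as-is
        else:
            code_id, kind = part      # a code is an (id, kind) tuple
            out.append(generic_code(code_id, kind))
    return "".join(out)


# Content that might come from "Press <b>Enter</b><br/>" in an HTML input.
parts = ["Press ", (1, "open"), "Enter", (1, "close"), (2, "placeholder")]
rendered = render_generic(parts)
```

Here rendered holds the text Press <1>Enter</1><2/>, whatever markup the codes had in the input document.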

Overwrite if source is the same — This option is applicable to the Pensieve TM output only. If this option is set and an entry to import has the same source text as one or more entries in the existing TM, all the matching existing entries will be replaced by the new one. Do not set this option if you want to allow different translations for the same source.
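The effect of this option can be sketched in Python with a TM modeled as a simple list of (source, target) pairs. This is only a model for illustration: the function name is hypothetical and the sketch says nothing about how a Pensieve TM stores or compares its entries.

```python
def import_entry(tm, source, target, overwrite_same_source):
    """Add an entry to the TM, optionally replacing entries with the same source."""
    if overwrite_same_source:
        # Remove every existing entry whose source text matches the new one.
        tm[:] = [entry for entry in tm if entry[0] != source]
    tm.append((source, target))


# Option set: the existing translation is replaced.
tm_overwrite = [("Open", "Ouvrir")]
import_entry(tm_overwrite, "Open", "Ouvre", overwrite_same_source=True)

# Option not set: both translations are kept for the same source.
tm_keep = [("Open", "Ouvrir")]
import_entry(tm_keep, "Open", "Ouvre", overwrite_same_source=False)
```

After these calls, tm_overwrite contains one entry for "Open" and tm_keep contains two.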

Do not output entries without text — Set this option to skip the entries of the input document that have no text in their content (i.e. they are empty or have only inline codes).
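A possible way to express this test is sketched below in Python. The content representation (strings for text, tuples for inline codes) is an assumption made for the example, and so is the choice to treat whitespace-only text as "no text", which the description above does not specify.

```python
def has_text(parts):
    """Return True if at least one part is text that is not only whitespace."""
    # Assumption: whitespace-only text does not count as text.
    return any(isinstance(part, str) and part.strip() for part in parts)


# Hypothetical entries keyed by name; tuples stand for inline codes.
entries = {
    "greeting": ["Hello ", (1, "placeholder")],
    "empty": [],
    "codes_only": [(1, "open"), (1, "close")],
}

kept = [name for name, parts in entries.items() if has_text(parts)]
```

With the option set, only the "greeting" entry would reach the output.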

Create a single output document — Set this option to generate one output document for all input documents. If this option is not set, one output document is created for each input document.

Output path — Enter the full path of the output document to generate. Note that for a Pensieve TM the output must be a directory. This field is enabled only if the option Create a single output document is set. You can use the variable ${rootDir} in the path.

Output paths are the input paths plus the new format extension — Set this option to create each output file at the same path as its input file, with an additional extension denoting the output format: .po, .tmx, etc. If this option is not set, the output paths are the ones specified by the calling application. Note that this option is available only if the option Create a single output document is not set, that is, when each input has a corresponding output file.
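The two ways of naming outputs can be sketched in Python as follows. Both function names and all paths are hypothetical, and the value given to ${rootDir} is just a placeholder for whatever the calling application supplies.

```python
from string import Template


def single_output_path(pattern, root_dir):
    """Expand ${rootDir} in the output path of a single output document."""
    return Template(pattern).substitute(rootDir=root_dir)


def per_input_output_path(input_path, extension):
    """Keep the input path and add the format extension after it."""
    # The extension is added to the full input name; it does not replace
    # the input's own extension.
    return input_path + extension


single = single_output_path("${rootDir}/output/all.tmx", "/projects/demo")
per_input = per_input_output_path("/projects/demo/help.html", ".tmx")
```

In this sketch, single is /projects/demo/output/all.tmx and per_input is /projects/demo/help.html.tmx.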


Limitations: None known.