How to Machine-Translate a TMX File

From Okapi Framework

Imagine that you have a TMX file of segments to be translated, and you need to fill it with machine-translation entries so you can use the file as a fall-back TM in a tool where you do not have access to machine translation.

Warning: You must be careful with the resulting file: It will be a TMX file with raw (un-edited) machine translation in it, without indication that the content is MT rather than a final translation. Using MT on TMX files is usually done within a specific process, where ultimately the MT candidates are post-edited in a controlled environment.

There are several ways to do this with the Okapi tools:

Using the Leveraging Step

If you want to use a machine translation system for which you have a connector, you can easily create a simple pipeline that uses the Leveraging Step.

1. Start Rainbow.

2. Drop your TMX document in the Input List 1 tab.

3. In the Languages and Encoding tab: select the proper languages and encoding. For a TMX document, only the target (output) encoding will be used as the input encoding is detected automatically.

4. In the Other Settings tab: if needed, change the name or location of the output file. We will keep the default, which is the name of the input file with an extra .out inserted before the .tmx extension (for example, myfile.tmx becomes myfile.out.tmx).

5. Select Utilities > Edit / Execute Pipeline. This opens the Edit / Execute Pipeline dialog box where you create the new pipeline.

We need three steps:

  • Raw Document to Filter Events Step
  • Leveraging Step
  • Filter Events to Raw Document Step

6. Use the Add Step button to add those three steps in that order.

The first and last steps have no parameters as they take their information from Rainbow's main tabs.

7. Select the Leveraging Step to set up your machine translation option. First, make sure the option Leverage the text units with existing translations is set. Those "existing translations" come from the connector you select. In this example we want to use a machine translation system, but you could also use translation memories. An MT system accessible to everyone is Google Translate, so select Google Translate Services. For more information on other systems see the "Connectors" page.

8. Make sure the option Leverage only if the match is equal or above this score has its value set to 95 or lower. Translation proposals coming from the Google MT Connector have a score of 95. If you set a higher value, no translation will be retained.

9. Make sure the option Fill the target with the leveraged translation is set. This tells the tool to copy the translation coming from the connector into the target.

Note that if there is already a target entry (empty or with text), the machine translation is copied over the existing one. The original target content is not overwritten by the machine translation in the following cases:

  • If the text unit is marked as non-translatable.
  • If the target has an approved property set to "yes".

Neither of those conditions is likely to exist in text units coming directly from a TMX file.
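
The rules from steps 8 and 9 can be summarized in a short Python sketch. This is not Okapi's code: the TextUnit class, its fields, and the leverage function are hypothetical names, and only the behavior described above is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextUnit:
    """Hypothetical stand-in for a text unit read from the TMX file."""
    source: str
    target: Optional[str] = None
    translatable: bool = True
    approved: bool = False  # models a target whose approved property is "yes"


def leverage(unit: TextUnit, candidate: str, score: int, threshold: int = 95) -> bool:
    """Copy the candidate into the target when the rules allow it.

    Returns True when the target was filled.
    """
    if score < threshold:
        # A candidate scored 95 is dropped if the threshold is set to 96 or more.
        return False
    if not unit.translatable or unit.approved:
        # Protected text units keep their original target content.
        return False
    # An existing target, empty or not, is overwritten.
    unit.target = candidate
    return True


if __name__ == "__main__":
    unit = TextUnit(source="Hello", target="")
    print(leverage(unit, "Bonjour", score=95), unit.target)
```

With the default threshold of 95 the candidate is kept; raising the threshold above the connector's score would leave every target unchanged.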

Notice that you could generate a TMX document with the translation directly from this step, instead of re-writing our original TMX. But in this case we want to translate the original TMX file, keeping all its attributes, comments, etc. So the best way to do this is to re-write the original file with the modified text units.

10. At this point you are ready to process the input file. Click Execute to run the pipeline.

Depending on the number of files you process and their size, it may take some time. The translations are fetched from the Internet, which may slow down the process further.

When it is done you should have an output TMX document in the same directory as the input one, and that file should have the machine translation for each source entry.
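
One way to check the result is to count the entries that still lack a target. The following is a minimal sketch using only Python's standard library. It assumes the usual TMX layout (tu elements containing tuv elements with a seg child) and an exact, case-insensitive match on the language code, so "fr" would not match "fr-FR". The function name and the file name in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_missing_targets(tmx_path: str, target_lang: str) -> Tuple[int, int]:
    """Return (number of tu elements, number of tu elements without target text)."""
    total = 0
    missing = 0
    for tu in ET.parse(tmx_path).getroot().iter("tu"):
        total += 1
        has_target = False
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older versions use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            # itertext() also collects text that sits around inline codes.
            text = "".join(seg.itertext()).strip() if seg is not None else ""
            if lang.lower() == target_lang.lower() and text:
                has_target = True
        if not has_target:
            missing += 1
    return total, missing


if __name__ == "__main__":
    total, missing = count_missing_targets("myfile.out.tmx", "fr")
    print(f"{missing} of {total} entries have no target text")
```

A non-zero count of missing entries may simply mean the connector returned no proposal for those segments, or that the score threshold was set too high.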

Using the Batch Translation Step

In some cases you may have an MT system for which there is no connector in Okapi. You can still use it, as long as a few requirements are fulfilled:

  • the MT system must be able to translate HTML files
  • the MT system must have a command-line mode

For example, a system that fulfills those requirements is ProMT. It can translate HTML documents, and can be run from the command line. Note that some versions of ProMT are capable of taking the TMX file directly as input, but for the purpose of this example we assume you cannot do that.

1. Start Rainbow.

2. Drop your TMX document in the Input List 1 tab.

3. In the Languages and Encoding tab: select the proper languages and encoding. For a TMX document, only the target (output) encoding will be used as the input encoding is detected automatically.

4. Select Utilities > Batch Translation. This is a pre-defined pipeline, with a single step: the Batch Translation Step.

5. In the Command line field enter the DOS command that calls ProMT to translate an HTML document. For the input file use the variable ${inputPath}, and for the output use the variable ${outputPath}. You also need to specify the language pair with the /d parameter. You can use the two variables ${srcLangName} and ${trgLangName} for this.

"C:\Program Files\PRMT9\FILETRANS\FileTranslator.exe" ${inputPath} /as /ac /d:${srcLangName}-${trgLangName} /o:${outputPath}

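To see what the command line above looks like once the variables are replaced, here is a small Python sketch. It uses string.Template, which happens to share the ${name} syntax; it only mimics the substitution and is not how Okapi implements it. The file paths and the language names "English" and "French" are assumed values for illustration.

```python
from string import Template

# The command line from step 5, kept verbatim (raw strings preserve backslashes).
COMMAND = Template(
    r'"C:\Program Files\PRMT9\FILETRANS\FileTranslator.exe" ${inputPath} '
    r"/as /ac /d:${srcLangName}-${trgLangName} /o:${outputPath}"
)


def expand(input_path: str, output_path: str, src_lang_name: str, trg_lang_name: str) -> str:
    """Return the command line with all four variables replaced."""
    return COMMAND.substitute(
        inputPath=input_path,
        outputPath=output_path,
        srcLangName=src_lang_name,
        trgLangName=trg_lang_name,
    )


if __name__ == "__main__":
    # Hypothetical temporary file names and language names.
    print(expand(r"C:\tmp\in.html", r"C:\tmp\out.html", "English", "French"))
```

Printing the expanded command this way can help spot a missing variable or a wrong language pair before running the step on a large file.
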
6. Make sure the option Create the following TMX document is set and enter the full path of the TMX document to create.

7. At this point you are ready to execute the process: click Execute.

This will take the input TMX, convert chunks of its content into a temporary HTML file, run the command line on that HTML document, get back the translation from the translated HTML, and place it into the TMX output.
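
The round trip described above can be illustrated with a simplified Python sketch. The HTML layout (one p element per segment, identified by a numeric id) and all function names are assumptions made for this illustration; the temporary file the step actually writes may look different. The translate_html argument stands in for the external command line.

```python
import html
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Tuple


def segments_to_html(segments: List[str]) -> str:
    """Write each source segment as a paragraph carrying a numeric id."""
    paragraphs = [
        f'<p id="{i}">{html.escape(text)}</p>' for i, text in enumerate(segments)
    ]
    return "<html><body>\n" + "\n".join(paragraphs) + "\n</body></html>"


class _ParagraphReader(HTMLParser):
    """Collect the text of every <p id="N"> element, keyed by N."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.texts: Dict[int, str] = {}
        self._current: Optional[int] = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            pid = dict(attrs).get("id")
            if pid is not None and pid.isdigit():
                self._current = int(pid)
                self.texts[self._current] = ""

    def handle_endtag(self, tag):
        if tag == "p":
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.texts[self._current] += data


def html_to_segments(document: str, count: int) -> List[str]:
    """Read the translated paragraphs back in their original order."""
    reader = _ParagraphReader()
    reader.feed(document)
    reader.close()
    return [reader.texts.get(i, "") for i in range(count)]


def batch_translate(
    segments: List[str], translate_html: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (source, translation) pairs, ready to be written as TMX entries."""
    translated = translate_html(segments_to_html(segments))
    return list(zip(segments, html_to_segments(translated, len(segments))))


if __name__ == "__main__":
    # Stand-in for the external MT tool: replaces one word in the HTML.
    fake_mt = lambda doc: doc.replace("Hello", "Bonjour")
    print(batch_translate(["Hello & welcome", "a < b"], fake_mt))
```

In a real setup translate_html would write the HTML to a temporary file, run the command line on it, and read the output file back. Keeping an id on each paragraph is one way to match translations to their sources even if the MT tool reflows the document.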

Note that because the Batch Translation Step is a step, you can also use it in your own pipelines, along with other steps, to perform a set of customized tasks that correspond to your specific needs. See "How to Create a Pipeline in Rainbow" for more details.