QuEst Quality Estimation Step

From Okapi Framework

Overview

This step assigns quality estimation scores to translations.

Warning: This step is under development.

Takes: Filter events. Sends: Filter events.

Note: This step is part of the QuEst Plugin and is not included in the general distribution.

The input of the step can be either:

  • A list of bilingual documents (such as TMX or XLIFF) containing translation pairs, along with text files containing the scores for all the translations.
  • A list of pairs of text files containing source and target language phrases (one per line), aligned at the segment level.
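As an illustration of the second input form, the following minimal sketch reads two line-aligned text files into translation pairs and rejects files whose line counts differ. It is not part of the step itself: the function names and the equal-line-count check are assumptions made for the example.

```python
def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines without trailing newlines."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


def read_aligned_pairs(source_path, target_path, encoding="utf-8"):
    """Return (source, target) pairs from two files aligned line by line.

    Hypothetical helper: line N of the source file is assumed to be the
    source of the translation found on line N of the target file.
    """
    source_lines = read_lines(source_path, encoding)
    target_lines = read_lines(target_path, encoding)
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"Files are not aligned: {len(source_lines)} source lines "
            f"but {len(target_lines)} target lines"
        )
    return list(zip(source_lines, target_lines))
```

Running a check like this before starting the pipeline can catch misaligned files early, since a single missing line shifts every following pair.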

The step generates a TMX document with a plain text version of the source and target of each segment, a score representing the QuEst quality estimation of the translation, and other information.
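To give a rough idea of what such an output could look like, the sketch below builds a small TMX 1.4 document in which each translation unit carries a score. The property name x-quality-score, the header values, and the function name are all assumptions made for the example; the names actually written by the step may differ.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute used on <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(entries, source_lang, target_lang):
    """Build a TMX string from (source, target, score) tuples.

    Hypothetical layout: the score is stored in a <prop> element whose
    type "x-quality-score" is an assumption, not the step's real name.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "plaintext",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target, score in entries:
        tu = ET.SubElement(body, "tu")
        prop = ET.SubElement(tu, "prop", type="x-quality-score")
        prop.text = f"{score:.4f}"
        # One <tuv> per language, each holding the plain text segment.
        for lang, text in ((source_lang, source), (target_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello", "Bonjour", 0.5)], "en", "fr"))
```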

The prediction model used by this step can be generated with the QuEst SVM Model Builder Step, and you can associate the quality scores generated with translated entries using the Properties Setting Step.


Parameters

Temporary folder — Path to the folder where the temporary files produced by the step are saved. This folder must already exist.

Output folder — Path to the folder where the output TMX document containing the quality scores is saved.

Features file — Path to XML file describing which features should be calculated for the translation pairs in order to build the quality prediction model. TODO: Refer the reader to a page with an example file.

Lowercase the input — Select this option to lowercase the translations.

Alignment probability file — Path to file containing the alignment probabilities for source and target languages. The step can produce this file automatically if the parameter is left blank, but only if the Source training corpus and Target training corpus are no longer than a couple of hundred sentences. We recommend that you install and run GIZA++ using this GIZA++ Installation and Running Tutorial to produce this file. Note that this tutorial describes a customized installation method, which means there are special steps that you must follow in order to produce the alignment probability file.

Source n-gram file — Path to file containing the n-gram counts for the source language. The step can produce this file automatically if the parameter is left blank, provided that the SRILM binaries folder parameter is set correctly. You may also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.

Target n-gram file — Path to file containing the n-gram counts for the target language. The step can produce this file automatically if the parameter is left blank, provided that the SRILM binaries folder parameter is set correctly. You may also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.
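For readers unfamiliar with n-gram counts, the sketch below shows the underlying idea on a corpus with one sentence per line: every run of 1 to N consecutive tokens is counted. It is only an illustration with whitespace tokenization and a hypothetical function name. It does not reproduce SRILM's file format or its handling of sentence boundaries, so its output cannot be used in place of the files described above.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count all n-grams of order 1..max_order in the given sentences.

    Tokens are obtained by splitting on whitespace, which is a
    simplification made for this example.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for order in range(1, max_order + 1):
            # Slide a window of `order` tokens across the sentence.
            for start in range(len(tokens) - order + 1):
                counts[tuple(tokens[start:start + order])] += 1
    return counts


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat slept"]
    for ngram, count in sorted(count_ngrams(corpus, max_order=2).items()):
        print(" ".join(ngram), count)
```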

Source training corpus — Path to file containing the training corpus for the source language. The file must contain one sentence per line.

Target training corpus — Path to file containing the training corpus for the target language. The file must contain one sentence per line.

Quality prediction model file — Path to the file containing the quality prediction model produced by the QuEst SVM Model Builder Step.


Limitations

  • Part of this step is triggered at the END_BATCH event, so the results of the step are not available to other steps in the pipeline where it is called.
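The effect of this limitation can be sketched with a toy event pipeline. The classes below are hypothetical and written in Python for brevity; they are not the Okapi API. They only mimic a step that buffers segments and does its work when END_BATCH arrives, which is why a later step in the same pipeline never sees the scores while it is processing the segments.

```python
class BufferingScoreStep:
    """Hypothetical step: collects segments, scores them only at END_BATCH."""

    def __init__(self, scorer):
        self.scorer = scorer
        self.pending = []
        self.results = None  # stays None until END_BATCH is received

    def handle_event(self, event_type, payload=None):
        if event_type == "TEXT_UNIT":
            self.pending.append(payload)
        elif event_type == "END_BATCH":
            self.results = [(seg, self.scorer(seg)) for seg in self.pending]
        return event_type, payload  # the event is passed on unchanged


class DownstreamStep:
    """Hypothetical later step that records what the scores look like
    at the moment each segment passes through it."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.seen = []

    def handle_event(self, event_type, payload=None):
        if event_type == "TEXT_UNIT":
            self.seen.append(self.upstream.results)
        return event_type, payload


def run_pipeline(steps, events):
    """Send each event through every step in order."""
    for event in events:
        for step in steps:
            event = step.handle_event(*event)


if __name__ == "__main__":
    scoring = BufferingScoreStep(scorer=len)  # placeholder scorer
    downstream = DownstreamStep(scoring)
    run_pipeline(
        [scoring, downstream],
        [("START_BATCH", None), ("TEXT_UNIT", "a"),
         ("TEXT_UNIT", "bb"), ("END_BATCH", None)],
    )
    print(downstream.seen)   # scores were not ready for any segment
    print(scoring.results)   # scores exist only after END_BATCH
```

In this sketch the scores become usable only once the batch has finished, for example in a separate pipeline run afterwards.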