QuEst SVM Model Builder Step

From Okapi Framework

Overview

This step allows you to build a QuEst prediction model that can be used with the QuEst Quality Estimation Step.

Takes: Filter events. Sends: Filter events.

Note: This step is part of the QuEst Plugin and is not included in the general distribution.

The input of the step can be either:

  • A list of bilingual documents (such as TMX or XLIFF) containing translation pairs, along with text files containing the scores for all the translations.
  • A list of pairs of text files containing source and target language phrases (one per line, aligned at the segment level), along with text files containing the scores for the translations.
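
For the plain-text input form, the step depends on the source, target, and score files lining up segment by segment. A quick pre-flight check can catch misaligned files before the step runs. The following is a minimal Python sketch, not part of the step itself. It assumes that the scores file holds one numeric score per line in the same order as the segments, which this page implies but does not state. The function name `check_aligned` and the file names are hypothetical.

```python
from pathlib import Path


def check_aligned(source_path, target_path, scores_path):
    """Return the segment count if the three files line up.

    Assumption (not confirmed by the step documentation): the scores file
    holds one numeric score per line, in the same order as the segments.
    """
    def read_lines(path):
        return Path(path).read_text(encoding="utf-8").splitlines()

    source = read_lines(source_path)
    target = read_lines(target_path)
    scores = read_lines(scores_path)

    # All three files must describe the same number of segments.
    if not (len(source) == len(target) == len(scores)):
        raise ValueError(
            f"Line counts differ: source={len(source)}, "
            f"target={len(target)}, scores={len(scores)}"
        )

    # Every score line must parse as a number.
    for number, line in enumerate(scores, start=1):
        try:
            float(line)
        except ValueError:
            raise ValueError(
                f"Line {number} of the scores file is not a number: {line!r}"
            ) from None

    return len(source)
```

Calling `check_aligned("source.txt", "target.txt", "scores.txt")` either returns the number of segments or raises a `ValueError` naming the first problem found.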

The output of the step is a text file produced by the SVM Java library, which represents a Translation Quality Prediction Model. This text file is one of the parameters required by the QuEst Quality Estimation Step for it to produce quality estimates for translations.


Parameters

Temporary folder — Path to the folder where the step saves its temporary files. This folder must already exist.

Features file — Path to the XML file describing which features should be calculated for the translation pairs in order to build the quality prediction model. TODO: Refer the reader to a page with an example file.

Lowercase the input — Select this option to lowercase the translations.

Alignment probability file — Path to the file containing the alignment probabilities for the source and target languages. If this parameter is left blank, the step can produce the file automatically, but only if the Source training corpus and Target training corpus are no longer than a couple of hundred sentences. We recommend that you install and run GIZA++ using this GIZA++ Installation and Running Tutorial to produce this file. Note that the tutorial describes a customized installation method, with special steps that you must follow in order to produce the alignment probability file.

Source n-gram file — Path to the file containing the n-gram counts for the source language. If this parameter is left blank, the step can produce the file automatically, provided that the SRILM binaries folder parameter is set correctly. You can also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.

Target n-gram file — Path to the file containing the n-gram counts for the target language. If this parameter is left blank, the step can produce the file automatically, provided that the SRILM binaries folder parameter is set correctly. You can also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.

Source training corpus — Path to the file containing the training corpus for the source language. The file must contain one sentence per line.

Target training corpus — Path to the file containing the training corpus for the target language. The file must contain one sentence per line.

Source language model — Path to the file containing the language model for the source language. If this parameter is left blank, the step can produce the file automatically, provided that the SRILM binaries folder parameter is set correctly. You can also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.

Target language model — Path to the file containing the language model for the target language. If this parameter is left blank, the step can produce the file automatically, provided that the SRILM binaries folder parameter is set correctly. You can also run SRILM independently, following this SRILM Installation and Running Tutorial, to obtain this file.
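
If you prefer to produce the n-gram and language model files yourself, SRILM's ngram-count tool can write both from a one-sentence-per-line corpus. The sketch below builds the two command lines in Python, using the standard ngram-count options -text, -order, -write, and -lm. The n-gram order of 3, the helper names, and the file names are assumptions made for illustration; this page does not say which order or options the step itself uses, so treat the tutorial as the authority on the exact settings.

```python
import subprocess
from pathlib import Path


def srilm_commands(srilm_bin, corpus, counts_out, lm_out, order=3):
    """Build two ngram-count invocations as argument lists.

    Assumptions: order=3 and the output file names are placeholders,
    not values prescribed by the step.
    """
    ngram_count = str(Path(srilm_bin) / "ngram-count")
    common = [ngram_count, "-text", str(corpus), "-order", str(order)]
    count_cmd = common + ["-write", str(counts_out)]  # n-gram counts file
    lm_cmd = common + ["-lm", str(lm_out)]            # language model file
    return count_cmd, lm_cmd


def run_srilm(srilm_bin, corpus, counts_out, lm_out, order=3):
    """Run both commands, raising CalledProcessError if either fails."""
    for cmd in srilm_commands(srilm_bin, corpus, counts_out, lm_out, order):
        subprocess.run(cmd, check=True)
```

You would call `run_srilm` once for the source corpus and once for the target corpus, then point the corresponding n-gram file and language model parameters at the resulting files.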

SRILM binaries folder — Path to the folder containing the binaries of your SRILM installation. This must be the folder that contains the binaries named ngram and ngram-count. If you have any problems installing SRILM, try the SRILM Installation and Running Tutorial.
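
Because several parameters depend on this folder being correct, it can help to verify it before launching the step. The following minimal Python sketch reports which of the two binaries named above are missing from a given folder. The function name `missing_srilm_binaries` is hypothetical, and the check only looks for files with exactly these names; it does not test whether they are executable.

```python
from pathlib import Path

# The two binaries the step documentation says the folder must contain.
REQUIRED_BINARIES = ("ngram", "ngram-count")


def missing_srilm_binaries(folder):
    """Return the names of required SRILM binaries not found in folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_BINARIES if not (folder / name).is_file()]
```

An empty list means both files are present; anything else names the binaries that could not be found.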

Prediction model output file — Path to the Translation Quality Prediction Model, which is the output text file produced by the SVM Java library. This file is required by the QuEst Quality Estimation Step in order for it to calculate quality estimates for translations.


Limitations

  • Part of this step is triggered at the END_BATCH event, so you cannot use the result of the step in the pipeline where it is called.
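
The reason for this limitation is easier to see in a toy model of a pipeline, in which every event passes through all steps before the next event is read. The Python sketch below is purely illustrative and does not use the Okapi API: the classes `ToyModelBuilder` and `ToyConsumer` and the function `run_pipeline` are hypothetical, and only the event names mirror Okapi's. It shows that a later step in the same pipeline has already handled every text unit by the time the builder finishes its work at END_BATCH.

```python
def run_pipeline(events, steps):
    """Toy pipeline: pass each event through every step, in order."""
    for event in events:
        for step in steps:
            step(event)


class ToyModelBuilder:
    """Collects pairs and only 'builds' its model at END_BATCH."""

    def __init__(self):
        self.pairs = []
        self.model = None  # stays None until END_BATCH arrives

    def __call__(self, event):
        kind, payload = event
        if kind == "TEXT_UNIT":
            self.pairs.append(payload)
        elif kind == "END_BATCH":
            self.model = f"model trained on {len(self.pairs)} pairs"


class ToyConsumer:
    """Records, for each text unit, whether the model was available."""

    def __init__(self, builder):
        self.builder = builder
        self.model_available = []

    def __call__(self, event):
        kind, _ = event
        if kind == "TEXT_UNIT":
            self.model_available.append(self.builder.model is not None)


events = [
    ("START_BATCH", None),
    ("TEXT_UNIT", ("Hello", "Bonjour")),
    ("TEXT_UNIT", ("Goodbye", "Au revoir")),
    ("END_BATCH", None),
]
builder = ToyModelBuilder()
consumer = ToyConsumer(builder)
run_pipeline(events, [builder, consumer])
```

After the run, `consumer.model_available` contains only `False` values even though `builder.model` is set. In practice this means building the model in one pipeline and running the QuEst Quality Estimation Step in a separate one.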