Translation Comparison Step
Overview
This step compares the translations in two (or three) documents.
Takes: Filter events. Sends: Filter events.
The input is two or three documents. The first document contains the base translations; the second and, optionally, the third contain the translations to compare to the base translations. The input documents must have the same number of text units, and the text units must be in the same order (but you can have different events between text units).
For each entry the step compares the translated texts and calculates scores that measure how similar the text of the second document (and optionally the third one) is to the text of the baseline translation.
Each entry has two scores, ED-Score and FM-Score. Both scores range from 0 to 100, where 0 means the two segments are completely different and 100 means there is no difference between them.
- ED-Score is an Edit Distance score based on the Levenshtein distance and additional processing. You can find more information on that metric at http://en.wikipedia.org/wiki/Levenshtein_distance.
- FM-Score is a Fuzzy Match score based on Sørensen-Dice's Coefficient with 3-grams. You can find more information on the coefficient at http://en.wikipedia.org/wiki/Dice's_coefficient.
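The two metrics can be illustrated with a minimal Python sketch. It is not the step's actual implementation: the "additional processing" applied to the ED-Score is not described here, so the normalization by the longer string's length, the character-level 3-grams, and the function names `levenshtein`, `ed_score` and `fm_score` are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ed_score(a: str, b: str) -> float:
    """Assumed scaling: 100 * (1 - distance / length of the longer text)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def trigrams(text: str) -> list:
    """Character 3-grams (an assumption; the step may tokenize differently)."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def fm_score(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over 3-gram multisets, scaled to 0..100."""
    grams_a, grams_b = Counter(trigrams(a)), Counter(trigrams(b))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both texts are shorter than three characters: compare directly.
        return 100.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 100.0 * 2 * overlap / total
```

With these definitions, identical texts score 100 on both metrics, and texts that share no 3-gram get an FM-Score of 0 even when their edit distance is small.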
The Summary section of the report has additional statistics that may help you understand the differences between the compared documents:
- Table of scores. This is a table listing the number of matches broken down into 11 groups. If a word count is available, the corresponding source word count is included in the table.
- Total Number of Segments.
- Total Number of Words.
- Average word count per segment. The formula is (Total Number of Words) / (Total Number of Segments).
- Average ED-Score (by segment). All ED-Scores are averaged among all segments. The formula is (The sum of all ED-Scores) / (Total Number of Segments).
- Average FM-Score (by segment). All FM-Scores are averaged among all segments. The formula is (The sum of all FM-Scores) / (Total Number of Segments).
- Average ED-Score (by word). Each word in a segment is assigned the ED-Score of the segment. Then all word ED-Scores are averaged among all words. The formula is (The sum of all (words in a segment) * (segment’s ED-Score)) / (Total Number of Words).
- Average FM-Score (by word). Each word in a segment is assigned the FM-Score of the segment. Then all word FM-Scores are averaged among all words. The formula is (The sum of all (words in a segment) * (segment’s FM-Score)) / (Total Number of Words).
- Edit Effort Score. This is a numerical value that corresponds to the amount of effort an editor may have applied when changing the text from the first document to the second. To calculate it, we first average the by-word Average ED-Score and Average FM-Score. We use the by-word scores because we want to account for the length of segments, i.e. longer segments take more effort. We then reverse the scale, so that 0 represents no effort and 100 represents the most effort applied to changing the text from the first document to the second. The formula is 100 - (Average ED-Score (by word) + Average FM-Score (by word)) / 2.
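The summary formulas listed above can be worked through in a short Python sketch. The function name `summarize` and the input layout (one `(word_count, ed_score, fm_score)` tuple per segment) are hypothetical and chosen only for illustration; the arithmetic follows the formulas as stated.

```python
def summarize(segments):
    """Compute the summary statistics from (word_count, ed_score, fm_score) tuples."""
    total_segments = len(segments)
    total_words = sum(words for words, _, _ in segments)
    if total_segments == 0 or total_words == 0:
        raise ValueError("need at least one segment with a non-zero word count")

    # By-segment averages: every segment weighs the same.
    ed_by_segment = sum(ed for _, ed, _ in segments) / total_segments
    fm_by_segment = sum(fm for _, _, fm in segments) / total_segments

    # By-word averages: each segment is weighted by its word count.
    ed_by_word = sum(words * ed for words, ed, _ in segments) / total_words
    fm_by_word = sum(words * fm for words, _, fm in segments) / total_words

    return {
        "total_segments": total_segments,
        "total_words": total_words,
        "avg_words_per_segment": total_words / total_segments,
        "avg_ed_by_segment": ed_by_segment,
        "avg_fm_by_segment": fm_by_segment,
        "avg_ed_by_word": ed_by_word,
        "avg_fm_by_word": fm_by_word,
        # Reverse the scale: 0 is no effort, 100 is the most effort.
        "edit_effort": 100 - (ed_by_word + fm_by_word) / 2,
    }


# Made-up data: a short untouched segment and a longer, heavily edited one.
stats = summarize([(10, 100.0, 100.0), (30, 60.0, 80.0)])
print(stats["avg_ed_by_segment"], stats["avg_ed_by_word"], stats["edit_effort"])
# prints: 80.0 70.0 22.5
```

In this made-up example the by-word ED average (70) is lower than the by-segment one (80), because the longer segment is the one that was edited more.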
The utility outputs an HTML document for each input document. The HTML document lists the source text, the translation text and the resulting scores. You can also generate a TMX document containing the same information.
Parameters
Generate output tables in HTML — Set this option to create one HTML file for each input document, where the results of the comparisons are listed. The name of the HTML result file is the name of the first document, plus a .html extension.
When this option is selected the step also generates a tab-delimited file with the scores of each entry in the document. The name of this file is the name of the first document, plus a .txt extension.
Use generic representation (e.g. <1>...</1>) for the inline codes — Set this option to output the HTML report using a generic representation for inline codes (e.g. <1>...</1>). If this option is not set, the output uses the original code whenever possible. Note that some formats may have long and complicated original codes.
Open the first HTML output after completion — Set this option to open the first HTML result file created when the process is completed.
Generate a TMX output document — Set this option to create a single TMX document that contains all the translations compared and their scores, for all input documents. You can use the variables ${rootDir} and ${inputRootDir}, as well as any of the source or target locale variables (${srcLoc}, ${trgloc}, etc).
Enter the full path of the TMX document to generate.
Suffix for the target language code of the document 2 — Enter the suffix to append to the target language code for the second set of target entries. For example, enter -mt to indicate that the second target language is a machine-translation generated text. If you do not enter a suffix, the language code for both target entries will be the same and the TMX document will include only the last one.
Suffix for the target language code of the document 3 — Enter the suffix to append to the target language code for the third set of target entries.
Use alt-trans for document 1 — Set this option if document 1 is an XLIFF file and you want to use the target of a given <alt-trans> element as the base target. For example, this option lets you compare the human translation done by a linguist in an XLIFF file with the MT candidate that was provided for the same entry: simply specify the same file for both document 1 and document 2.
Value of the original attribute — Enter the value of the original attribute of the <alt-trans> element that should be used. If no element is found for a given trans-unit, that trans-unit will not be included in the report.
Label for the document 1 — Enter an optional label that is used in the report to label the base translation.
Label for the document 2 — Enter an optional label that is used in the report to label the second translation.
Label for the document 3 — Enter an optional label that is used in the report to label the optional third translation.
Take into account case differences — Set this option to take into account case differences when calculating the scores.
Take into account white-space differences — Set this option to take into account white-space differences when calculating the scores.
Take into account punctuation differences — Set this option to take into account punctuation differences when calculating the scores.
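One way to picture the three options above is as a normalization pass applied to both texts before scoring. The following Python sketch is only an illustration under that assumption: the function name `normalize`, its keyword arguments, and the exact rules (dropping Unicode punctuation, collapsing white-space runs, lowercasing) are hypothetical and may differ from what the step actually does.

```python
import unicodedata


def normalize(text, case=False, whitespace=False, punctuation=False):
    """Remove the kinds of differences that are NOT taken into account.

    Each flag mirrors one option: True keeps that kind of difference,
    False (the assumed default here) erases it before comparison.
    """
    if not punctuation:
        # Drop every character whose Unicode category is punctuation (P*).
        text = "".join(ch for ch in text
                       if not unicodedata.category(ch).startswith("P"))
    if not whitespace:
        # Collapse runs of white space and trim both ends.
        text = " ".join(text.split())
    if not case:
        text = text.lower()
    return text


# With all three kinds of difference ignored, these two texts compare equal.
print(normalize("Hello,  World!") == normalize("hello world"))
# prints: True
```

Under this sketch, setting an option leaves the corresponding differences in the text, so they lower the resulting scores.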
Append the average results to a log — Set this option to append the average scores for each document at the end of a log file. Enter the full path of the log file.
The log file created is a tab-delimited file in UTF-8 with the following fields: Time stamp in UTC, path of the base document, path of the second document, edit-distance score, fuzzy score. For example:
2013-04-01 16:09:14+0000 /C:/myProject1/machine.tmx /C:/myProject1/human.tmx 93.75 96.50
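A log in this layout can be read back with a few lines of Python. The sketch below assumes exactly the five tab-separated fields described above and a time stamp formatted as in the example line; the function name `read_score_log` and the dictionary keys are hypothetical.

```python
import io
from datetime import datetime


def read_score_log(stream):
    """Parse tab-delimited log lines into dictionaries, skipping malformed lines."""
    rows = []
    for line in stream:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 5:
            continue  # not a score line in the assumed five-field layout
        timestamp, base_path, second_path, ed, fm = fields
        rows.append({
            "timestamp": datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S%z"),
            "base": base_path,
            "second": second_path,
            "ed_score": float(ed),
            "fm_score": float(fm),
        })
    return rows


# Usage with an in-memory sample; for a real log, open the file with encoding="utf-8".
sample = ("2013-04-01 16:09:14+0000\t/C:/myProject1/machine.tmx"
          "\t/C:/myProject1/human.tmx\t93.75\t96.50\n")
for row in read_score_log(io.StringIO(sample)):
    print(row["timestamp"].isoformat(), row["ed_score"], row["fm_score"])
# prints: 2013-04-01T16:09:14+00:00 93.75 96.5
```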
Limitations
- None known.