Microsoft Batch Translation Step

From Okapi Framework
Revision as of 22:36, 24 April 2019 by Kuro2 (talk | contribs)

Retirement of version 2 API

The MICROSOFT CONNECTOR of the Okapi stable releases will STOP WORKING at the end of April 2019.

Microsoft will retire its version 2 API on 2019-04-30, as described in this page. Because of this, the Microsoft Connector found in the latest stable release, M37, will no longer work from 2019-05-01 onward.

Support for the version 3 API was added to Okapi in mid-April, after the M37 release. If you need to use Microsoft's machine translation service, please pick up the M38 snapshot version from here. Please note that this is a minimal implementation and that it does not support any of the new features, such as profanity filtering.

Because the version 3 API no longer supports the translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work. For information on how to obtain an Azure key, please see this page.
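As an illustration of where the key goes, here is a minimal Python sketch that builds, but does not send, a version 3 translate request for several texts at once. It is not Okapi's implementation: the function name `build_translate_request` and the placeholder key are hypothetical, and the endpoint, query parameters and header are those of Microsoft's public Translator Text API version 3. Some Azure resources may also require a region header, which is not shown here.

```python
import json
import urllib.parse
import urllib.request

# Public endpoint of the Translator Text API version 3.
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(azure_key, texts, source, target, category=None):
    """Build (but do not send) a version 3 request translating several texts."""
    params = {"api-version": "3.0", "from": source, "to": target}
    if category:
        # Optional identifier of a custom (trained) engine.
        params["category"] = category
    url = ENDPOINT + "?" + urllib.parse.urlencode(params)
    # The body is a JSON array with one object per text to translate.
    body = json.dumps([{"Text": text} for text in texts]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": azure_key,  # the Azure key
        "Content-Type": "application/json; charset=UTF-8",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical placeholder key; replace it with your own Azure key.
request = build_translate_request("YOUR-AZURE-KEY", ["Hello", "Goodbye"], "en", "fr")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the query; the sketch stops before that so it can be run without a valid key.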

The information below is mostly out of date. It is kept as a reference until this page is fully updated.


This step annotates the text units of the input documents with Microsoft Translator candidates and/or creates a TM from them.

Takes: Filter events. Sends: Filter events (possibly annotated) or raw document.

You must have a "Client ID" and a "Client Secret" from Microsoft to use this step. You get those by obtaining a Windows Live ID and then registering an application in your Live account. See the MSDN pages for more information.

You must also respect Microsoft's Terms of Service. If you intend to use the Microsoft Translator API for commercial or high-volume purposes, you need to sign a commercial license agreement and provide your AppID to the Microsoft Translator team. For more details contact

Text units flagged as non-translatable are not sent for translation.

Note that using the Leveraging Step with the Microsoft Translator Connector produces MT results similar to those of this step. However, this step can process several text units at once and is therefore much faster.

In some cases the MT output can be improved automatically. For example, extra or missing spaces around inline codes can be fixed with the Space Check Step.


Client ID — The Client ID to use to connect to the MT server. See the MSDN pages for more information.

Client Secret — The secret corresponding to the Client ID.

Category — An optional category to use when working with trained engines. You can either enter the engine identifier directly (called 'category' in Microsoft Translator Hub), or you can use a keyword in the form @@@keyword@@@. If you specify a keyword, you must specify a properties file in the Engine Mapping field.

The keyword can be a literal string or the ${domain} variable. When ${domain} is used, the variable is replaced by the first occurrence of the value for the ITS Domain annotation found on a text unit. Ideally this Domain annotation should be set on the first text unit of the first document processed. All batches of events translated before a domain annotation is found are translated with the empty category.

Note: As stated above, only the first occurrence of the Domain annotation has an effect on the selection of the engine.
Note: Because this step works on batches, segments that come before the first occurrence of the Domain annotation but are in the same batch are translated with the engine specified by the domain. For example, if you have 100 events, the Events buffer is set to 50, and the first occurrence of the Domain annotation is in the 60th event, then the first 50 events are translated with the empty category and all the other events, including events 51 to 59, are translated with the engine corresponding to the specified domain.
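The batch behavior described in these notes can be sketched in Python as follows. This is an illustrative model only, not the step's actual code: the function `assign_categories` and the representation of events as a list of Domain values are hypothetical.

```python
def assign_categories(domains, buffer_size):
    """Return the category used for each event, processing events in batches.

    domains[i] is the ITS Domain value of event i, or None when there is none.
    """
    categories = []
    selected = ""  # empty category until a Domain annotation is found
    for start in range(0, len(domains), buffer_size):
        batch = domains[start:start + buffer_size]
        if not selected:
            # Only the first Domain annotation ever found has an effect.
            selected = next((d for d in batch if d), "")
        # The whole batch is sent with one category, including the events
        # that come before the annotation inside the same batch.
        categories.extend([selected] * len(batch))
    return categories


# 100 events, Events buffer of 50, first Domain annotation on the 60th event.
domains = [None] * 100
domains[59] = "travel"
result = assign_categories(domains, 50)
print(repr(result[49]), repr(result[50]), repr(result[59]))
```

In this model, events 1 to 50 get the empty category and events 51 to 100 get the domain found on event 60, which matches the example given in the note.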

Engine Mapping — Enter the path of the properties file that contains the mapping between the category keywords and the Microsoft Translator Hub engine identifier. You can use the variables ${rootDir} and ${inputRootDir}, as well as any of the source or target locale variables (${srcLoc}, ${trgLoc}, etc). Leave the path empty to not use a mapping. The properties file is a list of lines in the form:

  <keyword>.<language>=<engine identifier>

  • <keyword> is a case-sensitive string (without spaces, equal signs or periods) that corresponds to the keyword part in @@@keyword@@@.
  • <language> is the uppercase language code of the target locale to process.

For example, if you have the following engine mapping file:


To use the first engine (assuming you are translating into French), specify @@@travel@@@ in the Category. To use the third engine specify There is also a fallback mechanism where if you specify it would first look for and if not found it would look for a client2.DE. If no custom engine is found, the generic Microsoft-provided engine is used.
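The keyword lookup can be sketched in Python as shown below. Several things here are assumptions rather than facts taken from this page: the line format `<keyword>.<language>=<engine identifier>` is inferred from the key `client2.DE` mentioned above, the engine identifiers in the sample are invented, and the function names are hypothetical. The intermediate fallback between keys is not modeled; only the final fallback to the generic engine (empty category) is.

```python
import re


def load_mapping(text):
    """Parse 'key=value' lines, skipping blank lines and # comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


def resolve_category(category, target_language, mapping):
    """Turn the Category field into the engine identifier to send."""
    match = re.fullmatch(r"@@@(.+)@@@", category)
    if match is None:
        # No keyword markers: the engine identifier was entered directly.
        return category
    # The keyword is case-sensitive; the language code is uppercase.
    key = match.group(1) + "." + target_language.upper()
    # An empty category means the generic engine is used.
    return mapping.get(key, "")


# Hypothetical mapping entries, for illustration only.
SAMPLE = """\
# keyword.LANGUAGE=engine identifier
travel.FR=engine-id-1
travel.DE=engine-id-2
"""
mapping = load_mapping(SAMPLE)
print(resolve_category("@@@travel@@@", "fr", mapping))
```

A keyword with no entry for the target language resolves to the empty category, so the query goes to the generic engine.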

Events buffer — Enter the number of events to buffer for a single query to the engine. The larger the buffer, the faster the processing. However, there are also limits on the volume of text you can process at once.

Maximum matches — Enter the maximum number of matches you want to allow per source text.

Threshold — Enter the score below which a match is not kept as a result. See the Microsoft Translator Connector to understand how scores are computed based on the match degree and rating values.

Query only entries without existing candidate — Set this option to send to Microsoft Translator only the text for which there is currently no candidate (i.e. annotations added by previous steps or coming from the original document).

Annotate the text units with the translations — Set this option to add to the text units annotations that hold the matches found. Those annotations may be used later by other steps. Existing annotations are preserved.

Generate a TMX document — Set this option to create a TMX output. Enter the full path of the TMX document to generate. If a document with that path already exists, it is overwritten. You can use the variables ${rootDir} and ${inputRootDir}, as well as any of the source or target locale variables (${srcLoc}, ${trgLoc}, etc).

Send the TMX document to the next step — Set this option to have the generated TMX document as the only input raw document passed on to the next step of the pipeline. If this option is not set the filter events are passed on to the next step.

Mark the generated translation as machine translation results — Set this option to mark the TM entries generated as the result of machine translation. For example, when this option is set, the creationid attribute of the TMX <tu> element is set to "MT!".
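To show what such a marked entry looks like, here is a small Python sketch that builds one TMX translation unit. It is not the step's own writer: the function `make_tu` is hypothetical, and the element and attribute names (`tu`, `tuv`, `seg`, `creationid`, `xml:lang`) are those of the TMX format, where they are written in lowercase.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute used on <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tu(source_lang, source_text, target_lang, target_text, mark_as_mt):
    """Build one TMX <tu> element holding a source and a target segment."""
    tu = ET.Element("tu")
    if mark_as_mt:
        # Flag the entry as coming from machine translation.
        tu.set("creationid", "MT!")
    for lang, text in ((source_lang, source_text), (target_lang, target_text)):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = text
    return tu


tu = make_tu("en", "Hello", "fr", "Bonjour", mark_as_mt=True)
print(ET.tostring(tu, encoding="unicode"))
```

A tool reading the TMX document later can test the `creationid` value to tell machine-translated entries apart from human ones.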

Fill the target with the best translation candidate — Set this option to copy the translation with the best score (and a score above or equal to the Fill threshold) into the target (if it is empty). Only the matches returned by the Microsoft Translator engine are taken into account.

Fill threshold — If the score of the best match is below the provided value, no translation is copied into the target.
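The interplay of Threshold, Maximum matches and Fill threshold can be sketched in Python like this. It is a simplified model, not the step's code: matches are represented as hypothetical `(score, text)` pairs, and the function names are invented for the example.

```python
def keep_matches(matches, threshold, max_matches):
    """Drop matches scoring below the threshold, then keep the best ones."""
    kept = [m for m in matches if m[0] >= threshold]
    kept.sort(key=lambda m: m[0], reverse=True)
    return kept[:max_matches]


def fill_target(target, matches, fill_threshold):
    """Copy the best match into an empty target if its score is high enough."""
    if target or not matches:
        # A non-empty target is never overwritten.
        return target
    best_score, best_text = max(matches, key=lambda m: m[0])
    return best_text if best_score >= fill_threshold else target


# Hypothetical scored candidates for one source text.
candidates = [(95, "Bonjour"), (80, "Salut"), (60, "Allo")]
kept = keep_matches(candidates, threshold=70, max_matches=1)
print(kept, repr(fill_target("", kept, fill_threshold=90)))
```

With a Fill threshold higher than the best score, the target stays empty even though the match is still kept as a candidate.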


  • The Microsoft Translator API has some restrictions for high-volume usage. Contact Microsoft for details.
  • Only the first ITS Domain annotation in a batch is taken into account.
  • See also the limitations on the Microsoft Translator Connector.