Okapi Framework - User contributions [en]

Tikal - Extraction Commands

2021-04-12T21:51:39Z

Kuro2:

{{Tikal Common Menu}}
__TOC__
==Extract Files==

This command extracts the translatable content of one or more given files into an [[XLIFF|XLIFF document]]. You can then use any XLIFF-aware translation tool to translate the document (see "[[How to Translate XLIFF Documents]]" for more information). When the translation is done, you can use the [[#Merge Files|Merge Files command]] to create a new translated file in its original format.

The XLIFF documents created are placed in the same directories as the original files, and have the same name with an additional .xlf extension.

By default, some extensions are mapped to a specific filter configuration (for example: <code>.docx</code> to <code>okf_openxml</code>, <code>.odt</code> to <code>okf_openoffice</code>, <code>.po</code> to <code>okf_po</code>, etc.). You can also define your own configuration and specify it with the <code>-fc</code> option. To get a list of all available filter configurations, use the [[Tikal - Miscellaneous Commands#List Filter Configurations|List Filter Configurations command]]. For more details on the available filters and their configurations, see each [[Filters|filter's documentation]].

You can use the <code>-seg</code> option to specify that the extracted text should be segmented. Use <code>-seg</code> without a file name to use the default segmentation rules, or use "<code>-seg myRules.srx</code>" to specify your own rules. The rules file must be in SRX format. The segments are marked up according to the XLIFF 1.2 specification.

The syntax of this command is:

-x [options] inputFile [inputFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language for the output (also used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
| <code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|- valign="top"
| <code>-od outputDirectory</code> || The directory where to place the output.
|- valign="top"
|nowrap="nowrap"| <code>-pen tmDirectory| -tt [hostname[:port]]| -gs configFile| -mm [key]| -gg configFile| -apertium [configFile]| -ms configFile| -tda configFile| -lingo24 configFile| -mmt url context| -bi bilingualFile</code>
| A translation resource connector to use to translate the document: <code>-pen</code> for the [[Pensieve TM Connector]], <code>-tt</code> for the [[Translate Toolkit TM Connector]], <code>-gs</code> for the [[GlobalSight TM Connector]], <code>-mm</code> for [[MyMemory TM Connector]], <code>-gg</code> for the [[Google MT v2 Connector]], <code>-apertium</code> for the [[Apertium MT Connector]], <code>-ms</code> for the [[Microsoft Translator Connector]], <code>-tda</code> for the [[TDA Translation Repository Connector]], <code>-lingo24</code> for the [[Lingo24 Premium MT Connector]], <code>-mmt</code> for the [[ModernMT API Connector]] and <code>-bi</code> for the [[Bilingual File Connector]].

The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some Internet-based resources may be slow and result in lengthy processing times. Also be aware that some translation resources may not handle inline codes well.
|- valign="top"
| <code>-opt threshold</code> || TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.
|- valign="top"
| <code>-maketmx [tmxFile]</code> || Generates a TMX document with all the entries leveraged. You can specify the name of the document. If you do not, it is named <code>pretrans.tmx</code>.
|- valign="top"
| <code>-nocopy</code> || Ensures that the generated XLIFF files do not have a copy of the source text in the target entries if the original target does not exist.
|- valign="top"
| <code>-noalttrans</code> || Ensures that the generated XLIFF files do not have added <code><alt-trans></code> elements.
|- valign="top"
| <code>-codeattrs</code> || Enables the output of extended attributes <code>ctype</code> and <code>equiv-text</code> for inline codes.
|- valign="top"
| <code>-safe</code> || Shows a warning before overwriting output files.
|}

For example:

tikal -x *.docx *.html

Extracts all <code>.docx</code> and <code>.html</code> files in the current directory into corresponding <code>.docx.xlf</code> and <code>.html.xlf</code> XLIFF documents. The source language here is the default, which is the current language of the system. The target language by default is <code>fr</code>. No segmentation is done.

tikal -x -sl EN -tl DE -fc okf_regex-srt -ie iso-8859-1 findingNemo.srt

Extracts the sub-title file <code>findingNemo.srt</code> into a <code>findingNemo.srt.xlf</code> XLIFF document. The encoding <code>iso-8859-1</code> is used to process the input file. The filter used is the [[Regex Filter]] with the predefined configuration for SRT documents. The source language is English (<code>EN</code>) and the target language is German (<code>DE</code>). No segmentation is done.

tikal -x *.docx -seg -tl BR

Extracts all <code>.docx</code> files in the current directory into corresponding <code>.docx.xlf</code> XLIFF documents. The source language here is the default, which is the current language of the system. The target language is Breton (<code>BR</code>). The extracted text units are segmented according to the rules defined in the default SRX segmentation rules file (located in the <code>config</code> sub-directory in your Okapi main directory).

tikal -x *.odt -od toTrans -tl ZU

Extracts all <code>.odt</code> files in the current directory into corresponding <code>.odt.xlf</code> XLIFF documents placed in the <code>toTrans</code> sub-directory of the current directory. The source language here is the default, which is the current language of the system. The target language is Zulu (<code>ZU</code>). No segmentation is done, because the <code>-seg</code> option is not specified.

==Merge Files==

This command merges back into their original format one or more XLIFF documents that were created using the [[#Extract Files|Extract Files command]].

The XLIFF document names must be the names of the original files with an additional <code>.xlf</code> extension. The new documents are created in the directories where the XLIFF documents are, with a <code>.out</code> extension prepended to the original extension. For example, if your original file is <code>myFile.html</code>, the XLIFF document should be <code>myFile.html.xlf</code>, and the merged file will be <code>myFile.out.html</code>.
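The naming convention above can be sketched as two small shell functions. This is only an illustration: <code>xliff_name</code> and <code>merged_name</code> are hypothetical helper names, not part of Tikal, and the sketch assumes the original file name has an extension.

```shell
# Hypothetical helpers (not part of Tikal) that mirror the naming rules
# described above. Assumes the original file name contains an extension.

# Name of the XLIFF document created by -x for an original file.
xliff_name() {
    printf '%s.xlf\n' "$1"
}

# Name of the merged file created by -m for an XLIFF document:
# drop the trailing .xlf, then insert .out before the original extension.
merged_name() {
    original=${1%.xlf}     # myFile.html.xlf -> myFile.html
    base=${original%.*}    # myFile
    ext=${original##*.}    # html
    printf '%s.out.%s\n' "$base" "$ext"
}

xliff_name myFile.html        # prints myFile.html.xlf
merged_name myFile.html.xlf   # prints myFile.out.html
```

Such helpers can be handy in a script that needs to check whether the expected XLIFF or merged file exists before moving on to the next step.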

The syntax of this command is:

-m [options] xliffFile [xliffFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
|nowrap="nowrap"| <code>-fc configId</code> || The identifier of the filter configuration to use for the re-extraction of the original file.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the original files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-oe encoding</code> || The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-sd sourceDirectory</code> || The directory where to find the source file of the XLIFF document.
|- valign="top"
| <code>-od outputDirectory</code> || The directory where to place the output.
|}

For example:

tikal -m *.xlf -sl EN -tl DE

Merges all XLIFF documents in the current directory. The original files should be in the same directory as well. The source language is English and the target language is German.

tikal -m toTrans/*.xlf -sl EN -tl ZU -sd . -od xlated

Merges all XLIFF documents in the <code>toTrans</code> sub-directory of the current directory. The original files are in the current directory. The merged files are placed in the <code>xlated</code> sub-directory of the current directory. The source language is English and the target language is Zulu.

==Extract Files to Moses==

This command extracts the translatable content of one or more given files into a [[Moses Text Filter|text format usable by Moses]]. You can then perform various tasks on this document.

The Moses files created are placed in the same directories as the original files, and have the same name with an additional extension that is the code of the source locale.

If the option <code>-2</code> is set, the target output has the same name as the source output, but with an extension that is the code of the target locale, except if the source file ends with an extension that is the code of the source locale. In that case, the target file takes the name of the source file with the last extension replaced by the code of the target language. For example, if the English source output is <code>out.txt</code>, the target output for French is <code>out.txt.fr</code>. If the English source output is <code>out.en</code>, the target output for French is <code>out.fr</code>.
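The rule for naming the target output can be written as a short shell function. This is a minimal sketch of the rule as described here, not Tikal's own code: <code>moses_target_name</code> is a hypothetical helper name, and it takes the source output name, the source language code and the target language code as arguments.

```shell
# Hypothetical helper (not part of Tikal): derive the target output name
# from the source output name, following the rule described above.
moses_target_name() {
    src_output=$1
    src_lang=$2
    trg_lang=$3
    case $src_output in
        # The source output ends with the source locale code: replace it.
        *."$src_lang") printf '%s.%s\n' "${src_output%.*}" "$trg_lang" ;;
        # Otherwise: append the target locale code as a new extension.
        *)             printf '%s.%s\n' "$src_output" "$trg_lang" ;;
    esac
}

moses_target_name out.txt en fr   # prints out.txt.fr
moses_target_name out.en en fr    # prints out.fr
```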

The syntax of this command is:

-xm [options] inputFile

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language (used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-2</code> || Extract two files: one for the source, one for the target. The target file has as many lines as the source file. If there is an existing target segment, the target segment is extracted. Otherwise an empty line is used for the missing target.
|- valign="top"
|nowrap="nowrap"|<code>-to srcOutputFile</code> || The path of the Moses source file to generate. The last part of the path is the template filename to use. The code of the source language is added automatically. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
| <code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|}

For example:

tikal -xm myFile.html

Extracts the HTML file <code>myFile.html</code> in the current directory into the corresponding <code>myFile.html.en</code> Moses document, assuming the default source language is English.

tikal -xm myFile.xlf -2 -sl en-us -tl af

Extracts the content of the XLIFF document <code>myFile.xlf</code> into two Moses InlineText files. The first is named <code>myFile.xlf.en-us</code> and holds the source. The second is named <code>myFile.xlf.af</code> and holds the target.

tikal -xm myFile.xlf -2 -to out.txt -sl en -tl zu

Extracts the content of the XLIFF document <code>myFile.xlf</code> into two Moses InlineText files. The first is named <code>out.txt.en</code> and holds the source. The second is named <code>out.txt.zu</code> and holds the target.

==Leverage Files from Moses==

This command takes an input file, and leverages the translation found in its corresponding Moses InlineText files. The initial InlineText file should be created with the [[#Extract Files to Moses|Extract Files to Moses command]].

The filter configuration, input encoding and segmentation parameters must be the same in this command as they were in the extraction command. This is to ensure the entries between the input file and the corresponding Moses file match one-to-one.

The new documents are created in the directories where the input documents are, with a <code>.out</code> extension prepended to the original extension. For example, if your original file is <code>myFile.xlf</code>, and your target language is Zulu, the Moses InlineText document should be (by default) <code>myFile.xlf.zu</code>, and the leveraged result file will be <code>myFile.out.xlf</code>.
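One way to keep the extraction and leverage options identical is to build both command lines from the same arguments. The following is a sketch under assumptions: <code>moses_roundtrip_commands</code> is a hypothetical wrapper, not part of Tikal. It only prints the two command lines and does not run them, and it assumes file names and option values contain no spaces.

```shell
# Hypothetical wrapper (not part of Tikal): print the -xm and -lm command
# lines for one input file so that both use the same shared options
# (for example -fc, -ie and -seg). Nothing is executed.
moses_roundtrip_commands() {
    input=$1
    src_lang=$2
    trg_lang=$3
    shift 3
    shared="$*"    # remaining arguments, used unchanged by both commands
    printf 'tikal -xm %s -sl %s -tl %s%s\n' \
        "$input" "$src_lang" "$trg_lang" "${shared:+ $shared}"
    printf 'tikal -lm %s -sl %s -tl %s%s\n' \
        "$input" "$src_lang" "$trg_lang" "${shared:+ $shared}"
}

moses_roundtrip_commands myFile.xlf en zu -seg
# prints:
#   tikal -xm myFile.xlf -sl en -tl zu -seg
#   tikal -lm myFile.xlf -sl en -tl zu -seg
```

With these arguments, the leverage step would by default look for <code>myFile.xlf.zu</code> and write <code>myFile.out.xlf</code>, as described above.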

The syntax of this command is:

-lm [options] inputFile

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-oe encoding</code> || The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-totrg</code> || Copy the leveraged translation into the target, unless target content already exists.
|- valign="top"
| <code>-overtrg</code> || Copy the leveraged translation into the target, even if target content already exists.
|- valign="top"
| <code>-bpt</code> || Use the <code><bpt>/<ept>/<ph></code> notation instead of the <code><g>/<x></code> notation in <code><alt-trans></code> elements.
|- valign="top"
|nowrap="nowrap"|<code>-from mosesFile</code> || The path of the Moses InlineText file from which to leverage the text. If this option is not set, the file to leverage from is the same as the input file with the language code of the target appended as an extension. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"|<code>-to outputFile</code> || The path of the output file to generate. If this option is not set, the output is the same as the input with <code>.out</code> prepended to the file extension. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"|<code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
|nowrap="nowrap"|<code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|- valign="top"
| <code>-noalttrans</code> || Ensures that the generated XLIFF files do not have added <code><alt-trans></code> elements.
|}

For example:

tikal -lm myFile.html -tl zh

Leverages into the input file <code>myFile.html</code> the translations from the corresponding <code>myFile.html.zh</code> Moses InlineText document. The source language is the default for the platform, while the target is Chinese (<code>zh</code>). The output file is <code>myFile.out.html</code>.

tikal -lm myFile.xlf -sl en -tl ja -from trans.txt -totrg

Leverages the XLIFF file named <code>myFile.xlf</code> in the current directory using the Moses InlineText document named <code>trans.txt</code>. The source language is English and the target is Japanese. The output file is <code>myFile.out.xlf</code>. The Moses translation is copied into a <code><target></code> element only if that element is empty or does not exist in the input document. The <code><alt-trans></code> elements are added for all segments.

[[Category:Tikal]] [[Category:Filters]] [[Category:XLIFF]]

Tikal - Extraction Commands

2021-04-09T18:00:19Z

Kuro2: Removing mentions of -x1, -x2, and skeleton file as they are obsolete.

{{Tikal Common Menu}}
__TOC__
==Extract Files==

This command extracts the translatable content of one or more given files into an [[XLIFF|XLIFF document]]. You can then use any XLIFF-aware translation tool to translate the document (See "[[How to Translate XLIFF Documents]]" for more information). When the translation is done, you can use the [[#Merge Files|Merge Files command]] to create a new translated file in its original format.

The XLIFF documents created are placed in the same directories as the original files, and have the same name with an additional .xlf extension.

By default, some extensions are mapped to a specific filter configuration (for example: <code>.docx</code> to <code>okf_openxml</code>, <code>.odt</code> to <code>okf_openoffice</code>, <code>.po</code> to <code>okf_po</code>, etc.). But you can define your own configuration and specify it as well using the <code>-fc</code> option. To get a list of all available filter configurations use the [[Tikal - Miscellaneous Commands#List Filter Configurations|List Filter Configurations command]]. For more details the filters available and their configurations, see each [[Filters|filter's documentation]].

You can use the <code>-seg</code> option to specify that the extracted text should be segmented. Use <code>-seg</code> without file name to use the default segmentation rules, use "<code>-seg myRules.srx</code>" to specify your own rules. The rules file must be in SRX format. The segments are marked up according the XLIFF 1.2 specifications.

The syntax of this command is:

-x [options] inputFile [inputFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. this is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language for the output (also used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
| <code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|- valign="top"
| <code>-od outputDirectory</code> || The directory where to place the output.
|- valign="top"
|nowrap="nowrap"| <code>-pen tmDirectory| -tt [hostname[:port]]| -gs configFile| -mm [key]| -gg configFile| -apertium [configFile]| -ms configFile| -tda configFile| -lingo24 configFile| -mmt url context| -bi bilingualFile</code>
| A translation resource connector to use to translate the document: <code>-pen</code> for the [[Pensieve TM Connector]], <code>-tt</code> for the [[Translate Toolkit TM Connector]], <code>-gs</code> for the [[GlobalSight TM Connector]], <code>-mm</code> for [[MyMemory TM Connector]], <code>-gg</code> for the [[Google MT v2 Connector]], <code>-apertium</code> for the [[Apertium MT Connector]], <code>-ms</code> for the [[Microsoft Translator Connector]], <code>-tda</code> for the [[TDA Translation Repository Connector]], <code>-lingo24</code> for the [[Lingo24 Premium MT Connector]], <code>-mmt</code> for the [[ModernMT API Connector]] and <code>-bi</code> for the [[Bilingual File Connector]].

The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some Internet-based resource may be slow and result in lengthy processing time. Be also aware that some translation resources may not always provide a good handling of inline codes.
|- valign="top"
| <code>-opt threshold</code> || TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.
|- valign="top"
| <code>-maketmx [tmxFile]</code> || Generates a TMX document with all the entries leveraged. You can specify the name of the document, if you do not it will be named <code>pretrans.tmx</code>.
|- valign="top"
| <code>-nocopy</code> || Ensures that the generated XLIFF files do not have a copy of the source text in the target entries if the original target does not exists.
|- valign="top"
| <code>-noalttrans</code> || Ensures that the generated XLIFF files do not have added <code><alt-trans></code> elements.
|- valign="top"
| <code>-codeattrs</code> || Enables the output of extended attributes <code>ctype</code> and <code>equiv-text</code> for inline codes.
|- valign="top"
| <code>-safe</code> || Shows a warning before overwriting output files.
|}

For example:

tikal -x *.docx *.html

Extracts all <code>.docx</code> and .html files in the current directory into corresponding <code>.docx.xlf</code> and <code>.html.xlf</code> XLIFF documents. The source language here is the default, which is the current language of the system. The target language by default is <code>fr</code>. No segmentation is done.

tikal -x -sl EN tl DE -fc okf_regex-srt -ie iso-8859-1 findingNemo.srt

Extracts the sub-title file <code>findingNemo.srt</code> into a <code>findingNemo.srt.xlf</code> XLIFF document. The encoding <code>iso-8859-</code>1 is used to process the input file. The filter used is the [[Regex Filter]] with the predefined configuration for SRT documents. The source language is English (<code>EN</code>) and the target language is German (<code>DE</code>). No segmentation is done.

tikal -x *.docx -seg -tl BR

Extracts all <code>.docx</code> files in the current directory into corresponding <code>.docx.xlf</code> XLIFF documents. The source language here is the default, which is the current language of the system. The target language is Breton (<code>BR</code>). The extracted text units will be segmented according the rules defined in the default SRX segmentation rules file (located in the <code>config</code> sub-directory in your Okapi main directory).

tikal -x *.odt -od toTrans -tl ZU

Extracts all <code>.odt</code> files in the current directory into corresponding <code>.odt.xlf</code> XLIFF documents into the <code>toTrans</code> sub-directory of the current directory. The source language here is the default, which is the current language of the system. The target language is Breton (<code>ZU</code>). The extracted text units will be segmented according the rules defined in the default SRX segmentation rules file (located in the <code>config</code> sub-directory in your Okapi main directory).

==Merge Files==

This command merges back into their original format one or more XLIFF documents that were created using the [[#Extract Files|Extract Files command]].

The XLIFF document names must be the name of the original files with an additional <code>.xlf</code> extension. The new documents are created in the directories where the XLIFF documents are, with a <code>.out</code> extension pre-pended to the original extension. For example, if your original file is <code>myFile.html</code>, the XLIFF document should be <code>myFile.html.xlf</code>, and the merged file will be <code>myFile.out.html</code>.

The syntax of this command is:

-m [options] xliffFile [xliffFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
|nowrap="nowrap"| <code>-fc configId</code> || The identifier of the filter configuration to use for the re-extraction of the original file.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the original files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-oe encoding</code> || The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-sd sourceDirectory</code> || The directory where to find the source file of the XLIFF document.
|- valign="top"
| <code>-od outputDirectory</code> || The directory where to place the output.
|}

For example:

tikal -m *.xlf -sl EN -tl DE

Merges all XLIFF documents in the directory. The skeleton files should be in the same directory as well. The source language is English and the target language is German.

tikal -m toTrans/*.xlf -sl EN -tl ZU -sd . -od xlated

Merges all XLIFF documents in the <code>toTrans</code> sub-directory of the current directory. The skeleton files are in the current directory. The merged files are placed in the <code>xlated</code> sub-directory of the current directory. The source language is English and the target language is Zulu.

==Extract Files to Moses==

This command extracts the translatable content of one or more given files into a [[Moses Text Filter|text format usable by Moses]]. You can then perform various tasks on this document.

The Moses files created are placed in the same directories as the original files, and have the same name with an additional extension that is the code of the source locale.

If the option <code>-2</code> is set, the target output has the same name as the source output, but with an extension that is the code of the target locale, except if the source file ends with an extension that it the code of the source locale. In that case, the target file takes the name of the source file with the last extension replaced by the code of the target language. For example, if the English source output is <code>out.txt</code>, the target output for French is <code>out.txt.fr</code>. If the English source output is <code>out.en</code>, the target output for French is <code>out.fr</code>.

The syntax of this command is:

-xm [options] inputFile

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language (used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-2</code> || Extract two files: one for the source, one for the target. The target file has as many lines as the source file as lines. If there is an existing target segment, the target segment is extracted, otherwise an empty line is used for the missing target.
|- valign="top"
|nowrap="nowrap"|<code>-to srcOutputFile</code> || The path of the Moses source file to generate. The last part of the path is the template filename to use, the code of the source language is added automatically. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
| <code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|}

For example:

tikal -xm myFile.html

Extracts the HTML file <code>myFile.html</code> in the current directory into corresponding <code>myFile.html.en</code> Moses document, assuming the default source language is English.

tikal -xm myFile.xlf -2 -sl en-us -tl af

Extracts the content of the XLIFF document <code>myFile.xlf</code> into two Moses InlineText files. The first one named <code>myFile.xlf.en-us</code> for the source, the second called <code>myFile.xlf.af</code> for the target.

tikal -xm myFile.xlf -2 -to out.txt -sl en -tl zu

Extracts the content of the XLIFF document <code>myFile.xlf</code> into two Moses InlineText files. The first one named <code>out.txt.en</code> for the source, the second called <code>out.txt.zu</code> for the target.

==Leverage Files from Moses==

This command takes an input file, and leverages the translation found in its corresponding Moses InlineText files. The initial InlineText file should be created with the [[#Extract Files to Moses|Extract Files to Moses command]].

The filter configuration, input encoding and segmentation parameters must be the same in this command as they were in the extraction command. This is to ensure the entries between the input file and the corresponding Moses file match one-to-one.

The new documents are created in the directories where the input documents are, with a <code>.out</code> extension pre-pended to the original extension. For example, if your original file is <code>myFile.xlf</code>, and your target language is Zulu, the Moses InlineText document should be (by default) <code>myFile.xlf.zu</code>, and the leveraged result file will be <code>myFile.out.xlf</code>.

The syntax of this command is:

-lm [options] inputFile

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-oe encoding</code> || The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-totrg</code> || Copy the leveraged translation into the target, unless target content already exists.
|- valign="top"
| <code>-overtrg</code> || Copy the leveraged translation into the target, even if target content already exists.
|- valign="top"
| <code>-bpt</code> || Use the <code><bpt>/<ept>/<ph></code> notation instead of the <code><g>/<x></code> notation in <code><alt-trans></code> elements.
|- valign="top"
|nowrap="nowrap"|<code>-from mosesFile</code> || The path of the Moses InlineText file from which to leverage the text. If this option is not set, the file to leverage from is the same as the input file with the language code of the target appended as an extension. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"|<code>-to outputFile</code> || The path of the output file to generate. If this option is not set, the output is the same as the input with <code>.out</code> prepended to the file extension. Warning: You must not use this option if you are processing several files at the same time.
|- valign="top"
|nowrap="nowrap"|<code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
|nowrap="nowrap"|<code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|- valign="top"
| <code>-noalttrans</code> || Ensures that the generated XLIFF files do not have added <code><alt-trans></code> elements.
|}

For example:

tikal -lm myFile.html -tl zh

Leverages into the input file <code>myFile.html</code> the translation found in the corresponding <code>myFile.html.zh</code> Moses InlineText document. The source language is the default for the platform, while the target is Chinese (zh). The output file is <code>myFile.out.html</code>.

tikal -lm myFile.xlf -sl en -tl ja -from trans.txt -totrg

Leverages the XLIFF file named <code>myFile.xlf</code> in the current directory using the Moses InlineText document named <code>trans.txt</code>. The source language is English and the target is Japanese. The output file is <code>myFile.out.xlf</code>. The Moses translation is copied into a <code><target></code> element only if that element is empty or does not exist in the source document. The <code><alt-trans></code> elements are added for all segments.

[[Category:Tikal]] [[Category:Filters]] [[Category:XLIFF]]

Filters

2021-03-24T00:29:37Z

Kuro2: /* Supported File Formats */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly.
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| Tex files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

All filters support code simplification rules. By default the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>

===Rule Examples===

If Code has any of these flags then don't simplify

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

"=" is a string match. TAGTYPE can be matched against opening, closing or standalone.

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

"~" is regex match

<pre>if DATA ~ "a.*";</pre>

You can negate any of the match operators. Here, don't simplify if the DATA does not match the regex.

<pre>if DATA !~ "a.*";</pre>

Match on type: don't simplify if the Code is a linebreak

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be recursive (supports embedded parens)

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>
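To make the OR semantics of the examples above concrete, here is a minimal Python model of how a set of rules decides whether a code is protected from simplification. It is a sketch only and does not reproduce the JavaCC parser: the <code>Code</code> class, its field names, the <code>STANDALONE</code> constant and the function <code>keep_unsimplified</code> are hypothetical, and it assumes that a "~" regex must match the whole DATA string.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not Okapi's own class."""
    data: str
    type: str = ""
    tagtype: str = "STANDALONE"  # assumed constants: OPENING, CLOSING, STANDALONE
    flags: set = field(default_factory=set)  # e.g. {"DELETABLE"}


# Each predicate mirrors one rule from the examples above.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c.flags & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if TYPE = "lb";
    lambda c: c.type == "lb",
    # if DATA ~ "a.*";  (assumption: the regex must match all of DATA)
    lambda c: re.fullmatch(r"a.*", c.data) is not None,
]


def keep_unsimplified(code: Code) -> bool:
    """Rules are OR'ed: the code is protected as soon as one rule matches."""
    return any(rule(code) for rule in RULES)


print(keep_unsimplified(Code(data="<br/>", type="lb")))
print(keep_unsimplified(Code(data="<b>", type="bold", tagtype="OPENING")))
```

In this model the linebreak code is protected by the second rule, while the bold opening tag matches no rule and stays eligible for trimming and merging.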

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules>
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX documents) filters.

The following font mapping configuration options are available:
* The source language regular expression pattern: <code>en.*</code>, <code>en-UK</code>, etc. It can be left empty to apply the mapping to any source language.
* The target language regular expression pattern: <code>ru.*</code>, <code>ru-RU</code>, etc. It can be left empty to apply the mapping to any target language.
* The source font name regular expression pattern: <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be left empty to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping is ignored.

The configured font mappings are applied in the order in which they are listed, and the final target font is determined by sequential substitution of the source font values. For example, if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping replaces the font with <code>Times New Roman</code>, and the second mapping is applied to this new value, ending up with <code>Sans Serif</code>.

The serialized parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>
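The sequential substitution can be illustrated with a short Python sketch that reads parameters in the serialized form shown above and applies them in order. It is not the filters' actual implementation: the function names are hypothetical, and it assumes that each pattern must match the whole locale or font name and that an empty pattern matches anything.

```python
import re

PARAMS = """\
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

KEYS = ("sourceLocalePattern", "targetLocalePattern", "sourceFontPattern", "targetFont")


def parse_mappings(text):
    """Read the fontMappings.N.* keys into a list of dicts, ordered by N."""
    props = dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
    count = int(props.get("fontMappings.number.i", "0"))
    return [{k: props.get(f"fontMappings.{i}.{k}", "") for k in KEYS} for i in range(count)]


def matches(pattern, value):
    """Assumption: an empty pattern matches anything, else the whole value must match."""
    return pattern == "" or re.fullmatch(pattern, value) is not None


def map_font(font, src_locale, trg_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous one."""
    for m in mappings:
        if not m["targetFont"]:
            continue  # a mapping without a target font is ignored
        if (matches(m["sourceLocalePattern"], src_locale)
                and matches(m["targetLocalePattern"], trg_locale)
                and matches(m["sourceFontPattern"], font)):
            font = m["targetFont"]
    return font


print(map_font("Times New Roman", "en-US", "ru-RU", parse_mappings(PARAMS)))

# The two-step chain from the text: Arial -> Times New Roman -> Sans Serif
chain = [
    dict.fromkeys(KEYS[:2], "") | {"sourceFontPattern": "Arial", "targetFont": "Times New Roman"},
    dict.fromkeys(KEYS[:2], "") | {"sourceFontPattern": "Times New Roman", "targetFont": "Sans Serif"},
]
print(map_font("Arial", "en", "ru", chain))
```

Because every mapping is tested against the current font value, the order of the entries matters: swapping the two entries of <code>chain</code> would leave <code>Arial</code> mapped to <code>Times New Roman</code>.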

[[Category:Filters]]

ITS

2021-03-10T07:50:29Z

Kuro2: Fixing an oversight and adding a note about the namespace.

{{Standards Common Menu}}
__TOC__
==Overview==

The '''Internationalization Tag Set (ITS)''' is a W3C recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as nested sub-flows of text, and much more.

* The ITS 1.0 specification is available at http://www.w3.org/TR/its/
* The ITS 2.0 specification is available at http://www.w3.org/TR/its20/

=== Default Rules ===

By default the filter process the XML documents based on the '''ITS defaults'''. That is:

* the content of all elements is translatable,
* and none of the values of the attribute translatable.

To modify this behavior you need to associate the document with ITS rules. This can be done in different ways:

* By including global and local rules inside the document.
* By including inside the document a link to external global rules.
* By associating the document with a parameters file when running the filter. The parameters file is a set of external ITS global rules.

When processing a document, the filter...

# Assumes that all element content is translatable, and none of the attribute values are translatable.
# Applies the global rules found in the (optional) parameters file associated with the input document.
# Applies the global rules found in the document.
# And finally, applies the local rules within the document.

=== Example ===

For example, assuming that <code>ITSForDoc.xml</code> is the ITS file associated with the input file <code>Document.xml</code>, the translatable text is listed below.

<code>ITSForDoc.xml</code>:

<nowiki><its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0"></nowiki>
<its:translateRule selector="//head|//code" translate="no"/>
<its:withinTextRule selector="//b|//code|//img" withinText="yes"/>
</its:rules>

<code>Document.xml</code>:

<doc>
<head>
<update>2009-03-21</update>
<author>Mirabelle McIntosh</author>
</head>
<body>
<p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
<p>Paragraph with <code>data codes</code> and text.</p>
</body>
</doc>

The resulting text units are (with the inline codes in XLIFF 1.2 notation):

1: "Paragraph with <x id='1'/> and <g id='2'>bolded text</g>."
2: "Paragraph with <g id='1'><x id='2'/></g> and text."
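
The effect of the <code>translateRule</code> in this example can be approximated with a short Python sketch. It is only an illustration under stated assumptions: it handles selectors of the form <code>//name</code> joined by <code>|</code> (not full XPath), it uses a trimmed copy of the document, and the helper name <code>translate_flags</code> is hypothetical. It starts from the ITS default (all element content is translatable), applies the global rule, and lets child elements inherit the value of their parent.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

RULES = """<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0">
 <its:translateRule selector="//head|//code" translate="no"/>
</its:rules>"""

# Trimmed copy of Document.xml, for illustration only.
DOC = """<doc>
 <head><update>2009-03-21</update><author>Mirabelle McIntosh</author></head>
 <body>Paragraph with <code>data codes</code> and text.</body>
</doc>"""

def translate_flags(doc_root, rules_root):
    """Return {element: "yes" or "no"} after applying global translateRule elements."""
    explicit = {}
    for rule in rules_root.findall(f"{{{ITS_NS}}}translateRule"):
        for path in rule.get("selector").split("|"):
            # Simplification: only selectors of the form //name are handled.
            name = path.strip().lstrip("/")
            for elem in doc_root.iter(name):
                explicit[elem] = rule.get("translate")
    flags = {}

    def walk(elem, inherited):
        # ITS default: element content is translatable. The value is
        # inherited by child elements unless a rule selects them.
        value = explicit.get(elem, inherited)
        flags[elem] = value
        for child in elem:
            walk(child, value)

    walk(doc_root, "yes")
    return flags

flags = translate_flags(ET.fromstring(DOC), ET.fromstring(RULES))
not_translatable = sorted(e.tag for e, v in flags.items() if v == "no")
print(not_translatable)  # ['author', 'code', 'head', 'update']
```

The <code>update</code> and <code>author</code> elements are not selected by the rule directly; they end up non-translatable because they inherit the value set on <code>head</code>.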

=== Validation ===

The Relaxed project includes [http://relaxed.vse.cz/relaxed/validate?group=ITS an online validator for ITS].

Relaxed is an [http://sourceforge.net/projects/relaxed/ open-source project hosted on SourceForge].

=== Extensions ===

Several extensions have been defined by the ITS Interest Group. They are listed in the [http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures Issues and Proposed Features] section of the Interest Group wiki.

The extension namespace is http://www.w3.org/2008/12/its-extensions

=== Proper Namespace Handling ===

If the input document uses a namespace, the ITS file must use the same namespace. For example, if the input document looks like this:

<nowiki><doc xmlns="http://xmlx.org/ns/xmlx"></nowiki>
<head>
<update>2009-03-21</update>
<author>Mirabelle McIntosh</author>
</head>
<body>
<p>Paragraph with <img ref="eg.png"/> and <b>bolded text</b>.</p>
<p>Paragraph with <code>data codes</code> and text.</p>
</body>
</doc>

Then the ITS file must use the namespace like this:

<nowiki><its:rules xmlns:its="http://www.w3.org/2005/11/its" xmlns:xx="http://xmlx.org/ns/xmlx" version="1.0"></nowiki>
<its:translateRule selector="//xx:head|//xx:code" translate="no"/>
<its:withinTextRule selector="//xx:b|//xx:code|//xx:img" withinText="yes"/>
</its:rules>
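
The reason the selectors need the prefix can be seen with any namespace-aware XPath implementation. The following Python sketch uses the standard <code>xml.etree.ElementTree</code> module (not the Okapi filter itself) on a trimmed copy of the document above: the unprefixed selector matches nothing, while the prefixed one bound to the document's namespace finds the element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the namespaced input document, for illustration only.
DOC = """<doc xmlns="http://xmlx.org/ns/xmlx">
 <head><update>2009-03-21</update></head>
 <body>Paragraph with <code>data codes</code> and text.</body>
</doc>"""

root = ET.fromstring(DOC)
ns = {"xx": "http://xmlx.org/ns/xmlx"}

unqualified = root.findall(".//head")        # no namespace: matches nothing
qualified = root.findall(".//xx:head", ns)   # same namespace as the document

print(len(unqualified), len(qualified))  # prints "0 1"
```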

==ITS in the Okapi Framework==

The Okapi Framework uses ITS in several places. For example:

* The [[XML Filter]] implements most of ITS data categories for XML documents.
* The [[HTML5-ITS Filter]] implements most of ITS data categories for HTML5 documents.
* Several pre-defined [[Filters|filter configurations]] are ITS files.
* The version 2.0 of ITS has been implemented in Okapi as one of the deliverables of the MultilingualWeb-LT project funded by the European Commission.
** [[MultilingualWeb-LT_D3.1.4|Online summary of the deliverable D3.1.4]]
** [http://www.w3.org/International/multilingualweb/lt/wiki/Main_Page Working Group wiki page]
** [http://www.w3.org/International/multilingualweb/lt/ Working Group home page]

'''For an overview of the components with ITS capability, see the [[ITS Components]] page.'''

[[Category:ITS]]

Open Standards

2021-02-20T02:21:52Z

Kuro2: /* TMX - Translation Memory eXchange */ Fixing the broken link to the TMX standard.

__TOC__
The localization and translation industry uses several standards to exchange data between tools. It is very important for tools to support such standards.

* They keep your data from being locked into proprietary formats.
* Using standards also allows you to approach the translation process with a broader choice of options and more flexibility.

The applications and components of the Okapi Framework support standards when possible.

==XLIFF - XML Localisation Interchange File Format==

Maintained by the XLIFF Technical Committee at OASIS, XLIFF provides a common markup language for extracted localizable text.

* [http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html XLIFF 1.2 specification]
* [http://www.oasis-open.org/committees/xliff/ The OASIS XLIFF Technical Committee home page]
* [[XLIFF|An overview of XLIFF]]

Many components of the framework use XLIFF. The framework also includes an [[XLIFF Filter]].

==TMX - Translation Memory eXchange==

TMX covers the exchange of translation memory data.

TMX was originally maintained by the OSCAR Committee at LISA. LISA closed in March 2011. The OSCAR standards have been placed under a Creative Commons license, and the specifications moved to new hosts.

* [https://www.gala-global.org/tmx-14b TMX 1.4b specification]
* [[TMX|An overview of TMX]]

Many components of the framework use TMX. The framework also includes a [[TMX Filter]].

==SRX - Segmentation Rules eXchange==

SRX addresses the exchange of segmentation rules between tools. Version 1.0 of SRX has been implemented in different ways by different tools, which limits its usefulness for exchange. Version 2.0 of SRX has been implemented more consistently.

SRX was originally maintained by the OSCAR Committee at LISA. LISA closed in March 2011. The OSCAR standards have been placed under a Creative Commons license, and the specifications moved to new hosts.

* [http://www.gala-global.org/oscarStandards/srx/srx20.html SRX 2.0 specification]
* [[SRX|An overview of SRX]]

The segmentation engine provided in the framework implements SRX 2.0. You can see it in action in [[Ratel|Ratel, the framework's editor to create and maintain SRX documents]].

==TBX - Term Base eXchange==

TBX is designed to allow the exchange of terminology databases between tools. TBX is the same as '''ISO 30042'''. Because TBX is quite complex, its adoption has been slow, and OSCAR has defined '''TBX-Basic''', a subset of the more general TBX.

TBX was originally maintained by the OSCAR Committee at LISA. LISA closed in March 2011. The OSCAR standards have been placed under a Creative Commons license, and the specifications moved to new hosts.

* [http://www.gala-global.org/oscarStandards/tbx/tbx_oscar.pdf TBX Specification]
* [http://www.gala-global.org/oscarStandards/tbx/tbx-basic.html Information for TBX-Basic]

The [[Quality Check Step]], which is also used in [[CheckMate]], supports TBX as one of its glossary formats.

==ITS - Internationalization Tag Set==

ITS is a W3C namespace that provides internationalization information and support in XML documents.

* [http://www.w3.org/TR/its/ ITS 1.0 specification]
* [http://www.w3.org/International/its/ig/ The W3C ITS Interest Group home page]
* [http://www.w3.org/International/its/ig/simple-example.html Examples of XML documents with ITS markup]
* [[ITS|An overview of ITS]]

Several components of the framework support and use ITS. See the [[ITS Components]] page for details.

Related to ITS, the [http://www.w3.org/TR/xml-i18n-bp/ Best Practices for XML Internationalization W3C Note] can help you design and author XML documents so that they are easier to localize.

==GMX - Global information management Metrics eXchange==

GMX is a family of standards for globalization- and localization-related metrics. The three components of GMX are:

* Volume (V). Global Information Management Metrics Volume addresses the issue of quantifying the workload for a given localization or translation task. GMX-V provides a standard and more precise definition of the statistics necessary to assess the quantity of text (and costs) associated with language-related globalization tasks.
* Complexity (C) (proposed). GMX-C will provide a standard metric for the assessment of textual complexity with regard to globalization tasks. This format has not yet been defined.
* Quality (Q) (proposed). GMX-Q will provide a standard format for the specification of quality requirements for globalization tasks, thus allowing quality expectations to be specified in contracts and other agreements and verified. This format has not yet been defined.

GMX was originally maintained by the OSCAR Committee at LISA. LISA closed in March 2011. The OSCAR standards have been placed under a Creative Commons license, and the specifications moved to new hosts.

* [http://www.xtm-intl.com/manuals/gmx-v/GMX-V-2.0.html GMX-V 2.0 specification]

Steps such as the [[Word Count Step]], [[Character Count Step]], and the [[Scoping Report Step]] provided in the framework use GMX-V 2.0.

==OAXAL - Open Architecture for XML Authoring and Localization==

Maintained by OASIS, OAXAL is a reference architecture that describes a processing model for authoring and localizing XML documents using open standards.

* [http://www.oasis-open.org/committees/download.php/35736/OASIS%20Open%20Architecture%20for%20XML%20Authoring%20and%20Localization%20Reference%20Model%20%28OAXAL%29.pdf OAXAL 1.0 specification]

===OAXAL 1.0 Conformance Statement===

This statement confirms that the Okapi Framework is an OAXAL 1.0 Level 2 compliant application as per the [http://wiki.oasis-open.org/oaxal/#A4ConformanceGuidelines OAXAL Reference Architecture 1.0 Specification conformance requirements], implementing the following constituent standards:

* W3C ITS 1.0
* OASIS XLIFF 1.2
* LISA TMX 1.4b
* LISA SRX 2.0
* LISA TBX 1.0
* LISA GMX/V 1.0

[[Category:GMX]] [[Category:ITS]] [[Category:SRX]] [[Category:TMX]] [[Category:XLIFF]]

TMX Filter

2021-02-20T02:20:17Z

Kuro2: Fixing the broken link to TMX standard

{{Filters Header}}
==Overview==

The TMX Filter is an Okapi component that implements the IFilter interface for [[TMX|TMX (Translation Memory eXchange)]] documents. The filter is implemented in the class <code>net.sf.okapi.filters.tmx.TmxFilter</code> of the library.

TMX is a LISA Standard that defines a file format for transporting translation memory data from one translation tool to another. The TMX 1.4b specification is at https://www.gala-global.org/tmx-14b

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
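
The two rules above can be sketched in Python as follows. This is not the filter's code: the function name <code>detect_xml_encoding</code> is hypothetical, and the regular expression assumes the XML declaration is readable as ASCII-compatible bytes (it would not work as written for UTF-16 input).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_DECLARATION = re.compile(
    rb'\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)

def detect_xml_encoding(data, default="UTF-8"):
    """Return the declared encoding, or the default when none is declared."""
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii")
    return default

print(detect_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><tmx/>'))
print(detect_xml_encoding(b'<tmx version="1.4b"/>'))
```

The first call reports the declared encoding; the second falls back to UTF-8 because the document has no declaration.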

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
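
Expressed as code, the Byte-Order-Mark rule for UTF-8 output reduces to a single condition. The sketch below is an illustration only; the function name and the way the encoding name is normalized are assumptions, and it covers only the case where the output encoding is UTF-8.

```python
def use_bom_for_utf8_output(input_encoding, input_had_bom):
    """Decide whether a UTF-8 output document gets a Byte-Order-Mark."""
    input_is_utf8 = input_encoding.replace("-", "").lower() == "utf8"
    # A BOM is written only when the input was UTF-8 and had one itself.
    return input_is_utf8 and input_had_bom

print(use_bom_for_utf8_output("UTF-8", True))       # True
print(use_bom_for_utf8_output("UTF-8", False))      # False
print(use_bom_for_utf8_output("ISO-8859-1", True))  # False
```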

===Line-Breaks===

The type of line-breaks of the output is the same as the one of the original input.

==Parameters==

<cite>Read all target entries</cite> — Set this option to read all target <code><tuv></code> elements into the text unit. Otherwise only the selected target is read and all remaining ones become part of the skeleton. Default is True. Any effect this setting has depends on the following pipeline steps and the ability they have to process multiple targets.

<cite>Group all document parts skeleton into one</cite> — Set this option to consolidate the skeleton parts and send fewer events through the pipeline. Default is True. This is sufficient in most cases but as a pipeline developer sometimes you might want to have access to more fine-grained resources in the pipeline.

<cite>Exit when encountering invalid <tu>s</cite> — By default invalid <tu>s are skipped along with warning message(s). By using this default setting or ignoring the warning messages you might run the risk of getting a processed file that doesn't match the input file. Check this box if you want to be notified immediately of invalid content and want to correct the file before re-running it.

<cite>Creates or not a segment for the extracted <tu></cite> — Use this option to choose whether a segment is created for each extracted <code><tu></code> entry.
The following options are available:
* <cite>Always creates the segment</cite> - Creates the segment regardless of the value of the <code>segtype</code> attribute.
* <cite>Never creates the segment</cite> - Never creates the segment, even if the <code>segtype</code> attribute is set to "sentence".
* <cite>Creates the segment if segtype is 'sentence' or is undefined</cite> - Creates the segment when the <code>segtype</code> attribute is set to "sentence" or is not defined.
* <cite>Creates the segment only if segtype is 'sentence'</cite> - Creates the segment only if the <code>segtype</code> attribute is set to "sentence".
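
The four choices amount to a small decision table on the <code>segtype</code> value. The following Python sketch shows that logic; the constant names and the function <code>creates_segment</code> are hypothetical and do not correspond to identifiers in the filter.

```python
# Hypothetical names for the four options listed above.
ALWAYS, NEVER, IF_SENTENCE_OR_UNDEFINED, ONLY_IF_SENTENCE = range(4)

def creates_segment(option, segtype):
    """Return True if a segment is created.

    segtype is the attribute value, or None when the attribute is absent.
    """
    if option == ALWAYS:
        return True
    if option == NEVER:
        return False
    if option == IF_SENTENCE_OR_UNDEFINED:
        return segtype in ("sentence", None)
    if option == ONLY_IF_SENTENCE:
        return segtype == "sentence"
    raise ValueError(f"unknown option: {option!r}")

print(creates_segment(IF_SENTENCE_OR_UNDEFINED, None))  # True
print(creates_segment(ONLY_IF_SENTENCE, None))          # False
```

The last two options differ only in how they treat an entry with no <code>segtype</code> information.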

<cite>Escape the greater-than characters</cite> — Set this option to have all greater-than characters ('<code>></code>') escaped as "<code>&gt;</code>" in the output.

<cite>Duplicate property value separator string</cite> — This string will be used to separate duplicate property values. Default is ", "

==Limitations==

The <code>&lt;sub&gt;</code> element is not supported. When such an element is found, a warning is issued, and the element content is put with the content of its parent element.
The filter is not able to reconstruct any DTD declaration.

[[Category:Filters]] [[Category:TMX]]

TMX

2021-02-20T02:16:24Z

Kuro2: Fixed broken link to TMX standard.

{{Standards Common Menu}}
__TOC__
==Overview==

TMX (Translation Memory eXchange) was originally maintained by the OSCAR special interest group of the Localisation Industry Standards Association (LISA). LISA closed in March 2011 and its standards were placed under a Creative Commons license.

The purpose of TMX is to allow any tool using translation memories to import and export databases between its own native format and a common format. This keeps users from being tied to a specific tool and ensures that the asset their TM databases constitute can survive the rise and fall of successive generations of translation tools.

The version 1.4b is the latest specification and can be found here: https://www.gala-global.org/tmx-14b .

Two important aspects of TMX to keep in mind:

* For text that includes inline codes (such as formatting, images, etc.), tools that support TMX Level 1 only are not providing true interoperability since they will lose all inline codes.
* TMX does not provide a standard for segmentation. Therefore there is no guarantee that a TM will yield the same results when moved from one tool to another, even for exact matches. This is not a problem specific to TMX, but a general issue of segmentation that occurs regardless of which format you use to migrate your TMs. The adoption of the [[SRX|SRX (Segmentation Rules eXchange format)]] helps carry the information about how the segments of the TM were made, but this does not solve all the issues.

==Example==

Example of TMX document with one entry:

<pre>
<tmx version="1.4b">
<header creationtool="XYZTool" creationtoolversion="1.01-023"
datatype="PlainText" segtype="sentence"
adminlang="en-us" srclang="EN"
o-tmf="ABCTransMem">
</header>
<body>
<tu>
<tuv xml:lang="en">
<seg>Text in <bpt i="1">&lt;B></bpt>bold<ept i="1">&lt;/B></ept>.</seg>
</tuv>
<tuv xml:lang="fr">
<seg>Texte en <bpt i="1">&lt;B></bpt>gras<ept i="1">&lt;/B></ept>.</seg>
</tuv>
</tu>
</body>
</tmx>
</pre>
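
To make the point about inline codes concrete, the sketch below reads the example with Python's standard <code>xml.etree.ElementTree</code> module and returns each segment twice: once with the content of the <code><bpt></code> and <code><ept></code> elements, and once without it. The second form approximates what a tool that ignores inline codes would keep. The helper name <code>read_segments</code> is hypothetical, and the sketch has nothing to do with how Okapi itself reads TMX.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX = """<tmx version="1.4b">
 <header creationtool="XYZTool" creationtoolversion="1.01-023"
  datatype="PlainText" segtype="sentence"
  adminlang="en-us" srclang="EN"
  o-tmf="ABCTransMem">
 </header>
 <body>
  <tu>
   <tuv xml:lang="en">
    <seg>Text in <bpt i="1">&lt;B></bpt>bold<ept i="1">&lt;/B></ept>.</seg>
   </tuv>
   <tuv xml:lang="fr">
    <seg>Texte en <bpt i="1">&lt;B></bpt>gras<ept i="1">&lt;/B></ept>.</seg>
   </tuv>
  </tu>
 </body>
</tmx>"""

def read_segments(tmx_text):
    """Return {language: (text_with_codes, text_without_codes)} for the first <tu>."""
    tu = ET.fromstring(tmx_text).find("body/tu")
    result = {}
    for tuv in tu.findall("tuv"):
        seg = tuv.find("seg")
        with_codes = "".join(seg.itertext())
        # Keep only the text outside the inline-code elements.
        without_codes = (seg.text or "") + "".join(child.tail or "" for child in seg)
        result[tuv.get(XML_LANG)] = (with_codes, without_codes)
    return result

segments = read_segments(TMX)
print(segments["en"])  # ('Text in <B>bold</B>.', 'Text in bold.')
print(segments["fr"])  # ('Texte en <B>gras</B>.', 'Texte en gras.')
```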

==TMX in the Okapi Framework==

The Okapi Framework uses TMX in many places. For example:

* The [[TMX Filter]] processes TMX input as any other multilingual input.
* Some steps, like the [[Format Conversion Step]], can convert to and from TMX.
* Other steps, like the [[Leveraging Step]], can generate TMX documents.
* See also [[:Category:TMX]]

[[Category:TMX]]

Microsoft Translator Connector

2019-08-23T08:44:22Z

Kuro2: /* Limitations */

{{Connectors Header}}
__TOC__
==Overview==
The Microsoft [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/ Translator Text Service] provides machine translation over a REST API. The service supports a large number of language pairs, both common and less common. The list is available at [https://docs.microsoft.com/en-us/azure/cognitive-services/Translator/language-support#translation]. (Please see the list under '''V3 Translator API'''.)

This connector uses [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference the V3 API]. To use this connector you need an '''Azure Key''' from Microsoft. See [https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure#signup the Microsoft pages] for more information.

For more examples on how to use this connector see the article "[[Trying out the Microsoft Translator Connector]]" in the [[Knowledge Base]]. See also the [[Microsoft Batch Translation Step]].

==Parameters==

<cite>Azure Key</cite> — The Microsoft Azure key used to access the Translator Text service and other Microsoft Cognitive Services.

<cite>Category</cite> — An optional category to use when working with trained engines. The service defaults to "general" if none is supplied.

Example of a configuration file:

#v1
azureKey=myAzureKey
category=general

==Details==
==== Calculation of the combined score ====

The original score of the query is preserved in the <code>score</code> field of the query result.

The <code>combinedScore</code> of the query result holds a re-calculated value that takes into account both the <code>MatchDegree</code> and <code>Rating</code> values returned by the engine.

For results with a <code>MatchDegree</code> of 90 or above, the combined score is computed by adding the <code>Rating</code> value minus 10 to the <code>MatchDegree</code>. For results with a <code>MatchDegree</code> below 90, the combined score is simply the <code>MatchDegree</code>.

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''MatchDegree''' || '''Rating''' || '''Combined Score'''
|- valign="top"
| 100 || 5 || 95 (i.e. 100+(5-10))
|- valign="top"
| 100 || 6 || 96 (i.e. 100+(6-10))
|- valign="top"
| 100 || 0 || 90 (i.e. 100+(0-10))
|- valign="top"
| 100 || -3 || 87 (i.e. 100+(-3-10))
|- valign="top"
| 98 || 9 || 97 (i.e. 98+(9-10))
|- valign="top"
| 95 || 5 || 90 (i.e. 95+(5-10))
|}

This calculation is far from perfect, especially when ranking highly rated high fuzzy matches against low-rated exact matches. But such entries are difficult to rank even manually. We will try to improve this scoring and welcome any feedback you may have.

If a result has no <code>Rating</code> the default is set to 5. Unverified MT translation will generally return a <code>MatchDegree</code> of 100 and a <code>Rating</code> of 5, which will compute into a combined score of 95 in the Okapi interface.
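
The rule described in this section can be written as a short Python function. This is a sketch of the arithmetic as documented here, not the connector's source code; the function name <code>combined_score</code> is an assumption.

```python
def combined_score(match_degree, rating=None):
    """Combine MatchDegree and Rating as described above."""
    if rating is None:
        rating = 5  # default when the result carries no Rating
    if match_degree >= 90:
        return match_degree + (rating - 10)
    return match_degree  # below 90 the MatchDegree is used unchanged

# Rows of the table above, plus a MatchDegree below 90.
for degree, rating in [(100, 5), (100, 6), (100, 0), (100, -3), (98, 9), (95, 5), (80, 9)]:
    print(degree, rating, combined_score(degree, rating))
```

With the default rating, <code>combined_score(100)</code> gives 95, which is the value mentioned above for unverified MT results.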

==Limitations==

* According to the [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-translate?tabs=curl#request-body API document], at most 100 JSON array elements can be supplied and the entire text cannot exceed 5000 characters.
* The service may, on occasion, not return the proper spaces. This happens especially when there are inline codes present in the source.
* Only the translation feature of the Translator Text Service is supported by the connector. Obtaining a list of supported languages, transliteration, or language identification (detection) is not supported.
* Only the category parameter can be specified. Profanity detection and deletion, script conversion, and other features are not supported.
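
A caller that wants to stay within the first limitation has to split its input before sending it. The Python sketch below shows one way to group texts so that each request holds at most 100 elements and at most 5000 characters. It is an illustration only: the function name <code>batch_texts</code> is hypothetical, it is not how the connector batches its requests, and it counts characters with Python's <code>len()</code>, which may not match exactly how the service counts them.

```python
def batch_texts(texts, max_items=100, max_chars=5000):
    """Group texts into batches that respect the per-request limits."""
    batches, current, current_chars = [], [], 0
    for text in texts:
        if len(text) > max_chars:
            raise ValueError("a single text exceeds the per-request character limit")
        too_many = len(current) >= max_items
        too_long = current_chars + len(text) > max_chars
        if current and (too_many or too_long):
            # Close the current batch and start a new one.
            batches.append(current)
            current, current_chars = [], 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches

print([len(batch) for batch in batch_texts(["a"] * 250)])  # [100, 100, 50]
print(len(batch_texts(["x" * 3000, "y" * 3000])))          # 2
```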

==History==
===Retirement of version 2 API===
Microsoft retired its version 2 API on 2019-4-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, no longer works on and after 2019-5-01.

The support of the version 3 API has been added to the M38 snapshot version from [http://okapiframework.org/snapshots/ here] in April 2019.

Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports the translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

===Old Parameters Prior To M32===

<cite>Client ID</cite> — The Client ID to use to connect to the MT server. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

<cite>Secret</cite> — The secret corresponding to the Client ID.

<cite>Category</cite> — An optional category to use when working with trained engines.

Example of a configuration file:

#v1
clientId=myPersonalClientID
secret=theSecretForThatClientID

[[Category:Connectors]]

Trying out the Microsoft Translator Connector

2019-08-16T01:21:23Z

Kuro2:

__TOC__

==Overview==
The [[Microsoft Translator Connector]] is an Okapi component that connects to [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/ Microsoft Translator Text Service] (referred to as '''Translator Service''' hereafter), which is part of the Microsoft Cognitive Services.

This wiki page explains how to try out the Translator Service using the Tikal command line utility.

==Retirement of version 2 API==
Microsoft has retired their version 2 API on 2019-4-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, no longer works on and after 2019-5-01.

The support of the version 3 API has been added to Okapi in mid April after the M37 release. To use Microsoft's machine translation service, please pick up the M38 snapshot version from [http://okapiframework.org/snapshots/ here].

The rest of this page assumes that you are using the M38 snapshot version built after mid April, 2019, the M38 stable release (which has not been released as of this writing in mid August, 2019), or later.

==Obtaining Azure Key==
To use the Microsoft Translator Connector, you need an Azure Key.
If you already have a key for version 2 API, the same key should work.
Otherwise, please read [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].
Microsoft issues a key free of charge with certain limitations, which is enough to try out the connector as described in this page.

== Searching Translations ==

=== Manual Queries ===
[[Tikal]] provides a way to try out the connector easily.

First you need to create a configuration file that looks like:

#v1
azureKey=your-azure-key
baseURL=the-base-url

using a text editor.
Here ''your-azure-key'' is the Azure Key that was obtained from Microsoft.
''the-base-url'' is one of the URLs listed in the [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference#base-urls Base URLs] section of the API Reference.

For example (warning: the Azure Key here is not valid):
#v1
azureKey=4f4cfe47becf471a0123456789abcdef
baseURL=https://api-nam.cognitive.microsofttranslator.com

We assume you have saved this file as <code>config.cfg</code>.
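
If you want to check the file before running Tikal, its <code>key=value</code> layout is easy to read with a few lines of Python. The sketch below is only a sanity check written for this page: the function name <code>parse_config</code> is hypothetical, it treats every line starting with <code>#</code> (including the <code>#v1</code> header) as something to skip, and it does not try to reproduce every detail of how Okapi reads parameter files.

```python
def parse_config(text):
    """Read simple key=value lines into a dictionary."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the #v1 header
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params

example = """#v1
azureKey=4f4cfe47becf471a0123456789abcdef
baseURL=https://api-nam.cognitive.microsofttranslator.com
"""

config = parse_config(example)
missing = [name for name in ("azureKey", "baseURL") if not config.get(name)]
print(missing)  # prints [] when both entries are present
```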

Now you can use the connector with Tikal. Try for instance:

tikal.sh -q "This is a test" -sl en -tl fr -ms config.cfg

(On a Windows system, type "tikal" instead of "tikal.sh".)

(On a Linux/Unix/macOS system where PATH doesn't include ".", type "./tikal.sh" instead.)

This command line uses the following parameters:

* <code>-q "This is a test"</code> indicates that we want to search for a translation (i.e. [[Tikal - Translation Commands#Query Translation Resources|do a query]]) and the source text to search for is "<code>This is a test</code>".
* <code>-sl en</code> indicates that the source language is English
* <code>-tl fr</code> indicates that the target language is French
* <code>-ms config.cfg</code> specifies to use the [[Microsoft Translator Connector]] and to use <code>config.cfg</code> for the connector's configuration.

This should give you back something like:

= From net.sf.okapi.connectors.microsoft.MicrosoftMTConnector (en->fr)
Threshold=-10, Maximum hits=1
Engine: 'general'
score: 95, origin: 'Microsoft-Translator' (from MT)
Source: "This is a test"
Target: "C'est un test"

=== With the [[Leveraging Step]] ===
The connector is available in the [[Leveraging Step]], so you can use it on any pipeline you need.

You can also use Tikal's [[Tikal - Translation Commands#Translate Files|Translate Files]] command to directly process a file supported by Okapi. For example, the following command creates an output file <code>myFile.out.docx</code> translated into Japanese, provided the file is small enough to be processed within the limitations of your license.

tikal.sh -t myFile.docx -sl en -tl ja -ms config.cfg

=== With the [[Microsoft Batch Translation Step]] ===
The [[Microsoft Batch Translation Step]] can also be used to generate the target text using the Translator Service.

For example, to translate any document for which Okapi has a filter you can use the following pipeline:

: = [[Raw Document to Filter Events Step]]
: + [[Microsoft Batch Translation Step]]
: + [[Filter Events to Raw Document Step]]

The Microsoft Batch Translation Step is preferred over the [[Leveraging Step]] because it sends many pieces (paragraphs) of text in one batch and is more efficient. However, a batch may contain more text, or larger pieces of text, than the Translator Service allows. If that happens, a workaround is to use the Leveraging Step.

==Obsolete Features==
The following features are no longer supported because the Translator Service no longer supports the underlying features:
* The Translator Service no longer has a built-in translation memory feature.
* [[Microsoft Batch Submission Step]]
* The threshold and the number of maximum hits that could be specified with <code>-opt</code> command line flag for Tikal or the Microsoft Batch Translation Step UI have no effect.

[[Category:Connectors]] [[Category:Tikal]]

Trying out the Microsoft Translator Connector

2019-08-16T01:19:28Z

Kuro2: Major update

__TOC__

Warning: This page is being updated and is not fully accurate. (2019-8-15)

==Overview==
The [[Microsoft Translator Connector]] is an Okapi component that connects to [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/ Microsoft Translator Text Service] (referred to as '''Translator Service''' hereafter), which is part of the Microsoft Cognitive Services.

This wiki page explains how to try out the Translator Service using the Tikal command line utility.

==Retirement of version 2 API==
Microsoft has retired their version 2 API on 2019-4-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, no longer works on and after 2019-5-01.

The support of the version 3 API has been added to Okapi in mid April after the M37 release. To use Microsoft's machine translation service, please pick up the M38 snapshot version from [http://okapiframework.org/snapshots/ here].

The rest of this page assumes that you are using the M38 snapshot version built after mid April, 2019, the M38 stable release (which has not been released as of this writing in mid August, 2019), or later.

==Obtaining Azure Key==
To use the Microsoft Translator Connector, you need an Azure Key.
If you already have a key for version 2 API, the same key should work.
Otherwise, please read [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].
Microsoft issues a key free of charge with certain limitations, which is enough to try out the connector as described in this page.

== Searching Translations ==

=== Manual Queries ===

[[Tikal]] provides a way to try out the connector easily.

First you need to create a configuration file that looks like:

#v1
azureKey=your-azure-key
baseURL=the-base-url

using a text editor.
Here ''your-azure-key'' is the Azure Key that was obtained from Microsoft.
''the-base-url'' is one of the URLs listed in the [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference#base-urls Base URLs] section of the API Reference.

For example (warning: the Azure Key here is not valid):
#v1
azureKey=4f4cfe47becf471a0123456789abcdef
baseURL=https://api-nam.cognitive.microsofttranslator.com

We assume you have saved this file as <code>config.cfg</code>.

Now you can use the connector with Tikal. Try for instance:

tikal.sh -q "This is a test" -sl en -tl fr -ms config.cfg

(On a Windows system, type "tikal" instead of "tikal.sh".)

(On a Linux/Unix/macOS system where PATH doesn't include ".", type "./tikal.sh" instead.)

This command line uses the following parameters:

* <code>-q "This is a test"</code> indicates that we want to search for a translation (i.e. [[Tikal - Translation Commands#Query Translation Resources|do a query]]) and the source text to search for is "<code>This is a test</code>".
* <code>-sl en</code> indicates that the source language is English
* <code>-tl fr</code> indicates that the target language is French
* <code>-ms config.cfg</code> specifies to use the [[Microsoft Translator Connector]] and to use <code>config.cfg</code> for the connector's configuration.

This should give you back something like:

= From net.sf.okapi.connectors.microsoft.MicrosoftMTConnector (en->fr)
Threshold=-10, Maximum hits=1
Engine: 'general'
score: 95, origin: 'Microsoft-Translator' (from MT)
Source: "This is a test"
Target: "C'est un test"

=== With the [[Leveraging Step]] ===

The connector is available in the [[Leveraging Step]], so you can use it on any pipeline you need.

You can also use Tikal's [[Tikal - Translation Commands#Translate Files|Translate Files]] command to directly process a file supported by Okapi. For example, the following command creates an output file <code>myFile.out.docx</code> translated into Japanese, provided the file is small enough to be processed within the limitations of your license.

tikal.sh -t myFile.docx -sl en -tl ja -ms config.cfg

=== With the [[Microsoft Batch Translation Step]] ===

[[Image:MSBatchTranslation.png|thumb|600px|Microsoft Batch Translation Step (Windows 7)]]
The [[Microsoft Batch Translation Step]] can also be used to generate the target text using the Translator Service.

For example, to translate any document for which Okapi has a filter you can use the following pipeline:

: = [[Raw Document to Filter Events Step]]
: + [[Microsoft Batch Translation Step]]
: + [[Filter Events to Raw Document Step]]

The Microsoft Batch Translation Step is preferred over the [[Leveraging Step]] because it sends many pieces (paragraphs) of text in one batch and is more efficient. However, a batch may contain more text, or larger pieces of text, than the Translator Service allows. If that happens, a workaround is to use the Leveraging Step.

==Obsolete Features==
The following features are no longer supported because the Translator Service no longer supports the underlying features:
* The Translator Service no longer has a built-in translation memory feature.
* [[Microsoft Batch Submission Step]]
* The threshold and the number of maximum hits that could be specified with <code>-opt</code> command line flag for Tikal or the Microsoft Batch Translation Step UI have no effect.

[[Category:Connectors]] [[Category:Tikal]]

Microsoft Translator Connector

2019-08-15T01:07:00Z

Kuro2: Minor fix

{{Connectors Header}}
__TOC__
==Overview==
The Microsoft [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/ Translator Text Service] provides machine translation over a REST API. The service supports a large number of language pairs, both common and less common. The list is available at [https://docs.microsoft.com/en-us/azure/cognitive-services/Translator/language-support#translation]. (Please see the list under '''V3 Translator API'''.)

This connector uses [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference the V3 API]. To use this connector you need an '''Azure Key''' from Microsoft. See [https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure#signup the Microsoft pages] for more information.

For more examples on how to use this connector see the article "[[Trying out the Microsoft Translator Connector]]" in the [[Knowledge Base]]. See also the [[Microsoft Batch Translation Step]].

==Parameters==

<cite>Azure Key</cite> — The Microsoft Azure key used to access the Translator Text service and other Microsoft Cognitive Services.

<cite>Category</cite> — An optional category to use when working with trained engines. The service defaults to "general" if none is supplied.

Example of a configuration file:

#v1
azureKey=myAzureKey
category=general
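As a minimal sketch, a file in this form could also be generated from a script. The helper name <code>write_connector_config</code> is hypothetical and not part of Okapi; only the <code>#v1</code> header and the <code>azureKey</code> and <code>category</code> keys come from the example above.

```python
from pathlib import Path


def write_connector_config(path, azure_key, category="general"):
    """Write a connector configuration file in the '#v1' key=value form.

    Hypothetical helper (not part of Okapi); the key names mirror the
    example configuration file shown above.
    """
    lines = ["#v1", f"azureKey={azure_key}", f"category={category}"]
    text = "\n".join(lines) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text


if __name__ == "__main__":
    # Writes config.cfg in the current directory and echoes its content.
    print(write_connector_config("config.cfg", "myAzureKey"), end="")
```

The resulting file can then be passed to Tikal with the <code>-ms</code> option.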

==Details==
==== Calculation of the combined score ====

The original score of the query is preserved in the <code>score</code> field of the query result.

The <code>combinedScore</code> of the query result holds a re-calculated value that takes into account both the <code>MatchDegree</code> and <code>Rating</code> values returned by the engine.

For results with a <code>MatchDegree</code> of 90 or above, the combined score is computed by adding the <code>Rating</code> value minus 10 to the <code>MatchDegree</code>. For results with a <code>MatchDegree</code> below 90, the combined score is simply the <code>MatchDegree</code>.

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''MatchDegree''' || '''Rating''' || '''Combined Score'''
|- valign="top"
| 100 || 5 || 95 (i.e. 100+(5-10))
|- valign="top"
| 100 || 6 || 96 (i.e. 100+(6-10))
|- valign="top"
| 100 || 0 || 90 (i.e. 100+(0-10))
|- valign="top"
| 100 || -3 || 87 (i.e. 100+(-3-10))
|- valign="top"
| 98 || 9 || 97 (i.e. 98+(9-10))
|- valign="top"
| 95 || 5 || 90 (i.e. 95+(5-10))
|}

This calculation is far from perfect, especially when comparing highly rated high fuzzy matches with low-rated exact matches. But such entries are difficult to rank even manually. We will try to improve this scoring and welcome any feedback you may have.

If a result has no <code>Rating</code> the default is set to 5. Unverified MT translation will generally return a <code>MatchDegree</code> of 100 and a <code>Rating</code> of 5, which will compute into a combined score of 95 in the Okapi interface.
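The rule described in this section can be sketched as a small function. This is only an illustration of the arithmetic: the function name and signature are assumptions, not the connector's actual code.

```python
def combined_score(match_degree, rating=None):
    """Combine MatchDegree and Rating as described in the text above.

    Sketch only: the name and signature are assumptions, not the
    connector's actual implementation.
    """
    if rating is None:
        rating = 5  # default used when the result carries no Rating
    if match_degree >= 90:
        return match_degree + (rating - 10)
    return match_degree  # below 90 the Rating is not taken into account


if __name__ == "__main__":
    # Rows of the table above, plus one fuzzy match below 90.
    rows = [(100, 5), (100, 6), (100, 0), (100, -3), (98, 9), (95, 5), (80, 9)]
    for degree, rating in rows:
        print(degree, rating, combined_score(degree, rating))
```

With the default rating, an unverified MT result with a <code>MatchDegree</code> of 100 yields 95, which matches the first row of the table.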

==Limitations==

* According to the [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-translate?tabs=curl#request-body API document], at most 100 JSON array elements can be supplied and the entire text cannot exceed 5000 characters.
* The service may, on occasion, not generate back the proper spaces. This happens especially when there are inline codes present in the source.
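The two request limits quoted above suggest splitting a list of segments into batches before sending them. The sketch below is illustrative only: <code>split_into_batches</code> is a hypothetical helper, not the connector's code, and it counts characters with Python's <code>len</code>, which may differ from how the service counts them.

```python
def split_into_batches(texts, max_items=100, max_chars=5000):
    """Group texts so each batch has at most max_items entries and
    at most max_chars characters in total.

    A single text longer than max_chars is placed alone in its own
    batch; this sketch does not try to split it further.
    """
    batches = []
    current = []
    current_chars = 0
    for text in texts:
        too_many = len(current) >= max_items
        too_long = current_chars + len(text) > max_chars
        if current and (too_many or too_long):
            # Close the current batch before adding this text.
            batches.append(current)
            current = []
            current_chars = 0
        current.append(text)
        current_chars += len(text)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    segments = ["x" * 10] * 250
    print([len(batch) for batch in split_into_batches(segments)])
```

A text that exceeds the character limit on its own still needs separate handling, for example by segmenting it first.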

==History==
===Retirement of version 2 API===
Microsoft retired their version 2 API on 2019-04-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, no longer works as of 2019-05-01.

Support for the version 3 API was added in April 2019 to the M38 snapshot version, available from [http://okapiframework.org/snapshots/ here].

Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an Azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

===Old Parameters Prior To M32===

<cite>Client ID</cite> — The Client ID to use to connect to the MT server. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

<cite>Secret</cite> — The secret corresponding to the Client ID.

<cite>Category</cite> — An optional category to use when working with trained engines.

Example of a configuration file:

#v1
clientId=myPersonalClientID
secret=theSecretForThatClientID

[[Category:Connectors]]

Microsoft Translator Connector

2019-08-15T01:06:15Z

Kuro2: Update to match v3 API, first attempt

{{Connectors Header}}
__TOC__
==Overview==
The Microsoft [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/ Translator Text Service] provides machine translation over a REST API. The service supports a large number of language pairs, both common and less common. The list is available at [https://docs.microsoft.com/en-us/azure/cognitive-services/Translator/language-support#translation]. (Please see the list under '''V3 Translator API'''.)

This connector uses [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-reference the V3 API]. To use this connector you need an '''Azure Key''' from Microsoft. See [https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure#signup the Microsoft pages] for more information.

For more examples on how to use this connector see the article "[[Trying out the Microsoft Translator Connector]]" in the [[Knowledge Base]]. See also the [[Microsoft Batch Translation Step]].

==Parameters==

<cite>Azure Key</cite> — The Microsoft Azure key used to access the Translator Text service and other Microsoft Cognitive Services.

<cite>Category</cite> — An optional category to use when working with trained engines. The service defaults to "general" if none is supplied.

Example of a configuration file:

#v1
azureKey=myAzureKey
category=general

==Details==
==== Calculation of the combined score ====

The original score of the query is preserved in the <code>score</code> field of the query result.

The <code>combinedScore</code> of the query result holds a re-calculated value that takes into account both the <code>MatchDegree</code> and <code>Rating</code> values returned by the engine.

For results with a <code>MatchDegree</code> of 90 or above, the combined score is computed by adding the <code>Rating</code> value minus 10 to the <code>MatchDegree</code>. For results with a <code>MatchDegree</code> below 90, the combined score is simply the <code>MatchDegree</code>.

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''MatchDegree''' || '''Rating''' || '''Combined Score'''
|- valign="top"
| 100 || 5 || 95 (i.e. 100+(5-10))
|- valign="top"
| 100 || 6 || 96 (i.e. 100+(6-10))
|- valign="top"
| 100 || 0 || 90 (i.e. 100+(0-10))
|- valign="top"
| 100 || -3 || 87 (i.e. 100+(-3-10))
|- valign="top"
| 98 || 9 || 97 (i.e. 98+(9-10))
|- valign="top"
| 95 || 5 || 90 (i.e. 95+(5-10))
|}

This calculation is far from perfect, especially when comparing highly rated high fuzzy matches with low-rated exact matches. But such entries are difficult to rank even manually. We will try to improve this scoring and welcome any feedback you may have.

If a result has no <code>Rating</code> the default is set to 5. Unverified MT translation will generally return a <code>MatchDegree</code> of 100 and a <code>Rating</code> of 5, which will compute into a combined score of 95 in the Okapi interface.

==Limitations==

* According to the [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-translate?tabs=curl#request-body API document], at most 100 JSON array elements can be supplied and the entire text cannot exceed 5000 characters.
* The service may, on occasion, not generate back the proper spaces. This happens especially when there are inline codes present in the source.

==History==
===Retirement of version 2 API===
Microsoft retired their version 2 API on 2019-04-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, no longer works as of 2019-05-01.

Support for the version 3 API was added in April 2019 to the M38 snapshot version, available from [http://okapiframework.org/snapshots/ here].

Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an Azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

===Old Parameters Prior To M32===

<cite>Client ID</cite> — The Client ID to use to connect to the MT server. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

<cite>Secret</cite> — The secret corresponding to the Client ID.

<cite>Category</cite> — An optional category to use when working with trained engines.

Example of a configuration file:

#v1
clientId=myPersonalClientID
secret=theSecretForThatClientID

[[Category:Connectors]]

Microsoft Batch Translation Step

2019-04-25T03:36:32Z

Kuro2:

{{Steps Header}}
__TOC__
==Retirement of version 2 API==
The MICROSOFT CONNECTOR in the Okapi stable releases will STOP WORKING at the end of April 2019.

Microsoft will retire their version 2 API on 2019-04-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, will no longer work from 2019-05-01 onward.

Support for the version 3 API was added to Okapi in mid-April, after the M37 release. If you need to use Microsoft's machine translation service, please pick up the M38 snapshot version from [http://okapiframework.org/snapshots/ here].
Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an Azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

'''Information below is mostly out of date. It is kept as reference until full update of this page is done.'''

==Overview==

This step annotates text units of the input documents with [http://www.microsofttranslator.com Microsoft Translator] candidates and/or creates a TM from them.

Takes: Filter events. Sends: Filter events (possibly annotated) or raw document.

You must have a "Client ID" and a "Client Secret" from Microsoft to use this step. You get those by obtaining a Windows Live ID and then registering an application in your Live account. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

You must also respect Microsoft's Terms of Service. If you intend to use the Microsoft Translator API for commercial or high volume purposes, you would need to sign a commercial license agreement and provide your AppID to the Microsoft Translator team. For more details contact [mailto:mtlic@microsoft.com mtlic@microsoft.com].

Text units flagged as non-translatable are not sent for translation.

Note that using the [[Leveraging Step]] with the [[Microsoft Translator Connector]] produces MT results similar to this step. However, this step can process several text units at once and is therefore much faster.

MT output can be improved automatically in some cases. For example, extra or missing spaces around inline codes can be fixed with the [[Space Check Step]].

==Parameters==

<cite>Client ID</cite> — The Client ID to use to connect to the MT server. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

<cite>Client Secret</cite> — The secret corresponding to the Client ID.

<cite>Category</cite> — An optional category to use when working with trained engines. You can either enter the engine identifier directly (called 'category' in [https://hub.microsofttranslator.com Microsoft Translator Hub]), or you can use a keyword in the form <code>@@@keyword@@@</code>. If you specify a keyword you must specify a properties file in the <cite>Engine Mapping</cite> field.

The keyword can be a literal string or the <code>${domain}</code> variable. When <code>${domain}</code> is used, the variable is replaced by the first occurrence of the value for the [[ITS_Components|ITS Domain annotation]] found on a text unit. Ideally this Domain annotation should be set on the first text unit of the first document processed. All batches of events translated before a domain annotation is found are translated with the empty category.

{{NoteBox|As stated above, only the first occurrence of the Domain annotation has an effect on the selection of the engine.}}

{{NoteBox|Also, because this step is working on batches, segments before the first occurrence of the Domain annotation but within the same batch will be translated with the engine specified by the domain. For example: If you have 100 events and the <cite>Events buffer</cite> is set to 50 and the first occurrence of the Domain annotation is in the 60th event: The first 50 events will be translated with the empty category and all the other events with the engine corresponding to the specified domain, including the events 51 to 59.}}
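The batching behaviour described in the two notes above can be modelled with a short simulation. This is a sketch under simplifying assumptions: each event is treated as one unit, positions are 1-based, and the function name and the engine label are hypothetical.

```python
def categories_per_event(num_events, buffer_size, first_domain_event, domain_engine):
    """Return the category used for each event, in order.

    Events are grouped into batches of buffer_size. A whole batch uses
    the domain engine once the first Domain annotation falls inside that
    batch or an earlier one; earlier batches use the empty category.
    """
    categories = []
    for start in range(1, num_events + 1, buffer_size):
        end = min(start + buffer_size - 1, num_events)
        # The domain is known if the annotated event is at or before the batch end.
        domain_known = first_domain_event is not None and first_domain_event <= end
        category = domain_engine if domain_known else ""
        categories.extend([category] * (end - start + 1))
    return categories


if __name__ == "__main__":
    # 100 events, buffer of 50, first Domain annotation on the 60th event.
    cats = categories_per_event(100, 50, 60, "travel-engine")
    print(repr(cats[49]), repr(cats[50]), repr(cats[58]), repr(cats[59]))
```

In this model, events 1 to 50 get the empty category, while events 51 to 100 (including 51 to 59) get the domain engine, as in the example of the note.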

<cite>Engine Mapping</cite> — Enter the path of the properties file that contains the mapping between the category keywords and the Microsoft Translator Hub engine identifier. You can use the variables <code>${rootDir}</code> and <code>${inputRootDir}</code>, as well as any of the [[Template:Locales Variables|source or target locale variables]] (<code>${srcLoc}</code>, <code>${trgLoc}</code>, etc). Leave the path empty to not use a mapping. The properties file is a list of lines in the form:

<keyword>.<language>=<engineID>

Where:

* <code><keyword></code> is a case-sensitive string (without spaces, equal signs or periods) that corresponds to the <code>keyword</code> part in <code>@@@keyword@@@</code>.
* <code><language></code> is the uppercase language code of the target locale to process.

For example, if you have the following engine mapping file:

travel.FR=11111111-2222-3333-4444-e42f530c98b8_tra
client1.DE=11111111-2222-3333-4444-90dd26cc9dsd3_gen
client1.tech.DE=11111111-2222-3333-4444-90dd26cc9d48_tech
client2.DE=11111111-2222-3333-4444-90dd26cc9ds34_gen

To use the first engine (assuming you are translating into French), specify <code>@@@travel@@@</code> in the <cite>Category</cite>. To use the third engine specify <code>@@@client1.tech@@@</code>. There is also a fallback mechanism: if you specify <code>@@@client2.law@@@</code>, the step first looks for client2.law.DE and, if that is not found, looks for client2.DE. If no custom engine is found, the generic Microsoft-provided engine is used.
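The lookup and fallback just described can be sketched as follows. The function names are hypothetical and this is not the step's actual code. Two assumptions go beyond the text: lines starting with <code>#</code> are skipped as comments, and the fallback keeps dropping the last dotted part until a match is found, whereas the text only describes a single fallback level.

```python
EXAMPLE_MAPPING = """\
travel.FR=11111111-2222-3333-4444-e42f530c98b8_tra
client1.DE=11111111-2222-3333-4444-90dd26cc9dsd3_gen
client1.tech.DE=11111111-2222-3333-4444-90dd26cc9d48_tech
client2.DE=11111111-2222-3333-4444-90dd26cc9ds34_gen
"""


def parse_engine_mapping(text):
    """Parse '<keyword>.<language>=<engineID>' lines into a dictionary."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, _, engine_id = line.partition("=")
        mapping[key.strip()] = engine_id.strip()
    return mapping


def resolve_engine(category, target_language, mapping):
    """Return the engine ID for a category, or None for the generic engine."""
    is_keyword = (
        category.startswith("@@@") and category.endswith("@@@") and len(category) > 6
    )
    if not is_keyword:
        return category or None  # a literal engine identifier is used as is
    keyword = category[3:-3]
    language = target_language.upper()
    while keyword:
        engine_id = mapping.get(f"{keyword}.{language}")
        if engine_id is not None:
            return engine_id
        # Fallback: drop the last dotted part (client2.law becomes client2).
        keyword, _, _ = keyword.rpartition(".")
    return None


if __name__ == "__main__":
    engines = parse_engine_mapping(EXAMPLE_MAPPING)
    print(resolve_engine("@@@client2.law@@@", "de", engines))
```

Returning <code>None</code> stands for "use the generic engine" in this sketch; how the step represents that case internally is not stated here.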

<cite>Events buffer</cite> — Enter the number of events to buffer for a single query to the engine. The larger the buffer, the faster the processing. However, there are also limits on the volume of text you can process at once.

<cite>Maximum matches</cite> — Enter the maximum number of matches you want to allow per source text.

<cite>Threshold</cite> — Enter the score below which a match is not kept as a result. See the [[Microsoft Translator Connector]] to understand how scores are computed based on their match degree and rating values.

<cite>Query only entries without existing candidate</cite> — Set this option to send to Microsoft Translator only the text for which there is currently no candidate (i.e. annotations added by previous steps or coming from the original document).

<cite>Annotate the text units with the translations</cite> — Set this option to add to the text units annotations that hold the matches found. Those annotations may be used later by other steps. Existing annotations are preserved.

<cite>Generate a TMX document</cite> — Set this option to create a TMX output. Enter the full path of the TMX document to generate. If another document exists already it will be overwritten. You can use the variables <code>${rootDir}</code> and <code>${inputRootDir}</code>, as well as any of the [[Template:Locales Variables|source or target locale variables]] (<code>${srcLoc}</code>, <code>${trgLoc}</code>, etc).

<cite>Send the TMX document to the next step</cite> — Set this option to have the generated TMX document as the only input raw document passed on to the next step of the pipeline. If this option is not set the filter events are passed on to the next step.

<cite>Mark the generated translation as machine translation results</cite> — Set this option to mark the TM entries generated as the result of machine translation. For example, when this option is set, the creationid attribute of the TMX <code><tu></code> element is set to "MT!".

<cite>Fill the target with the best translation candidate</cite> — Set this option to copy the translation with the best score (and a score above or equal to the <cite>Fill threshold</cite>) into the target (if it is empty). Only the matches returned by the Microsoft Translator engine are taken into account.

<cite>Fill threshold</cite> — If the score of the best match is below the provided value, no translation is copied into the target.
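Taken together, the last two options amount to a small decision rule, sketched below. The function name and the representation of candidates as <code>(score, translation)</code> pairs are assumptions made for illustration; this is not the step's actual code.

```python
def fill_target(target, candidates, fill_threshold):
    """Return the target text after applying the fill option.

    candidates is a list of (score, translation) pairs. The best-scoring
    candidate is copied only when the target is empty and its score is
    greater than or equal to fill_threshold.
    """
    if target:
        return target  # an existing target is left untouched
    if not candidates:
        return target
    best_score, best_text = max(candidates, key=lambda candidate: candidate[0])
    if best_score >= fill_threshold:
        return best_text
    return target


if __name__ == "__main__":
    hits = [(95, "Il s'agit de mon test"), (96, "C'est mon essai")]
    print(fill_target("", hits, 95))
```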

==Limitations==

* The Microsoft Translator API has some restrictions for high volume usage. Contact Microsoft for details.
* Only the first ITS Domain annotation in a batch is taken into account.
* See also the limitations on the [[Microsoft Translator Connector]].

[[Category:Steps]] [[Category:ITS]] [[Category:TMX]]

Trying out the Microsoft Translator Connector

2019-04-25T03:30:39Z

Kuro2: API v3 update

__TOC__
==Retirement of version 2 API==
The MICROSOFT CONNECTOR in the Okapi stable releases will STOP WORKING at the end of April 2019.

Microsoft will retire their version 2 API on 2019-04-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, will no longer work from 2019-05-01 onward.

Support for the version 3 API was added to Okapi in mid-April, after the M37 release. If you need to use Microsoft's machine translation service, please pick up the M38 snapshot version from [http://okapiframework.org/snapshots/ here].
Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an Azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

'''Information below is mostly out of date. It is kept as reference until full update of this page is done.'''
<hr/>
The [[Microsoft Translator Connector]] allows you to access the Microsoft Translator system through its API.

You must have a "Client ID" and a "Client Secret" from Microsoft to use it. You get those by obtaining a Windows Live ID and then registering an application in your Live account. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

Note that for commercial or high volume usage you must have a license with Microsoft. The API has some restrictions in its throughput for callers without a license. Those restrictions may change without notice and vary based on service utilization, to ensure fairness. More information can be found on the [http://social.msdn.microsoft.com/Forums/en-US/category/translation Microsoft Translator forums].

{{NoteBox|You need to have a release between M13 and M17 to try out the batch translation and the submission features using the Microsoft AppID authentication. Starting at M18 the library uses the Client ID/Secret authentication.}}

== Searching Translations ==

=== Manual Queries ===

[[Tikal]] provides a way to try out the connector easily.

First you need to create a configuration file that has your credentials. You can create the file with a simple text editor. It should look as follows:

To use the connector with an AppID (obsolete):

#v1
appId=yourAppID

To use the connector with a Client ID/Secret:

#v1
clientId=myPersonalClientID
secret=theSecretForThatClientID

Name the file for example <code>config.cfg</code>.

Now you can use the connector with Tikal. Try for instance:

tikal -q "This is a test" -sl en -tl fr -ms config.cfg

This command line uses the following parameters:

* <code>-q "This is a test"</code> indicates that we want to search for a translation (i.e. [[Tikal - Translation Commands#Query Translation Resources|do a query]]) and the source text to search for is "<code>This is a test</code>".
* <code>-sl en</code> indicates that the source language is English
* <code>-tl fr</code> indicates that the target language is French
* <code>-ms config.cfg</code> specifies to use the [[Microsoft Translator Connector]] and to use <code>config.cfg</code> for the connector's configuration.
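If you want to run such queries from a script, the command line above could be assembled as sketched below. The function names are hypothetical, and the sketch assumes that a launcher named <code>tikal</code> is available on the PATH; only the <code>-q</code>, <code>-sl</code>, <code>-tl</code>, <code>-ms</code> and <code>-opt</code> flags come from this page.

```python
import subprocess


def build_query_command(text, source_lang, target_lang, config_path, opt=None):
    """Build the argument list for a Tikal query with the flags described above."""
    command = [
        "tikal",
        "-q", text,
        "-sl", source_lang,
        "-tl", target_lang,
        "-ms", config_path,
    ]
    if opt is not None:
        command += ["-opt", opt]
    return command


def run_query(text, source_lang, target_lang, config_path, opt=None):
    """Run the query and return Tikal's standard output as text.

    Assumes a 'tikal' launcher is on the PATH; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    command = build_query_command(text, source_lang, target_lang, config_path, opt)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_query() to actually execute it.
    print(build_query_command("This is a test", "en", "fr", "config.cfg"))
```

Passing the arguments as a list avoids shell quoting problems with source text that contains spaces or quotes.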

This should give you back something like:

= From Microsoft-Translator (en->fr)
Threshold=95, Maximum hits=1
score: 95, origin: 'Microsoft-Translator'
Source: "This is a test"
Target: "Il s'agit d'un test."

By default the query is done with a threshold of 95. The threshold is the value under which the matches (or hits) are not retained. The default maximum number of hits displayed is 1.

You can change those options with the parameter <code>-opt</code>. For example:

tikal -q "This is a test" -sl en -tl fr -ms config.cfg -opt 70:5

This will set the threshold to 70 and the maximum number of hits to 5.
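The effect of the two values can be sketched as a filter over scored hits. This is an illustration only: the function names and the <code>(score, translation)</code> representation are assumptions, and <code>parse_opt</code> handles just the two-part <code>threshold:maxhits</code> form shown above.

```python
def parse_opt(value):
    """Split a 'threshold:maxhits' string such as '70:5' into two integers."""
    threshold, max_hits = value.split(":")
    return int(threshold), int(max_hits)


def select_hits(hits, threshold=95, max_hits=1):
    """Keep hits scoring at least threshold, best first, at most max_hits.

    hits is a list of (score, translation) pairs.
    """
    kept = [hit for hit in hits if hit[0] >= threshold]
    kept.sort(key=lambda hit: hit[0], reverse=True)
    return kept[:max_hits]


if __name__ == "__main__":
    hits = [(95, "Il s'agit d'un test."), (72, "C'est un essai"), (60, "Un test")]
    threshold, max_hits = parse_opt("70:5")
    for score, text in select_hits(hits, threshold, max_hits):
        print(score, text)
```

With the defaults (95 and 1), only the best hit scoring 95 or more would be kept.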

=== With the [[Leveraging Step]] ===

The connector is available in the [[Leveraging Step]], so you can use it on any pipeline you need.

You can also use Tikal's [[Tikal - Translation Commands#Translate Files|Translate Files]] command to directly process any file supported by Okapi. For example, the following command creates an output file <code>myFile.out.docx</code> translated into Japanese, provided the file is small enough to be processed within the limitations of the API for non-licensed users.

tikal -t myFile.docx -sl en -tl ja -ms config.cfg

Both options use the <code>GetTranslations</code> method of the API, which works segment by segment and may therefore be slower.

=== With the [[Microsoft Batch Translation Step]] ===

[[Image:MSBatchTranslation.png|thumb|600px|Microsoft Batch Translation Step (Windows 7)]]
The [[Microsoft Batch Translation Step]] takes advantage of the <code>GetTranslationsArray</code> method of the API and allows you to process your input much faster.

For example, to translate any document for which Okapi has a filter you can use the following pipeline:

: = [[Raw Document to Filter Events Step]]
: + [[Microsoft Batch Translation Step]]
: + [[Filter Events to Raw Document Step]]

(See the article "[[How to Create a Pipeline in Rainbow]]" to learn about pipelines)

The step can perform several actions:

* Annotate the text units with the matches found.
* Copy the best translation in the target.
* Generate a [[TMX]] document.

As always, this step is subject to the limitations of the service.

If you set the <cite>Maximum matches</cite> value to more than 1, you may get several results: The MT-generated translation as well as one or more translations added to the repository. Use the <cite>Threshold</cite> value to filter out matches below a given score.

== Adding Translations ==

One interesting aspect of the Microsoft Translator is that anyone can contribute to the translation. This is done using Microsoft's [http://blogs.msdn.com/b/translation/archive/2010/03/15/collaborative-translations-announcing-the-next-version-of-microsoft-translator-technology-v2-apis-and-widget.aspx Collaborative Translation Framework] which provides the necessary API to add translations to the repository.

Note that the entries you are submitting must be single sentences. Any entry containing multiple sentences will be rejected automatically.

In Okapi you can use the feature through:
* Tikal (to enter one translation at a time),
* or the [[Microsoft Batch Submission Step]] (to provide a batch of aligned sentences from a [[TMX]] file or any other bilingual format supported by the framework).

The entries you add to the system can be accessed immediately. They are ranked higher than the default MT-generated entry only if they have been submitted with a rating value greater than 5 (which is the default for MT-generated results).

{{WarningBox|Be '''extremely cautious''' when using this feature as '''you have no way to remove a translation once it has been added''' to Microsoft Translator. You can only re-submit the same translation with a low rating to push it down the list of query results.}}

=== Manual Additions ===

Tikal lets you add a translation to Microsoft Translator using the [[Tikal - Translation Commands#Add Translation to a Resource|<code>-a</code> command]]:

tikal -a "This is my test" "C'est mon essai" -sl en -tl fr -ms config.cfg

This will add the French "C'est mon essai" with the English text "This is my test". You can verify this by querying it:

tikal -q "This is my test" -sl en -tl fr -ms config.cfg -opt 70:5

This should give you something like:

= From Microsoft-Translator (en->fr)
Threshold=70, Maximum hits=5
score: 96, origin: 'Microsoft-Translator'
Source: "This is my test"
Target: "C'est mon essai"
score: 95, origin: 'Microsoft-Translator'
Source: "This is my test"
Target: "Il s'agit de mon test"

Microsoft Translator results come back with two possible values:

* The <code>MatchDegree</code> is a value between 0 and 100 indicating how close the source of the result is to the source of the query.
* The <code>Rating</code> is a value between -10 and 10 indicating how good or bad the translation is. The lower the value, the worse the translation. This value is not always present and its default is 5.

The Okapi connector currently has only one score to carry both pieces of information. So, for any <code>MatchDegree</code> above 90, we add the <code>Rating</code> minus 10. For example, a normal MT result will have a <code>MatchDegree</code> of 100 and a <code>Rating</code> of 5. Therefore its score is 95: 100+(5-10). An exact match rated at 6 (so better than 5) will be 96: 100+(6-10), etc. For results below 90, the <code>Rating</code> is not taken into account.

=== With the [[Microsoft Batch Submission Step]] ===

The [[Microsoft Batch Submission Step]] takes advantage of the <code>AddTranslationArray</code> method of the API and allows you to submit human or post-edited translations to Microsoft Translator's repository.

For example, to submit the segments of a TMX file for which Okapi has a filter, you can use the following pipeline:

: = [[Raw Document to Filter Events Step]]
: + [[Microsoft Batch Submission Step]]

See the article "[[How to Create a Pipeline in Rainbow]]" to learn about pipelines.

See the video "[http://youtu.be/mAjwczqfvAA Importing TMX File into Microsoft Translator Engine]" for a short demonstration on how to use such a pipeline to feed a TMX file into Microsoft Translator.

[[Category:Connectors]] [[Category:Tikal]]

Microsoft Translator Connector

2019-04-25T03:21:33Z

Kuro2: Added minimum info about MS API v3

{{Connectors Header}}
__TOC__
==Retirement of version 2 API==
The MICROSOFT CONNECTOR in the Okapi stable releases will STOP WORKING at the end of April 2019.

Microsoft will retire their version 2 API on 2019-04-30 as described in [https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3 this page].
Because of this, the Microsoft Connector found in the latest stable release, M37, will no longer work from 2019-05-01 onward.

Support for the version 3 API was added to Okapi in mid-April, after the M37 release. If you need to use Microsoft's machine translation service, please pick up the M38 snapshot version from [http://okapiframework.org/snapshots/ here].
Please note this is a minimal implementation and it does not support any new features such as profanity filtering.

Because the version 3 API no longer supports translation memory, that functionality is not available even if you use the latest Okapi M38 snapshot version.

You will need an "Azure key" to use the version 3 API. If you already have a key for version 2, the same key should work.
For information on how to obtain an Azure key, please see [https://azure.microsoft.com/en-us/pricing/details/cognitive-services/ this page].

'''Information below is mostly out of date. It is kept as reference until full update of this page is done.'''

==Overview==
The Microsoft MT engine is freely available from Microsoft at [http://www.microsofttranslator.com http://www.microsofttranslator.com]. Volume limitations apply. The engine supports a large number of language pairs, both common and less common. The list is available at [http://www.microsofttranslator.com/help http://www.microsofttranslator.com/help].

This connector uses the HTTP v2 API. You can get more information about the API and its terms here: [http://sdk.microsofttranslator.com http://sdk.microsofttranslator.com].

To use this connector you need an "Azure Key" from Microsoft. See [https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure#signup the Microsoft pages] for more information.

You must also respect Microsoft's Terms of Service. If you intend to use the Microsoft Translator API for commercial or high volume purposes, you would need to sign a commercial license agreement and provide your AppID to the Microsoft Translator team. For more details contact [mailto:mtlic@microsoft.com mtlic@microsoft.com].

The engine supports inline codes.

When using the query functions of this connector, you are accessing a remote server, which makes your '''source text''' available to Microsoft. No corresponding translation is sent to Microsoft when doing queries.

For more examples on how to use this connector see the article "[[Trying out the Microsoft Translator Connector]]" in the [[Knowledge Base]]. See also the [[Microsoft Batch Translation Step]].

==== Calculation of the combined score ====

The original score of the query is preserved in the <code>score</code> field of the query result.

The <code>combinedScore</code> of the query result holds a re-calculated value that takes into account both the <code>MatchDegree</code> and <code>Rating</code> values returned by the engine.

For results with a <code>MatchDegree</code> of 90 or above, the combined score is computed by adding the <code>Rating</code> value minus 10 to the <code>MatchDegree</code>. For results with a <code>MatchDegree</code> below 90, the combined score is simply the <code>MatchDegree</code>.

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''MatchDegree''' || '''Rating''' || '''Combined Score'''
|- valign="top"
| 100 || 5 || 95 (i.e. 100+(5-10))
|- valign="top"
| 100 || 6 || 96 (i.e. 100+(6-10))
|- valign="top"
| 100 || 0 || 90 (i.e. 100+(0-10))
|- valign="top"
| 100 || -3 || 87 (i.e. 100+(-3-10))
|- valign="top"
| 98 || 9 || 97 (i.e. 98+(9-10))
|- valign="top"
| 95 || 5 || 90 (i.e. 95+(5-10))
|}

This calculation is far from perfect, especially when ranking highly rated high fuzzy matches against low-rated exact matches. But such entries are difficult to rank even manually. We will try to improve this scoring and welcome any feedback you may have.

If a result has no <code>Rating</code> the default is set to 5. Unverified MT translation will generally return a <code>MatchDegree</code> of 100 and a <code>Rating</code> of 5, which will compute into a combined score of 95 in the Okapi interface.
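The rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this section, not the connector's actual code; the function name <code>combined_score</code> is hypothetical.

```python
def combined_score(match_degree, rating=None):
    """Illustrative re-calculation of the combined score (hypothetical helper)."""
    # A result without a Rating is treated as having a Rating of 5.
    if rating is None:
        rating = 5
    # MatchDegree of 90 or above: add (Rating - 10) to the MatchDegree.
    if match_degree >= 90:
        return match_degree + (rating - 10)
    # Below 90: the combined score is simply the MatchDegree.
    return match_degree


# Rows taken from the table above, as (MatchDegree, Rating) pairs.
for degree, rating in [(100, 5), (100, 6), (100, 0), (100, -3), (98, 9), (95, 5)]:
    print(degree, rating, combined_score(degree, rating))
```

Running the loop reproduces the combined scores listed in the table.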

==Parameters==

===Starting with M32:===

<cite>Azure Key</cite> — The Microsoft Azure key to connect to the MT server. See [https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure#signup the Microsoft pages] for more information.

<cite>Category</cite> — An optional category to use when working with trained engines.

Example of a configuration file:

 #v1
 azureKey=myAzureKey
 category=

===Prior to M32:===

<cite>Client ID</cite> — The Client ID to use to connect to the MT server. See [http://msdn.microsoft.com/en-us/library/hh454950.aspx the MSDN pages] for more information.

<cite>Secret</cite> — The secret corresponding to the Client ID.

<cite>Category</cite> — An optional category to use when working with trained engines.

Example of a configuration file:

 #v1
 clientId=myPersonalClientID
 secret=theSecretForThatClientID

==Limitations==

* The engine may occasionally fail to restore spaces properly in its output. This happens especially when inline codes are present in the source.

[[Category:Connectors]]

Rainbow - Command Line

2018-10-01T07:17:46Z

Kuro2: Adding explanation about how to define a pipeline

{{Rainbow Common Menu}}

Rainbow behaves differently at startup depending on the arguments it is given:

* If Rainbow is started with just one argument: it starts in normal mode and takes the argument as a project file to be loaded.

* If Rainbow is started with more than one argument: it starts in command-line mode and interprets the arguments as described in the table below.

When running in batch mode, the log is saved into a file named <code>rainbowBatchLog.txt</code> in the home directory of the user.

Note that you can also use [[Tikal]] to execute various functions from a command line.

The arguments of the command-line can be the following:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code><inputFile>[ -fc <filterConfiguration>]</code>
| Sets the input file, and optionally sets the filter configuration to assign to it. You can specify an absolute or a local filename. The input file root is reset to the folder of the given input file. If a project was loaded, all input files in that project are removed and the input file root is reset.

If you specify several input files (and their filter configurations) the first one will be assigned to the <cite>Input List 1</cite>, the second to the <cite>Input List 2</cite>, etc.

If the filter configuration is not specified in the command line, the default filter (if one can be found) is used.
Input files must be specified prior to an output location being specified (via <code>-o</code>), and the <code>-fc</code> option must always follow an input file.
|- valign="top"
| <code>-p <projectFilename></code>
| Loads an existing project file <code><projectFilename></code>.
|- valign="top"
| <code>-x <Id></code>
| Executes the [[Rainbow - Utilities|utility or the predefined pipeline]] with the ID <code><Id></code>. This is done after all arguments of the command line have been processed.
|- valign="top"
| <code>-pln <pipelineFilename></code>
| Loads and executes the specified pipeline stored in <code><pipelineFilename></code>. A pipeline file can be created by selecting Utilities -> Edit / Execute Pipeline from the menu bar, adding steps by clicking the Add Step... button, and clicking the Save button.
|- valign="top"
| <code>-se <encoding></code>
| Sets the default source encoding to <code><encoding></code>.
|- valign="top"
| <code>-te <encoding></code>
| Sets the default target encoding to <code><encoding></code>.
|- valign="top"
| <code>-sl <langCode></code>
| Sets the source language using <code><langCode></code>.
|- valign="top"
| <code>-tl <langCode></code>
| Sets the target language using <code><langCode></code>.
|- valign="top"
| <code>-opt <optionFilename></code>
| Sets the options file to use for the utility to execute. Use the <code>-np</code> flag to control whether you are prompted to modify the options when the command line is executed. The options file must be for the utility defined with <code>-x</code>. Note that options files are only for utilities, not predefined pipelines. If a non-default behavior of a predefined configuration is desired, define your own pipeline and then use <code>-pln <pipelineFilename></code>.
|- valign="top"
| <code>-log <logFile></code>
| Sets the path to the log file. If not specified <code>{user.home}/rainbowBatchLog.txt</code> is used.
|- valign="top"
| <code>-np</code>
| No prompt for utility's options.
|- valign="top"
| <code>-o <outputFile></code>
| Sets the output file. If this option is not used and an input file is specified, the output file path and name are built based on the output options of the project (loaded or default). If this option is specified before an input file is provided, an error will be reported in the log.
|- valign="top"
| <code>-pd <directory></code>
| Sets the parameters directory (the location where the filter parameters files are stored). You can use <code>.</code> (dot) to specify the current directory. By default, if no project is loaded, the default parameters directory is the user home directory.
|- valign="top"
| <code>-ir <directory></code>
| Sets the input root directory for the first input list. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${inputRootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-rd <directory></code>
| Sets the root directory. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${rootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-? or -h</code>
| Opens this help page.
|}

Here are some examples of command lines in '''Windows'''. They assume Rainbow is installed in the <code>C:\rnb</code> directory.

C:\>java -jar \rnb\lib\rainbow.jar -x TextRewriting -sl EN -tl FR myInput.xlf -o myOutput.xlf

The command-line above executes the Text Rewriting predefined pipeline with the source language set to EN and the target language set to FR. The input document is the XLIFF file <code>myInput.xlf</code>, and the modified file is saved as <code>myOutput.xlf</code>.

C:\>java -jar \rnb\lib\rainbow.jar -x TranslationComparison -sl EN -tl FR -pd . myHumanTrans.xlf myMachineTrans.txt -fc okf_regex@myText

The command-line above executes the Translation Comparison predefined pipeline with the source language set to EN and the target language set to FR. The current folder (<code>.</code>) is specified as the parameters directory. The input file <code>myHumanTrans.xlf</code> is the input document for the <cite>Input List 1</cite>, and the default XLIFF filter configuration is assigned to it. The input file <code>myMachineTrans.txt</code> is the input document for the <cite>Input List 2</cite>, and the custom filter parameters file <code>okf_regex@myText.fprm</code> is associated with it. No utility options are specified, so the user will be prompted to set the options.

C:\>java -jar \rnb\lib\rainbow.jar -h

The command-line above opens this help page.
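The ordering rules from the table (input files must come before <code>-o</code>, and <code>-fc</code> must directly follow its input file) can be sketched as a small Python helper that assembles the argument list. The function <code>build_rainbow_args</code> and its parameter names are hypothetical and are not part of Rainbow; only flags documented above are used.

```python
def build_rainbow_args(inputs, utility_id=None, output=None,
                       source_lang=None, target_lang=None, no_prompt=False):
    """Build a Rainbow argument list (hypothetical helper, not part of Rainbow).

    inputs is a list of (path, filter_config_or_None) pairs; the first pair
    goes to Input List 1, the second to Input List 2, and so on.
    """
    if output is not None and not inputs:
        # -o given without any input file is reported as an error by Rainbow.
        raise ValueError("-o requires at least one input file before it")
    args = []
    if utility_id:
        args += ["-x", utility_id]
    if source_lang:
        args += ["-sl", source_lang]
    if target_lang:
        args += ["-tl", target_lang]
    if no_prompt:
        args.append("-np")
    for path, filter_config in inputs:
        args.append(path)
        if filter_config:
            # -fc must always follow the input file it applies to.
            args += ["-fc", filter_config]
    if output:
        # The output file is placed after all input files.
        args += ["-o", output]
    return args


print(build_rainbow_args([("myInput.xlf", None)], utility_id="TextRewriting",
                         output="myOutput.xlf", source_lang="EN", target_lang="FR"))
```

The resulting list can be appended to <code>java -jar \rnb\lib\rainbow.jar</code>, for example through Python's <code>subprocess.run</code>.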

----
On '''macOS''', go to /Applications, ~/Applications, or wherever you installed Okapi, and replace the "java -jar \rnb\lib\rainbow.jar" part above with "Rainbow.app/Contents/MacOS/rainbow.sh". For example:

Applications $ Rainbow.app/Contents/MacOS/rainbow.sh -h

[[Category:Rainbow]]

Rainbow - Command Line

2018-09-29T17:23:00Z

Kuro2:

{{Rainbow Common Menu}}

Rainbow behaves differently at startup depending on the arguments it is given:

* If Rainbow is started with just one argument: it starts in normal mode and takes the argument as a project file to be loaded.

* If Rainbow is started with more than one argument: it starts in command-line mode and interprets the arguments as described in the table below.

When running in batch mode, the log is saved into a file named <code>rainbowBatchLog.txt</code> in the home directory of the user.

Note that you can also use [[Tikal]] to execute various functions from a command line.

The arguments of the command-line can be the following:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code><inputFile>[ -fc <filterConfiguration>]</code>
| Sets the input file, and optionally sets the filter configuration to assign to it. You can specify an absolute or a local filename. The input file root is reset to the folder of the given input file. If a project was loaded, all input files in that project are removed and the input file root is reset.

If you specify several input files (and their filter configurations) the first one will be assigned to the <cite>Input List 1</cite>, the second to the <cite>Input List 2</cite>, etc.

If the filter configuration is not specified in the command line, the default filter (if one can be found) is used.
Input files must be specified prior to an output location being specified (via <code>-o</code>), and the <code>-fc</code> option must always follow an input file.
|- valign="top"
| <code>-p <projectFilename></code>
| Loads an existing project file <code><projectFilename></code>.
|- valign="top"
| <code>-x <Id></code>
| Executes the [[Rainbow - Utilities|utility or the predefined pipeline]] with the ID <code><Id></code>. This is done after all arguments of the command line have been processed.
|- valign="top"
| <code>-pln <pipelineFilename></code>
| Loads and executes the specified pipeline stored in <code><pipelineFilename></code>.
|- valign="top"
| <code>-se <encoding></code>
| Sets the default source encoding to <code><encoding></code>.
|- valign="top"
| <code>-te <encoding></code>
| Sets the default target encoding to <code><encoding></code>.
|- valign="top"
| <code>-sl <langCode></code>
| Sets the source language using <code><langCode></code>.
|- valign="top"
| <code>-tl <langCode></code>
| Sets the target language using <code><langCode></code>.
|- valign="top"
| <code>-opt <optionFilename></code>
| Sets the options file to use for the utility to execute. Use the <code>-np</code> flag to control whether you are prompted to modify the options when the command line is executed. The options file must be for the utility defined with <code>-x</code>. Note that options files are only for utilities, not predefined pipelines.
|- valign="top"
| <code>-log <logFile></code>
| Sets the path to the log file. If not specified <code>{user.home}/rainbowBatchLog.txt</code> is used.
|- valign="top"
| <code>-np</code>
| No prompt for utility's options.
|- valign="top"
| <code>-o <outputFile></code>
| Sets the output file. If this option is not used and an input file is specified, the output file path and name are built based on the output options of the project (loaded or default). If this option is specified before an input file is provided, an error will be reported in the log.
|- valign="top"
| <code>-pd <directory></code>
| Sets the parameters directory (the location where the filter parameters files are stored). You can use <code>.</code> (dot) to specify the current directory. By default, if no project is loaded, the default parameters directory is the user home directory.
|- valign="top"
| <code>-ir <directory></code>
| Sets the input root directory for the first input list. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${inputRootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-rd <directory></code>
| Sets the root directory. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${rootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-? or -h</code>
| Opens this help page.
|}

Here are some examples of command lines in '''Windows'''. They assume Rainbow is installed in the <code>C:\rnb</code> directory.

C:\>java -jar \rnb\lib\rainbow.jar -x TextRewriting -sl EN -tl FR myInput.xlf -o myOutput.xlf

The command-line above executes the Text Rewriting predefined pipeline with the source language set to EN and the target language set to FR. The input document is the XLIFF file <code>myInput.xlf</code>, and the modified file is saved as <code>myOutput.xlf</code>.

C:\>java -jar \rnb\lib\rainbow.jar -x TranslationComparison -sl EN -tl FR -pd . myHumanTrans.xlf myMachineTrans.txt -fc okf_regex@myText

The command-line above executes the Translation Comparison predefined pipeline with the source language set to EN and the target language set to FR. The current folder (<code>.</code>) is specified as the parameters directory. The input file <code>myHumanTrans.xlf</code> is the input document for the <cite>Input List 1</cite>, and the default XLIFF filter configuration is assigned to it. The input file <code>myMachineTrans.txt</code> is the input document for the <cite>Input List 2</cite>, and the custom filter parameters file <code>okf_regex@myText.fprm</code> is associated with it. No utility options are specified, so the user will be prompted to set the options.

C:\>java -jar \rnb\lib\rainbow.jar -h

The command-line above opens this help page.

----
On '''macOS''', go to /Applications, ~/Applications, or wherever you installed Okapi, and replace the "java -jar \rnb\lib\rainbow.jar" part above with "Rainbow.app/Contents/MacOS/rainbow.sh". For example:

/Applications $ Rainbow.app/Contents/MacOS/rainbow.sh -h

[[Category:Rainbow]]

Rainbow - Command Line

2018-09-29T17:21:06Z

Kuro2: macOS note added

{{Rainbow Common Menu}}

Rainbow behaves differently at startup depending on the arguments it is given:

* If Rainbow is started with just one argument: it starts in normal mode and takes the argument as a project file to be loaded.

* If Rainbow is started with more than one argument: it starts in command-line mode and interprets the arguments as described in the table below.

When running in batch mode, the log is saved into a file named <code>rainbowBatchLog.txt</code> in the home directory of the user.

Note that you can also use [[Tikal]] to execute various functions from a command line.

The arguments of the command-line can be the following:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code><inputFile>[ -fc <filterConfiguration>]</code>
| Sets the input file, and optionally sets the filter configuration to assign to it. You can specify an absolute or a local filename. The input file root is reset to the folder of the given input file. If a project was loaded, all input files in that project are removed and the input file root is reset.

If you specify several input files (and their filter configurations) the first one will be assigned to the <cite>Input List 1</cite>, the second to the <cite>Input List 2</cite>, etc.

If the filter configuration is not specified in the command line, the default filter (if one can be found) is used.
Input files must be specified prior to an output location being specified (via <code>-o</code>), and the <code>-fc</code> option must always follow an input file.
|- valign="top"
| <code>-p <projectFilename></code>
| Loads an existing project file <code><projectFilename></code>.
|- valign="top"
| <code>-x <Id></code>
| Executes the [[Rainbow - Utilities|utility or the predefined pipeline]] with the ID <code><Id></code>. This is done after all arguments of the command line have been processed.
|- valign="top"
| <code>-pln <pipelineFilename></code>
| Loads and executes the specified pipeline stored in <code><pipelineFilename></code>.
|- valign="top"
| <code>-se <encoding></code>
| Sets the default source encoding to <code><encoding></code>.
|- valign="top"
| <code>-te <encoding></code>
| Sets the default target encoding to <code><encoding></code>.
|- valign="top"
| <code>-sl <langCode></code>
| Sets the source language using <code><langCode></code>.
|- valign="top"
| <code>-tl <langCode></code>
| Sets the target language using <code><langCode></code>.
|- valign="top"
| <code>-opt <optionFilename></code>
| Sets the options file to use for the utility to execute. Use the <code>-np</code> flag to control whether you are prompted to modify the options when the command line is executed. The options file must be for the utility defined with <code>-x</code>. Note that options files are only for utilities, not predefined pipelines.
|- valign="top"
| <code>-log <logFile></code>
| Sets the path to the log file. If not specified <code>{user.home}/rainbowBatchLog.txt</code> is used.
|- valign="top"
| <code>-np</code>
| No prompt for utility's options.
|- valign="top"
| <code>-o <outputFile></code>
| Sets the output file. If this option is not used and an input file is specified, the output file path and name are built based on the output options of the project (loaded or default). If this option is specified before an input file is provided, an error will be reported in the log.
|- valign="top"
| <code>-pd <directory></code>
| Sets the parameters directory (the location where the filter parameters files are stored). You can use <code>.</code> (dot) to specify the current directory. By default, if no project is loaded, the default parameters directory is the user home directory.
|- valign="top"
| <code>-ir <directory></code>
| Sets the input root directory for the first input list. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${inputRootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-rd <directory></code>
| Sets the root directory. You can use <code>.</code> (dot) to specify the current directory. This value is also used to set the <code>${rootDir}</code> variable that can be used in some path parameters.
|- valign="top"
| <code>-? or -h</code>
| Opens this help page.
|}

Here are some examples of command lines in Windows. They assume Rainbow is installed in the <code>C:\rnb</code> directory.

C:\>java -jar \rnb\lib\rainbow.jar -x TextRewriting -sl EN -tl FR myInput.xlf -o myOutput.xlf

The command-line above executes the Text Rewriting predefined pipeline with the source language set to EN and the target language set to FR. The input document is the XLIFF file <code>myInput.xlf</code>, and the modified file is saved as <code>myOutput.xlf</code>.

C:\>java -jar \rnb\lib\rainbow.jar -x TranslationComparison -sl EN -tl FR -pd . myHumanTrans.xlf myMachineTrans.txt -fc okf_regex@myText

The command-line above executes the Translation Comparison predefined pipeline with the source language set to EN and the target language set to FR. The current folder (<code>.</code>) is specified as the parameters directory. The input file <code>myHumanTrans.xlf</code> is the input document for the <cite>Input List 1</cite>, and the default XLIFF filter configuration is assigned to it. The input file <code>myMachineTrans.txt</code> is the input document for the <cite>Input List 2</cite>, and the custom filter parameters file <code>okf_regex@myText.fprm</code> is associated with it. No utility options are specified, so the user will be prompted to set the options.

C:\>java -jar \rnb\lib\rainbow.jar -h

The command-line above opens this help page.

----
On '''macOS''', go to /Applications, ~/Applications, or wherever you installed Okapi and replace the "java -jar \rnb\lib\rainbow.jar" part above with "Rainbow.app/Contents/MacOS/rainbow.sh". For example:

/Applications $ Rainbow.app/Contents/MacOS/rainbow.sh -h

[[Category:Rainbow]]

Markdown Filter

2018-09-06T18:55:34Z

Kuro2:

{{Filters Header}}
==Overview==

The Markdown Filter is an Okapi component for extracting translatable text from Markdown files. See https://en.wikipedia.org/wiki/Markdown for more information about the format.
Markdown is a family of formats, not all of them mutually compatible. This filter is designed to work with Markdown based on the [http://commonmark.org CommonMark] specification, with additional features to support [https://guides.github.com/features/mastering-markdown/ GitHub-flavored Markdown].

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input file using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
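The decision above can be sketched in Python as follows. This is an illustration of the described logic only, not the filter's implementation; the function name <code>pick_encoding</code> is hypothetical, and the exact set of Byte-Order-Marks recognized by the filter is an assumption.

```python
import codecs

# Known Byte-Order-Marks. UTF-32 is tested before UTF-16 because the
# UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(data: bytes, default_encoding: str) -> str:
    """Return the encoding implied by a BOM, else the configured default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM found: fall back to the default set in the filter options.
    return default_encoding


print(pick_encoding(codecs.BOM_UTF8 + b"# Title", "windows-1252"))
print(pick_encoding(b"# Title", "windows-1252"))
```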

===HTML Elements===
The HTML Inline Elements, i.e. the tags, and the HTML Block, a chunk of text sandwiched between a block-forming start tag and its corresponding end tag, are processed by the HTML filter. The HTML filter to use can be customized separately.

===Inline Codes===
The [[HTML_Filter#Inline_Code_Finder|Inline Code Finder]] is supported by this filter.

The code finder applies to the translatable text within the Markdown part proper of the document. It does not apply to the HTML inline tags or HTML blocks. For those, you need to enable and specify the inline code pattern for the HTML filter separately, name the configuration okf_html@''arbitrary-name''.fprm, and specify that name for the htmlSubfilter parameter.

Note, the support of the Inline Code Finder was temporarily unavailable in some snapshot builds of version 0.36, but it has been restored.

==Parameters==

; Translate Hyperlink URLs (translateUrls)
: By default, URLs in link and image statements are not exposed for translation. If this option is enabled, they will be extracted. Note: URLs are currently extracted inline in their containing segment, rather than as a subflow. Default: false

; REGEX Pattern for Translatable URLs (urlToTranslatePattern)
: When translateUrls=true, only the URLs that match this REGEX will be extracted. Default: .+ (all URLs)

; Translate Code Blocks (translateCodeBlocks)
: This option controls whether the contents of fenced code blocks are exposed for translation. Default: true

; Translate YAML Metadata Header (translateHeaderMetadata)
: Some markdown formats support a [http://pandoc.org/MANUAL.html#extension-yaml_metadata_block YAML Metadata Header] that contains key/value data. By default, this header is not exposed for translation. When the "Translate YAML Metadata Header" option is enabled, the header will be parsed and the metadata values will be exposed for translation. Default: false

; Translate Image Alt Text (translateImageAltText)
: The alt text for a graphic image in the form of <nowiki>![alt text](https://foo.com/images/bar.jpg)</nowiki> or as the alt attribute of an img tag <nowiki><img src="https://foo.com/images/bar.jpg" alt="alt text"></nowiki> will be extracted if this parameter is true. Default: true.

; HTML Subfilter Configuration ID (htmlSubfilter)
: The custom configuration ID of the HTML filter that will be called to process HTML contents within Markdown documents. The configuration file must be saved in a known location with ''.fprm'' suffix. Specify nothing to use the default HTML filter configuration tailored for the Markdown filter. Default: (empty)

; Enter non translatable block quotes (nonTranslateBlocks)
: This option excludes some block quotes from translation. Block quotes that start with one of the comma-separated strings specified here will not be extracted. Default: (empty - contents in all block quotes will be extracted)

; Use Code Finder (useCodeFinder)
: Determines whether to use the Inline Code Finder or not. Default: false

; Number of Code Finder Rules (codeFinderRules.count)
: The number of rules, i.e. regular expression patterns. Default: 1

; Code Finder Rule ''N'' (codeFinderRules.rule''N'')
: ''N''th matching pattern for codes where ''N''=0,1,2...

; Sample Text (codeFinderRules.sample)
: Sample text to test the rules in the UI.

; Use All Rules (codeFinderRules.useAllRulesWhenTesting)
: Determines whether to apply all rules when testing in the UI.
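As a rough illustration of how the code finder parameters relate to each other, the Python sketch below collects the rules <code>codeFinderRules.rule0</code> to <code>codeFinderRules.rule''N''</code> and reports what they match. It is not the filter's code: the function <code>find_inline_codes</code>, the use of a plain dictionary for the parameters, and the two sample rules are all assumptions, and Python's regular expression syntax differs in places from the Java syntax Okapi uses.

```python
import re


def find_inline_codes(text, params):
    """Return (start, end, matched_text) for every rule match (hypothetical helper)."""
    # Nothing is treated as an inline code unless the finder is enabled.
    if not params.get("useCodeFinder", False):
        return []
    spans = []
    for n in range(params.get("codeFinderRules.count", 0)):
        rule = params.get(f"codeFinderRules.rule{n}")
        if not rule:
            continue
        for match in re.finditer(rule, text):
            spans.append((match.start(), match.end(), match.group(0)))
    # Report the matches in document order.
    return sorted(spans)


# Hypothetical rules: numbered placeholders and printf-style codes.
params = {
    "useCodeFinder": True,
    "codeFinderRules.count": 2,
    "codeFinderRules.rule0": r"\{\d+\}",
    "codeFinderRules.rule1": r"%[sd]",
}
for start, end, code in find_inline_codes("Hello {0}, you have %d items", params):
    print(start, end, code)
```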

==Limitations==

=== Subflows are Not Supported ===

When there is a subflow of text in the middle of the main text, the subflow will be inter-mixed with the main flow of text. For example, for this run of Markdown text:

<pre>
Please click ![The Information desk logo](images/circled-i.jpg) for help.
</pre>

The extracted text in the XLIFF file will look like this:

<pre>
Please click <x id="1"/>The Information desk logo<x id="2"/> for help.
</pre>
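The effect can be imitated with a short Python sketch that replaces each image statement with a pair of placeholder codes around its alt text. This only mimics the shape of the output shown above; it is not how the filter works internally, and the function name <code>inline_image_alt</code> and the simplified image pattern are assumptions.

```python
import re

# Simplified pattern for an image statement: ![alt text](url)
_IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")


def inline_image_alt(text):
    """Replace each image with <x/> codes around its alt text (illustration only)."""
    next_id = 1

    def replace(match):
        nonlocal next_id
        result = f'<x id="{next_id}"/>{match.group(1)}<x id="{next_id + 1}"/>'
        next_id += 2
        return result

    return _IMAGE.sub(replace, text)


print(inline_image_alt(
    "Please click ![The Information desk logo](images/circled-i.jpg) for help."))
```

The alt text ends up inside the surrounding sentence instead of in a separate translation unit, which is what the lack of subflow support means in practice.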

[[Category:Filters]]

Markdown Filter

2018-09-06T18:52:19Z

Kuro2: Updating to match with M36 release

{{Filters Header}}
==Overview==

The Markdown Filter is an Okapi component for extracting translatable text from Markdown files. See https://en.wikipedia.org/wiki/Markdown for more information about the format.
Markdown is a family of formats, not all of them mutually compatible. This filter is designed to work with Markdown based on the [http://commonmark.org CommonMark] specification, with additional features to support [https://guides.github.com/features/mastering-markdown/ GitHub-flavored Markdown].

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input file using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.

===HTML Elements===
The HTML Inline Elements, i.e. the tags, and the HTML Block, a chunk of text sandwiched between a block-forming start tag and its corresponding end tag, are processed by the HTML filter. The HTML filter to use can be customized separately.

===Inline Codes===
The [[HTML_Filter#Inline_Code_Finder|Inline Code Finder]] is supported by this filter.

The code finder applies to the translatable text within the Markdown part proper of the document. It does not apply to the HTML inline tags or HTML blocks. For those, you need to enable and specify the inline code pattern for the HTML filter separately, name the configuration okf_html@''arbitrary-name''.fprm, and specify that name for the htmlSubfilter parameter.

Note, the support of the Inline Code Finder was temporarily unavailable in some snapshot builds of version 0.36, but it has been restored.

==Parameters==

; Translate Hyperlink URLs (translateUrls)
: By default, URLs in link and image statements are not exposed for translation. If this option is enabled, they will be extracted. Note: URLs are currently extracted inline in their containing segment, rather than as a subflow. Default: false

; REGEX Pattern for Translatable URLs (urlToTranslatePattern)
: When translateUrls=true, only the URLs that match this REGEX will be extracted. Default: .+ (all URLs)

; Translate Code Blocks (translateCodeBlocks)
: This option controls whether the contents of fenced code blocks are exposed for translation. Default: true

; Translate YAML Metadata Header (translateHeaderMetadata)
: Some markdown formats support a [http://pandoc.org/MANUAL.html#extension-yaml_metadata_block YAML Metadata Header] that contains key/value data. By default, this header is not exposed for translation. When the "Translate YAML Metadata Header" option is enabled, the header will be parsed and the metadata values will be exposed for translation. Default: false

; Translate Image Alt Text (translateImageAltText)
: The alt text for a graphic image in the form of <nowiki>![alt text](https://foo.com/images/bar.jpg)</nowiki> or as the alt attribute of an img tag <nowiki><img src="https://foo.com/images/bar.jpg" alt="alt text"></nowiki> will be extracted if this parameter is true. Default: true.

; HTML Subfilter Configuration ID (htmlSubfilter)
: The custom configuration ID of the HTML filter that will be called to process HTML contents within Markdown documents. The configuration file must be saved in a known location with ''.fprm'' suffix. Specify nothing to use the default HTML filter configuration tailored for the Markdown filter. Default: (empty)

; Enter non translatable block quotes (nonTranslateBlocks)
: This option excludes some block quotes from translation. Block quotes that start with one of the comma-separated strings specified here will not be extracted. Default: (empty - contents in all block quotes will be extracted)

; Use Code Finder (useCodeFinder)
: Determines whether to use the Inline Code Finder or not. Default: false

; Number of Code Finder Rules (codeFinderRules.count)
: The number of rules, i.e. regular expression patterns. Default: 1

; Code Finder Rule ''N'' (codeFinderRules.rule''N'')
: ''N''th matching pattern for codes where ''N''=0,1,2...

; Sample Text (codeFinderRules.sample)
: Sample text to test the rules in the UI.

; Use All Rules (codeFinderRules.useAllRulesWhenTesting)
: Determines whether to apply all rules when testing in the UI.

==Limitations==

=== Subflows are Not Supported ===

When there is a subflow of text in the middle of the main text, the subflow will be inter-mixed with the main flow of text. For example, for this run of Markdown text:

<nowiki>
Please click ![The Information desk logo](images/circled-i.jpg) for help.
</nowiki>

The extracted text in the XLIFF file will look like this:

<nowiki>
Please click <x id="1"/>The Information desk logo<x id="2"/> for help.
</nowiki>

[[Category:Filters]]