Tikal - Extraction Commands

From Okapi Framework

Extract Files

This command extracts the translatable content of one or more given files into an XLIFF document. You can then use any XLIFF-aware translation tool to translate the document (See "How to Translate XLIFF Documents" for more information). When the translation is done, you can use the Merge Files command to create a new translated file in its original format.

The XLIFF documents created are placed in the same directories as the original files, and have the same name with an additional .xlf extension.

By default, some extensions are mapped to a specific filter configuration (for example: .docx to okf_openxml, .odt to okf_openoffice, .po to okf_po, etc.). You can also define your own configuration and specify it with the -fc option. To get a list of all available filter configurations, use the List Filter Configurations command. For more details on the available filters and their configurations, see each filter's documentation.

You can use the -seg option to specify that the extracted text should be segmented. Use -seg without a file name to use the default segmentation rules, or use "-seg myRules.srx" to specify your own rules. The rules file must be in SRX format. The segments are marked up according to the XLIFF 1.2 specification.

The syntax of this command is:

-x [options] inputFile [inputFile2...]

Note: -x2 is the same as -x, while -x1 extracts without generating a skeleton file (and is meant to be merged using -m1).

Where the options are:

-fc configId The identifier of the filter configuration to use for the extraction.
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files. See more details...
-tl trgLang The code of the target language for the output (also used in the input if the input documents are multilingual). See more details...
-seg [srxFile] The segmentation rules to utilize. To specify the default rules that come with the installation, use -seg without filename. The default rules are in config/defaultSegmentation.srx in your Okapi main directory.
-rd rootDirectory The root directory (by default the user's home directory).
-od outputDirectory The directory where to place the output.
-pen tmDirectory|
-tt hostname[:port]|
-gs configFile|
-mm [key]|
-gg configFile|
-apertium [configFile]|
-ms configFile|
-tda configFile
A translation resource connector to use to translate the document: -pen for the Pensieve TM Connector, -tt for the Translate Toolkit TM Connector, -gs for the GlobalSight TM Connector, -mm for MyMemory TM Connector, -gg for the Google MT v2 Connector, -apertium for the Apertium MT Connector, -ms for the Microsoft Translator Connector, and -tda for the TDA Translation Repository Connector.

The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some Internet-based resources may be slow and result in lengthy processing times. Also be aware that some translation resources may not always handle inline codes well.

-opt threshold TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.
-maketmx [tmxFile] Generates a TMX document with all the entries leveraged. You can specify the name of the document; if you do not, it is named pretrans.tmx.
-nocopy Ensures that the generated XLIFF files do not have a copy of the source text in the target entries if the original target does not exist.
-noalttrans Ensures that the generated XLIFF files do not have added <alt-trans> elements.
-codeattrs Enables the output of extended attributes ctype and equiv-text for inline codes.
-safe Shows a warning before overwriting output files.

For example:

tikal -x *.docx *.html

Extracts all .docx and .html files in the current directory into corresponding .docx.xlf and .html.xlf XLIFF documents. The source language here is the default, which is the current language of the system. The target language by default is fr. No segmentation is done.

tikal -x -sl EN -tl DE -fc okf_regex-srt -ie iso-8859-1 findingNemo.srt

Extracts the sub-title file findingNemo.srt into a findingNemo.srt.xlf XLIFF document. The encoding iso-8859-1 is used to process the input file. The filter used is the Regex Filter with the predefined configuration for SRT documents. The source language is English (EN) and the target language is German (DE). No segmentation is done.

tikal -x *.docx -seg -tl BR

Extracts all .docx files in the current directory into corresponding .docx.xlf XLIFF documents. The source language here is the default, which is the current language of the system. The target language is Breton (BR). The extracted text units will be segmented according to the rules defined in the default SRX segmentation rules file (located in the config sub-directory in your Okapi main directory).

tikal -x *.odt -od toTrans -tl ZU

Extracts all .odt files in the current directory into corresponding .odt.xlf XLIFF documents placed in the toTrans sub-directory of the current directory. The source language here is the default, which is the current language of the system. The target language is Zulu (ZU). No segmentation is done.
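
When Tikal is called from a script rather than typed by hand, the argument list for an extraction can be assembled programmatically. The following is a minimal Python sketch, not part of Tikal itself: the helper name build_extract_args and its parameters are hypothetical, and it only uses options documented above. It places a bare -seg last so that no file name can follow it, which is a cautious assumption about how the optional srxFile argument is parsed.

```python
from typing import List, Optional


def build_extract_args(
    input_files: List[str],
    src_lang: Optional[str] = None,
    trg_lang: Optional[str] = None,
    filter_config: Optional[str] = None,
    encoding: Optional[str] = None,
    output_dir: Optional[str] = None,
    segment: bool = False,
    srx_file: Optional[str] = None,
) -> List[str]:
    """Return an argument list for 'tikal -x' (hypothetical helper)."""
    if not input_files:
        raise ValueError("at least one input file is required")
    # Input files go first; the documented examples allow options after them.
    args = ["tikal", "-x", *input_files]
    if filter_config:
        args += ["-fc", filter_config]
    if encoding:
        args += ["-ie", encoding]
    if src_lang:
        args += ["-sl", src_lang]
    if trg_lang:
        args += ["-tl", trg_lang]
    if output_dir:
        args += ["-od", output_dir]
    if srx_file:
        # Custom SRX rules: -seg followed by the rules file.
        args += ["-seg", srx_file]
    elif segment:
        # Default rules: bare -seg, kept last (assumption, see lead-in).
        args.append("-seg")
    return args


if __name__ == "__main__":
    print(build_extract_args(["findingNemo.srt"], src_lang="EN", trg_lang="DE",
                             filter_config="okf_regex-srt",
                             encoding="iso-8859-1"))
```

The returned list can be passed to subprocess.run, assuming the tikal launcher is on the PATH. Note that wildcards such as *.docx are not expanded when a list is passed this way, so the file names need to be resolved beforehand.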

Merge Files

This command merges back into their original format one or more XLIFF documents that were created using the Extract Files command. If you have extracted with -x or -x2 you must have the skeleton files in the same directories as their corresponding XLIFF documents. If you have extracted with -x1 you must have the original files in the same directories as their corresponding XLIFF documents.

Note: If you are merging with the original files (-x1), the original files must be exactly the same as when they were extracted to XLIFF, and the same encoding and filter configuration must be used. Any difference may result in a different extraction and could cause merging errors.

The XLIFF document names must be the names of the original files with an additional .xlf extension. The new documents are created in the directories where the XLIFF documents are, with .out inserted before the original extension. For example, if your original file is myFile.html, the XLIFF document should be myFile.html.xlf, and the merged file will be myFile.out.html.
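
The naming rule above can be expressed as a small Python sketch. This is only an illustration of the convention, not Tikal's implementation; the function names xliff_name and merged_name are hypothetical. Paths are treated as POSIX-style so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def xliff_name(original_file: str) -> str:
    """Name of the XLIFF document extracted from original_file."""
    return original_file + ".xlf"


def merged_name(xliff_file: str) -> str:
    """Name of the merged file produced from xliff_file."""
    path = PurePosixPath(xliff_file)
    if path.suffix != ".xlf":
        raise ValueError("expected a file name ending in .xlf")
    # Drop the .xlf extension to recover the original file name.
    original = path.with_suffix("")
    # Insert .out before the original extension.
    return str(original.with_name(original.stem + ".out" + original.suffix))


if __name__ == "__main__":
    print(xliff_name("myFile.html"))
    print(merged_name("myFile.html.xlf"))
```

The behavior for original files without any extension is not described here, so the sketch makes no claim about that case.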

The syntax of this command is:

-m [options] xliffFile [xliffFile2...]

Note: -m2 is the same as -m and uses a skeleton file for the merge, while -m1 uses the original document for the merge.

Where the options are:

-fc configId The identifier of the filter configuration to use for the re-extraction of the original file.
-ie encoding The encoding name of the original files. This is used only if the filter cannot detect the encoding from the input file itself.
-oe encoding The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
-sl srcLang The code of the source language. See more details...
-tl trgLang The code of the target language. See more details...
-sd sourceDirectory The directory where to find the source file of the XLIFF document.
-od outputDirectory The directory where to place the output.

For example:

tikal -m *.xlf -sl EN -tl DE

Merges all XLIFF documents in the directory. The skeleton files should be in the same directory as well. The source language is English and the target language is German.

tikal -m toTrans/*.xlf -sl EN -tl ZU -sd . -od xlated

Merges all XLIFF documents in the toTrans sub-directory of the current directory. The skeleton files are in the current directory. The merged files are placed in the xlated sub-directory of the current directory. The source language is English and the target language is Zulu.

Extract Files to Moses

This command extracts the translatable content of one or more given files into a text format usable by Moses. You can then perform various tasks on this document.

The Moses files created are placed in the same directories as the original files, and have the same name with an additional extension that is the code of the source locale.

If the option -2 is set, the target output has the same name as the source output, but with an extension that is the code of the target locale, except if the source file ends with an extension that is the code of the source locale. In that case, the target file takes the name of the source file with the last extension replaced by the code of the target language. For example, if the English source output is out.txt, the target output for French is out.txt.fr. If the English source output is out.en, the target output for French is out.fr.
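
The rule for the target file name can be sketched in Python as follows. This is an illustration of the rule as described above, not Tikal's own code; the function name moses_target_name is hypothetical, and the comparison of the locale extension is assumed to be case-sensitive.

```python
def moses_target_name(source_output: str, src_locale: str, trg_locale: str) -> str:
    """Derive the Moses target file name from the source output name."""
    src_ext = "." + src_locale
    if source_output.endswith(src_ext):
        # The source name ends with the source locale code:
        # replace that last extension with the target locale code.
        return source_output[: -len(src_ext)] + "." + trg_locale
    # Otherwise, append the target locale code as an extra extension.
    return source_output + "." + trg_locale


if __name__ == "__main__":
    print(moses_target_name("out.txt", "en", "fr"))
    print(moses_target_name("out.en", "en", "fr"))
```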

The syntax of this command is:

-xm [options] inputFile

Where the options are:

-fc configId The identifier of the filter configuration to use for the extraction.
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files. See more details...
-tl trgLang The code of the target language (used in the input if the input documents are multilingual). See more details...
-2 Extract two files: one for the source, one for the target. The target file has as many lines as the source file. If there is an existing target segment, the target segment is extracted; otherwise an empty line is used for the missing target.
-to srcOutputFile The path of the Moses source file to generate. The last part of the path is the template filename to use, the code of the source language is added automatically. Warning: You must not use this option if you are processing several files at the same time. If you do so, the output will contain only the result of the last input file.
-seg [srxFile] The segmentation rules to utilize. To specify the default rules that come with the installation, use -seg without filename. The default rules are in config/defaultSegmentation.srx in your Okapi main directory.
-rd rootDirectory The root directory (by default the user's home directory).

For example:

tikal -xm myFile.html

Extracts the HTML file myFile.html in the current directory into the corresponding myFile.html.en Moses document, assuming the default source language is English.

tikal -xm myFile.xlf -2 -sl en-us -tl af

Extracts the content of the XLIFF document myFile.xlf into two Moses InlineText files. The first one named myFile.xlf.en-us for the source, the second called myFile.xlf.af for the target.

tikal -xm myFile.xlf -2 -to out.txt -sl en -tl zu

Extracts the content of the XLIFF document myFile.xlf into two Moses InlineText files. The first one named out.txt.en for the source, the second called out.txt.zu for the target.

Leverage Files from Moses

This command takes an input file, and leverages the translation found in its corresponding Moses InlineText files. The initial InlineText file should be created with the Extract Files to Moses command.

The filter configuration, input encoding and segmentation parameters must be the same in this command as they were in the extraction command. This is to ensure the entries between the input file and the corresponding Moses file match one-to-one.

The new documents are created in the directories where the input documents are, with .out inserted before the original extension. For example, if your original file is myFile.xlf, and your target language is Zulu, the Moses InlineText document should be (by default) myFile.xlf.zu, and the leveraged result file will be myFile.out.xlf.
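
As a quick check of which files a leverage run reads and writes by default, the two default names can be computed with a short Python sketch. The function names are hypothetical and the code only mirrors the convention described above; it is not taken from Tikal.

```python
from pathlib import PurePosixPath


def default_moses_file(input_file: str, trg_lang: str) -> str:
    """Default Moses InlineText file: the input name plus the target code."""
    return f"{input_file}.{trg_lang}"


def default_leveraged_output(input_file: str) -> str:
    """Default output file: .out inserted before the original extension."""
    path = PurePosixPath(input_file)
    return str(path.with_name(f"{path.stem}.out{path.suffix}"))


if __name__ == "__main__":
    print(default_moses_file("myFile.xlf", "zu"))
    print(default_leveraged_output("myFile.xlf"))
```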

The syntax of this command is:

-lm [options] inputFile

Where the options are:

-fc configId The identifier of the filter configuration to use for the extraction.
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-oe encoding The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
-sl srcLang The code of the source language of the input files. See more details...
-tl trgLang The code of the target language. See more details...
-totrg Copy the leveraged translation into the target, except if there is already an existing target content.
-overtrg Copy the leveraged translation into the target, even if there is already existing target content.
-bpt Use the <bpt>/<ept>/<ph> notation instead of the <g>/<x> notation in <alt-trans> elements.
-from mosesFile The path of the Moses InlineText file from which to leverage the text. If this option is not set, the file to leverage from is the same as the input file with the language code of the target appended as an extension.
-to outputFile The path of the output file to generate. If this option is not set, the output is the same as the input with .out prepended to the file extension.
-seg [srxFile] The segmentation rules to utilize. To specify the default rules that come with the installation, use -seg without filename. The default rules are in config/defaultSegmentation.srx in your Okapi main directory.
-rd rootDirectory The root directory (by default the user's home directory).
-noalttrans Ensures that the generated XLIFF files do not have added <alt-trans> elements.

For example:

tikal -lm myFile.html -tl zh

Leverages into the input file myFile.html the translations from the corresponding myFile.html.zh Moses InlineText document. The source language is the default for the platform, while the target is Chinese (zh). The output file is myFile.out.html.

tikal -lm myFile.xlf -sl en -tl ja -from trans.txt -totrg

Leverages the XLIFF file named myFile.xlf in the current directory using the Moses InlineText document named trans.txt. The source language is English and the target is Japanese. The output file is myFile.out.xlf. The Moses translation is copied into the <target> elements only if the element is empty or non-existing in the source document. The <alt-trans> elements are added for all segments.