Tikal

User Guide

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://okapiframework.org/wiki/index.php?title=Tikal

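Tikal is a command-line tool that provides the following basic localization-related utilities:

Extract Files (-x)
Merge Files (-m)
Translate Files (-t)
Query Translation Resources (-q)
List Filter Configurations (-lfc)
Edit Filter Configurations (-e)
Convert to PO Format (-2po)
Convert to TMX Format (-2tmx)
Convert to Table Format (-2tbl)
Import into Pensieve TM (-imp)
Export TMX from Pensieve TM (-exp)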

Note: The command -e (Edit Filter Configurations) requires access to UI editors that are available only if you have one of the okapi-apps platform-specific distributions. It is not available with the okapi-lib cross-platform distribution.

Extract Files

This command extracts the translatable content of one or more given files into an XLIFF document. You can then use any XLIFF-aware translation tool to translate the document. Examples of XLIFF-capable open-source tools include (among others) OmegaT, Virtaal, Qt Linguist, and Lokalize. When the translation is done, you can use the Merge command to create a new translated file in its original format.

The XLIFF documents created are placed in the same directories as the original files, and have the same name with an additional .xlf extension.

By default, some extensions are mapped to a specific filter configuration (for example: .docx to okf_openxml, .odt to okf_openoffice, .po to okf_po, etc.). You can also define your own configuration and specify it using the -fc option. To get a list of all available filter configurations, use the List Filter Configurations command. For more details on the available filters and their configurations, see each filter's documentation.

You can use the -seg option to specify that the extracted text should be segmented. Use -seg without a filename to use the default segmentation rules, or use "-seg myRules.srx" to specify your own rules. The rules file must be in SRX format. The segments are marked up according to the XLIFF 1.2 specification.

The syntax of this command is:

-x [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the extraction.
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language for the output (also used in the input if the input documents are multilingual).
-seg [srxFile]

The segmentation rules to utilize. To specify the default rules that come with the installation, use -seg without filename. The default rules are in config/defaultSegmentation.srx in your Okapi main directory.

-pen tmDirectory|
-tt hostname[:port]|
-gs configFile|
-mm key|
-google|
-apertium [configFile]|
-ms configFile|
-tda configFile
A translation resource to use to translate the document: -pen for a Pensieve TM, -tt for a Translate Toolkit TM, -gs for a GlobalSight TM, -mm for the MyMemory repository, -google for Google MT, -apertium for an Apertium server, -ms for Microsoft MT, and -tda for the TDA translation repository. The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some internet-based resources may be slow and result in lengthy processing time. Also be aware that some translation resources may not always handle inline codes well.

-opt threshold

TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.

-maketmx [tmxFile] Generates a TMX document with all the entries leveraged. You can specify the name of the document; if you do not, it is named pretrans.tmx.
-nocopy Ensures that the generated XLIFF files do not have a copy of the source text in the target entries if the original target does not exist.

For example:

tikal -x *.docx *.html

Extracts all .docx and .html files in the current directory into corresponding .docx.xlf and .html.xlf XLIFF documents. The source language here is the default, which is the current language of the system. The target language by default is fr. No segmentation is done.

tikal -x -sl EN -tl DE -fc okf_regex-srt -ie iso-8859-1 findingNemo.srt

Extracts the subtitle file findingNemo.srt into a findingNemo.srt.xlf XLIFF document. The encoding iso-8859-1 is used to process the input file. The filter used is the Regex filter with the predefined configuration for SRT documents. The source language is English (EN) and the target language is German (DE). No segmentation is done.

tikal -x *.docx -seg -tl BR

Extracts all .docx files in the current directory into corresponding .docx.xlf XLIFF documents. The source language here is the default, which is the current language of the system. The target language is Breton. The extracted text units are segmented according to the rules defined in the default SRX segmentation rules file (located in the config sub-directory of your Okapi main directory).
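
When Tikal is driven from a script, the extraction command lines shown above can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything. The helper name build_extract_args and its parameters are hypothetical, it assumes that a tikal launcher is on the PATH, and it uses only options described in this section.

```python
from typing import List, Optional


def build_extract_args(files: List[str],
                       src_lang: Optional[str] = None,
                       trg_lang: Optional[str] = None,
                       config_id: Optional[str] = None,
                       encoding: Optional[str] = None,
                       segment: bool = False,
                       srx_file: Optional[str] = None) -> List[str]:
    """Build (but do not run) a 'tikal -x' command line as an argument list."""
    # Input files come first, as in "tikal -x *.docx -seg -tl BR".
    args = ["tikal", "-x"] + list(files)
    if src_lang:
        args += ["-sl", src_lang]
    if trg_lang:
        args += ["-tl", trg_lang]
    if config_id:
        args += ["-fc", config_id]
    if encoding:
        args += ["-ie", encoding]
    if segment:
        # -seg goes last so that a bare -seg is never followed by a file name.
        args.append("-seg")
        if srx_file:
            args.append(srx_file)
    return args


if __name__ == "__main__":
    print(build_extract_args(["findingNemo.srt"], src_lang="EN", trg_lang="DE",
                             config_id="okf_regex-srt", encoding="iso-8859-1"))
```

The list can be handed to subprocess.run(args, check=True) to execute the command. The files are placed before the options here so that a -seg without a filename is not directly followed by an input file, which might otherwise be read as its SRX argument.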

Merge Files

This command merges back into their original format one or more XLIFF documents that were created using the Extract command. You must have the original files in the same directories as their corresponding XLIFF documents.

The XLIFF document names must be the names of the original files with an additional .xlf extension. The new documents are created in the directories where the XLIFF documents are, with .out inserted before the original extension. For example, if your original file is myFile.html, the XLIFF document should be myFile.html.xlf, and the merged file will be myFile.out.html.
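
The naming convention can be checked with a few lines of Python. This is a minimal sketch of the rule as described here, not Tikal's own code; the function names are hypothetical, and the result for a file without any extension is an assumption.

```python
from pathlib import PurePath


def xliff_name(original: str) -> str:
    """Name of the XLIFF document produced by the Extract command."""
    return original + ".xlf"


def merged_name(xliff: str) -> str:
    """Name of the merged file produced from a given XLIFF document."""
    if not xliff.endswith(".xlf"):
        raise ValueError("expected a name ending in .xlf: " + xliff)
    # Strip the trailing .xlf to recover the original file name.
    original = PurePath(xliff[:-len(".xlf")])
    # Insert .out between the base name and the original extension.
    return str(original.with_name(original.stem + ".out" + original.suffix))


if __name__ == "__main__":
    print(xliff_name("myFile.html"))       # myFile.html.xlf
    print(merged_name("myFile.html.xlf"))  # myFile.out.html
```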

The syntax of this command is:

-m [options] xliffFile [xliffFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the re-extraction of the original file.
-ie encoding The encoding name of the original files. This is used only if the filter cannot detect the encoding from the input file itself.
-oe encoding The encoding name of the file to generate. The same encoding as the input file will be used if this option is not specified.
-sl srcLang The code of the source language.
-tl trgLang The code of the target language.

For example:

tikal -m *.xlf -sl EN -tl DE

Merges all XLIFF documents in the directory. The original files should be in the same directory as well. The source language is English and the target language is German.

Translate Files

This command creates a pre-translated version of the input files. It is essentially the same as running an Extract command (with pre-translation) immediately followed by a Merge command.

By default, some extensions are mapped to a specific filter configuration (for example: .docx to okf_openxml, .odt to okf_openoffice, .po to okf_po, etc.). You can also define your own configuration and specify it using the -fc option. To get a list of all available filter configurations, use the List Filter Configurations command. For more details on the available filters and their configurations, see each filter's documentation.

You can use the -seg option to specify that the extracted text should be segmented. Use -seg without a filename to use the default segmentation rules, or use "-seg myRules.srx" to specify your own rules. The rules file must be in SRX format.

The syntax of this command is:

-t [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the extraction.
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-oe encoding The encoding name of the output file to generate. The same encoding as the input file will be used if this option is not specified.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language for the output (also used in the input if the input documents are multilingual).
-seg [srxFile]

The segmentation rules to utilize. To specify the default rules that come with the installation, use -seg without filename. The default rules are in config/defaultSegmentation.srx in your Okapi main directory.

-pen tmDirectory|
-tt hostname[:port]|
-gs configFile|
-mm key|
-google|
-apertium [configFile]|
-ms configFile|
-tda configFile
A translation resource to use to translate the document: -pen for a Pensieve TM, -tt for a Translate Toolkit TM, -gs for a GlobalSight TM, -mm for the MyMemory repository, -google for Google MT, -apertium for an Apertium server, -ms for Microsoft MT, and -tda for the TDA translation repository. The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some internet-based resources may be slow and result in lengthy processing time. Also be aware that some translation resources may not always handle inline codes well.

-opt threshold

TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.

-maketmx [tmxFile] Generates a TMX document with all the entries leveraged. You can specify the name of the document; if you do not, it is named pretrans.tmx.

For example:

tikal -t *.html -sl en -tl eo -apertium

Translates all .html files in the current directory from English to Esperanto, using the default Apertium MT demonstration server. No segmentation is used.

Query Translation Resources

This command queries one or more translation resources for a given text. By default the query is sent to the Google MT engine, but you can also query Pensieve TMs, GlobalSight TMs, Translate Toolkit TMs, the Open-Tran repository, the MyMemory repository, as well as any Apertium MT server. See the section Translation Resources Details for more information.

You can query all resources at once. When querying several resources, the results are shown per resource, not sorted by best score as a whole.

The syntax of this command is:

-q "text" [options]

Where the options are:

-sl srcLang The code of the source language (language of the text queried)
-tl trgLang The code of the target language (language of the requested translation)
-pen directory Queries a Pensieve TM stored in a given directory.
-opentran Queries the OpenTran translation repository. This requires Internet access.
-gs configFile Queries a GlobalSight TM server. This requires Internet access.
-tt hostname[:port] Queries the specified Translate Toolkit TM server. This assumes you have access to the server (local or remote).
-mm key Queries the MyMemory translation repository with a given access key (use mmDemo123 for demo). This requires Internet access.
-google Queries the Google MT server. If no other type of resource is specified, this is used by default. This requires Internet access.
-apertium [configFile]

Queries the specified Apertium MT server (local or remote). A default remote server is provided.

-ms configFile Queries the Microsoft MT server. This requires Internet access.
-tda configFile Queries the TDA translation repository. This requires Internet access.
-opt threshold[:maxhits]

TM query options: The threshold is a number between 0 and 100. The maximum number of hits is a number above 0. If this option is not set each TM engine uses its own defaults. If this option is set, all TM engines are set to use the specified options. Note that parameters of some engines may be limited by their configuration.
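
The value of the -opt option can be validated before it is passed to Tikal. The following Python sketch parses the threshold[:maxhits] form described above and checks the stated ranges. The function name parse_opt is hypothetical, and this is not how Tikal itself parses the option.

```python
from typing import Optional, Tuple


def parse_opt(value: str) -> Tuple[int, Optional[int]]:
    """Split 'threshold[:maxhits]' into (threshold, max_hits or None)."""
    threshold_text, separator, maxhits_text = value.partition(":")
    threshold = int(threshold_text)
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    max_hits = None
    if separator:
        max_hits = int(maxhits_text)
        if max_hits <= 0:
            raise ValueError("maximum number of hits must be above 0")
    return threshold, max_hits


if __name__ == "__main__":
    print(parse_opt("60:20"))  # (60, 20)
    print(parse_opt("95"))     # (95, None)
```

A result of None for the second value means that no maximum was given, in which case each engine would be expected to use its own default.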

Because the text of the query cannot be associated with a given file format, there is no support for format-specific inline codes. However, when querying a resource that is inline-code aware, you can use HTML-like tags in place of codes. For example, in "Open the <x>window</x><x/>." the tags "<x>", "</x>" and "<x/>" are interpreted as opening, closing, and placeholder inline codes, and the query is processed accordingly. When querying resources that are not inline-code aware, the tags are treated as plain text.
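
To illustrate how the three tag forms are distinguished, the following Python sketch classifies the tags found in a query string. It is only an illustration of the convention described above, not the parser Tikal uses; the regular expression, the function name, and the assumption that tag names consist of word characters are all hypothetical.

```python
import re
from typing import List, Tuple

# Matches <name>, </name> and <name/> where name is made of word characters.
TAG = re.compile(r"<(/?)(\w+)(/?)>")


def classify_tags(text: str) -> List[Tuple[str, str]]:
    """Return (kind, name) pairs for each HTML-like tag in the query text."""
    found = []
    for match in TAG.finditer(text):
        leading_slash, name, trailing_slash = match.groups()
        if leading_slash and trailing_slash:
            continue  # "</x/>" is not one of the three described forms
        if leading_slash:
            kind = "closing"
        elif trailing_slash:
            kind = "placeholder"
        else:
            kind = "opening"
        found.append((kind, name))
    return found


if __name__ == "__main__":
    print(classify_tags("Open the <x>window</x><x/>."))
```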

For example:

tikal -q "open file" -sl en

Queries the default translation resource (Google MT system) for the text "open file" in English. The target language by default is French. Note: You could omit the -sl option if you are running from an English system.

tikal -q "open <x>file</x>" -sl en -pen mytm -opt 60:20

Queries the Pensieve TM located in mytm for the text "open <x>file</x>" in English. The target language by default is French. Because Pensieve TM can work with inline codes, the tags "<x>" and "</x>" are processed as inline codes. The threshold is set to 60 and the maximum hits is set to 20.

tikal -q "open file" -opentran -sl en -tl zu

Queries the OpenTran translation repository for the English text "open file" in Zulu.

tikal -q "open file" -tt localhost -sl en -tl af

Queries a local Translate Toolkit TM server located at http://localhost:8080 (note that 8080 is omitted from the command line because it is the default port). The source is English and the requested translation is Afrikaans.

List Filter Configurations

This command lists all the filter configurations available for Tikal. The configurations listed are the ones you can use as filter configurations for the input files (-fc option). A filter configuration indicates how to extract the document.

The syntax of this command is:

-lfc | --listconf

For example:

tikal -listconf

Lists all the configurations currently available.

Edit Filter Configurations

This command edits or views filter configurations.

Note: This command requires access to UI editors that are available only if you have one of the okapi-apps platform-specific distributions. If you run this command from the okapi-lib cross-platform distribution, you will get an error. To edit filter configurations in the okapi-lib distribution, edit the .fprm files directly. Make sure to always save your modifications in UTF-8.

The syntax of this command is:

-e [[-fc] configId]

For example:

tikal -e okf_regex@myConfig

Edits the filter configuration okf_regex@myConfig. This is a user configuration for the RegEx Filter.

tikal -e

Opens the Filter Configurations dialog box, where all the available configurations are listed and can be viewed or edited, and from where you can create new configurations.

Convert to PO Format

Creates a PO file for the given input file. If the input file is multilingual (like a TMX or a TS file), the source and target will be in the PO file.

The syntax of this command is:

-2po [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the input files
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language.
-generic Indicates to use generic notation for inline codes in the generated PO file, for example <1/> vs. <br/>. If this option is not specified the inline codes are output in their original form.
-trgsource|
-trgempty
Forces the content of the output target field to be either a copy of the source or empty. If neither option is set the content of the target field is the target text or empty.
-all Allows entries that have no text to be converted. If this option is not set, entries that are empty or contain only codes or whitespace are not included in the output file. If this option is set, all entries are included in the output.

For example:

tikal -2po data.tmx -sl EN -tl ZU

Creates a PO file from the TMX document data.tmx. The source language will be English and the target Zulu.
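
The effect of -trgsource and -trgempty on the target field can be summarized in a few lines. This Python sketch reflects only the rule stated in the option description above. The function name and parameters are hypothetical, and treating the two options as mutually exclusive is an assumption based on the way they are listed as alternatives.

```python
from typing import Optional


def output_target(source: str,
                  target: Optional[str],
                  trg_source: bool = False,
                  trg_empty: bool = False) -> str:
    """Return the content written to the output target field."""
    if trg_source and trg_empty:
        raise ValueError("use either -trgsource or -trgempty, not both")
    if trg_source:
        return source  # forced copy of the source
    if trg_empty:
        return ""      # forced empty target
    # Neither option: the existing target text, or empty if there is none.
    return target if target is not None else ""


if __name__ == "__main__":
    print(output_target("Open file", "Ouvrir le fichier"))
    print(output_target("Open file", "Ouvrir le fichier", trg_source=True))
```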

Convert to TMX Format

Creates a TMX document for the given input file. If the input file is multilingual (like a PO or a TS file), the source and target will be in the TMX document.

The syntax of this command is:

-2tmx [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the input files
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language.
-trgsource|
-trgempty
Forces the content of the output target field to be either a copy of the source or empty. If neither option is set the content of the target field is the target text or empty.
-all Allows entries that have no text to be converted. If this option is not set (the default), entries that are empty or contain only codes or whitespace are not included in the output file. If this option is set, all entries are included in the output.

For example:

tikal -2tmx data.po -sl EN -tl ZU

Creates a TMX document from the PO file data.po. The source language will be English and the target Zulu.

tikal -2tmx data.tmx -sl EN -tl DE -trgempty

Creates a TMX document from another TMX document named data.tmx. The source language will be English and the target German. The content of the <tuv> elements for the German will be empty.

Convert to Table Format

Creates a table-like output for the given input file. If the input file is multilingual (like a PO or a TS file), the source and target will be in the output table.

The syntax of this command is:

-2tbl [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the input files
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language.
-trgsource|
-trgempty
Forces the content of the output target field to be either a copy of the source or empty. If neither option is set the content of the target field is the target text or empty.
-csv|
-tab
Output format: csv for comma-separated values, or tab for tab-delimited values.
-xliff|
-xliffgx|
-tmx|
-generic
Inline codes format: xliff for XLIFF, xliffgx for XLIFF with g/x notation, tmx for TMX, or generic for generic placeholders.
-all Allows entries that have no text to be converted. If this option is not set, entries that are empty or contain only codes or whitespace are not included in the output file. If this option is set, all entries are included in the output.

For example:

tikal -2tbl data.tmx -sl EN -tl ZU

Creates a tab-delimited file from the TMX document data.tmx. Inline codes are output in their original form. The source language is English and the target Zulu. Any tab character within the text is escaped with a backslash prefix.

tikal -2tbl data.po -sl EN -tl ES -csv -xliffgx -trgsource

Creates a comma-separated values output file from the PO file data.po. The inline codes are represented as XLIFF elements using the <g> and <x> notation. The text is between double quotes, and any double-quote and backslash characters within the text are escaped with a backslash prefix. The source language is English and the target Spanish. The content of the target column is a copy of the source.
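
The quoting rule for the comma-separated output can be reproduced as follows. This Python sketch implements only what is described above (text between double quotes, with double-quote and backslash characters prefixed by a backslash). The function names are hypothetical, and details not stated here, such as the handling of line breaks, are left out.

```python
def csv_field(text: str) -> str:
    """Quote one field: escape backslashes and double quotes, then wrap."""
    # Backslashes are escaped first so the ones added for quotes stay single.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def csv_row(source: str, target: str) -> str:
    """Build one source,target row of the table."""
    return csv_field(source) + "," + csv_field(target)


if __name__ == "__main__":
    print(csv_row('Click "Open"', 'Haga clic en "Abrir"'))
```

Note that this differs from the convention used by many spreadsheet tools, which double the quote character instead of prefixing it with a backslash.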

Import into Pensieve TM

Imports the specified input documents into a Pensieve TM database. If the specified TM does not exist, it is created. If it does exist, the input files are added to it.

The syntax of this command is:

-imp myTMdirectory [options] inputFile [inputFile2...]

Where the options are:

-fc configId The identifier of the filter configuration to use for the input files
-ie encoding The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
-sl srcLang The code of the source language of the input files.
-tl trgLang The code of the target language.
-trgsource|
-trgempty
Forces the content of the output target field to be either a copy of the source or empty. If neither option is set the content of the target field is the target text or empty.
-all Allows entries that have no text to be imported. If this option is not set (the default), entries that are empty or contain only codes or whitespace are not included in the output file. If this option is set, all entries are included in the output.

For example:

tikal -imp myTMdir data.po -sl JA -tl FR

Imports the PO file data.po into the TM database located in myTMdir. If the directory does not exist, it is created. If a TM exists, the input file(s) are added to it. The source language of the PO file is Japanese and the target French.

Export TMX from Pensieve TM

Creates a TMX output file for the given Pensieve TMs.

The syntax of this command is:

-exp tmDirectory [tmDirectory2...] [options]

Where the options are:

-sl srcLang The code of the source language.
-tl trgLang The code of the target language.
-trgsource|
-trgempty
Forces the content of the output target field to be either a copy of the source or empty. If neither option is set the content of the target field is the target text or empty.
-all Allows entries that have no text to be exported. If this option is not set (the default), entries that are empty or contain only codes or whitespace are not included in the output file. If this option is set, all entries are included in the output.

For example:

tikal -exp myProjectTM -sl en -tl IT

Creates a TMX document from the Pensieve TM stored in myProjectTM. The source language is English and the target Italian.

Note that the command -exp is a shortcut for the -2tmx command (Convert to TMX Format). You can export Pensieve TM entries in table, TMX or PO format using the filter configuration okf_pensieve. For instance, the example above can also be executed using the following command:

tikal -2tmx myProjectTM -fc okf_pensieve -sl en -tl IT

Translation Resources Details

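Tikal provides access to several translation resources: some are machine translation (MT) systems, some are translation memory (TM) systems, and some are other kinds of searchable translation repositories. The following resources are available:

Pensieve TM (-pen)
Translate Toolkit TM (-tt)
GlobalSight TM (-gs)
Open-Tran Translation Repository (-opentran)
MyMemory TM (-mm)
Google MT (-google)
Apertium MT (-apertium)
Microsoft MT (-ms)
TDA Translation Repository (-tda)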

Note that some of these resources are Web-based and require an internet connection, others can be installed and used locally.

Pensieve TM

This is the Okapi framework's own TM engine. It is still under development, but can be used. This resource is indicated by the option -pen in Tikal and takes one argument: the directory of the TM for a local TM, or the URL of the server for a remote TM. If no argument is provided, the host http://localhost:8080 is used by default.

tikal -q "text to translate" -pen myTmDir

Translate Toolkit TM

The Translate Toolkit TM is an engine that comes with the Translate Toolkit, a nicely designed and well supported set of open-source localization tools. You can run the tool called tmserver on your own machine, or use it through the Web. You can find more information about the Translate Toolkit here: http://translate.sourceforge.net/wiki/toolkit/index and the help for tmserver is here: http://translate.sourceforge.net/wiki/toolkit/tmserver.

Note that the Translate Toolkit TM can take PO files as input. You can create such a PO file from TMX or other formats using the Convert to PO Format command.

This resource is indicated by the option -tt in Tikal and takes one argument: the host of the server, and optionally the port to use. Use localhost for a server running locally. If the port is omitted it is set to 8080 by default.

tikal -q "text to translate" -tt localhost
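
The hostname[:port] argument and its default port can be modeled as follows. This is a minimal Python sketch of the rule stated above; the function names are hypothetical, it is not Tikal's own parsing code, and it does not handle IPv6 addresses.

```python
from typing import Tuple


def split_host_port(value: str, default_port: int = 8080) -> Tuple[str, int]:
    """Split 'hostname[:port]', falling back to the default port 8080."""
    host, separator, port_text = value.partition(":")
    port = int(port_text) if separator else default_port
    return host, port


def server_url(value: str) -> str:
    """Build the HTTP address of the server from the -tt argument."""
    host, port = split_host_port(value)
    return "http://{}:{}".format(host, port)


if __name__ == "__main__":
    print(server_url("localhost"))       # http://localhost:8080
    print(server_url("localhost:9090"))  # http://localhost:9090
```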

GlobalSight TM

The GlobalSight TM engine is part of the open-source GlobalSight System. You need to have access to a server with the system installed in order to use this resource. You can find more information about the GlobalSight TMS here: http://www.globalsight.com/

This resource is indicated by the option -gs in Tikal and takes one argument: a configuration file that contains the information about the server to connect to, username and password, as well as the TM profile to use. The configuration file must look like this:

#v1
username=myusername
password=mypassword
serverURL=http://myhost:8080/globalsight/services/AmbassadorWebService?wsdl
tmProfile=myprofile
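
If you need to generate such a configuration file from a script, the layout above (a #v1 line followed by key=value lines) is easy to produce. The following Python sketch renders the text of the file. The function name is hypothetical, the values are the placeholders from the example above, and any escaping that special characters in values might require is not covered.

```python
from typing import Mapping


def render_config(params: Mapping[str, str]) -> str:
    """Render a '#v1' configuration: one key=value pair per line."""
    lines = ["#v1"]
    lines += ["{}={}".format(key, value) for key, value in params.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values copied from the example configuration above.
    text = render_config({
        "username": "myusername",
        "password": "mypassword",
        "serverURL": "http://myhost:8080/globalsight/services/AmbassadorWebService?wsdl",
        "tmProfile": "myprofile",
    })
    print(text, end="")
```

The resulting text can be written to a file whose name is then passed to the -gs option.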

Open-Tran Translation Repository

The Open-Tran project is an open-source repository of translations of open-source software. It provides access to its entries through Web services. You can find more information about Open-Tran here: http://open-tran.eu/

This resource is indicated by the option -opentran in Tikal (without argument). Note that the results returned by the Open-Tran server have no meaningful scores from a TM viewpoint, so the Okapi connector recalculates, re-sorts, and re-filters the results. Note also that this resource is available only for the Query command, because it would be too slow to use for the other commands.

tikal -q "text to translate" -opentran

MyMemory TM

The MyMemory project is a central repository of public and private TMs. It offers access through a Web service interface, and can provide MT fall-back translation when no good matches are found. You can find more information about MyMemory here: http://mymemory.translated.net/

This resource is indicated by the option -mm and takes one argument: an access key. You can use the default access key mmDemo123 for testing the resource. For real work you may want to open an account at MyMemory.

tikal -q "text to translate" -mm mmDemo123

Google MT

The Google MT engine is well-known and widely used. Okapi provides a connector for the Ajax API. You can find more information about Google MT here: http://code.google.com/apis/ajaxlanguage/.

This resource is indicated by the option -google in Tikal (without argument). It is the default for the Query command.

tikal -q "text to translate" -google

Apertium MT

Apertium is an open-source rule-based MT project. It provides translation for many language pairs, including less-common languages such as Catalan, Galician, Welsh, Esperanto, Occitan, etc. You can find more information about Apertium here: http://wiki.apertium.org/. The connector uses the JSONP REST API described here: http://wiki.apertium.org/wiki/Apertium_web_service

This resource is indicated by the option -apertium and takes one optional argument: a configuration file that contains the information about the server to connect to. The configuration file looks something like this:

#v1
server=http://api.apertium.org/json/translate
apiKey=myApiKey

Note that the apiKey parameter is optional. However, using an API key is highly recommended. See http://api.apertium.org/register.jsp for details and to register one.

If the configuration file is omitted, the default main public apertium.org server is used, without API key.

tikal -q "text to translate" -apertium myApertium.cfg

Microsoft MT

The Microsoft MT engine is freely available from Microsoft (http://www.microsofttranslator.com/). Okapi provides a connector for the SOAP API. You can get more information about this API and its terms here: http://sdk.microsofttranslator.com/. To use this connector you need an AppID from Microsoft. You can get one at http://www.bing.com/developers/appids.aspx.

This resource is indicated by the option -ms in Tikal and takes one argument: a configuration file that contains the information to connect to and use the Microsoft MT service. The configuration file must look like this:

#v1
appId=mypersonalappid

To use the Microsoft MT engine, call Tikal like this (where myMS.cfg is your configuration file):

tikal -q "text to translate" -ms myMS.cfg

TDA Translation Repository

The TAUS Data Association (TDA) offers a public Search facility on a large corpus of translations. Okapi provides a connector for the REST API provided by TDA. The TDA Web Search is accessible from here: http://www.tausdata.org/index.php/taus-search.

This resource is indicated by the option -tda in Tikal and takes one argument: a configuration file that contains the information to access the TDA search service. The configuration file must look like this:

#v1
server=http://www.tausdata.org/api
appKey=myAppKey
username=myTDAUsername
password=myTDAPassword
industry.i=0
contentType.i=0

For example, to access TDA Search, call Tikal like this (where myTDA.cfg is your configuration file):

tikal -q "terms to search" -tda myTDA.cfg -sl en-us -tl fr-fr