Okapi Framework - User contributions [en]

Okapi Filters Plugin for OmegaT

2017-04-20T11:44:23Z

Amake:

__TOC__
==Overview==

[http://www.omegat.org/ OmegaT] is a free and open-source translation tool that offers support for many file formats. It also provides a plugin mechanism to use additional filters.

Several of the [[Filters|Okapi filters]] have been packaged into a plugin that works with OmegaT's plugin interface. This allows you to use the filters directly from within OmegaT.

==Filters Included==

Support for the following formats is currently included:

* Doxygen-commented files (using the [[Doxygen Filter]])
* HTML files (using the [[HTML Filter]])
* InDesign IDML files (using the [[IDML Filter]])
* JSON files (using the [[JSON Filter]])
* Qt TS files (using the [[TS Filter]])
* Trados TagEditor TTX files (using the [[TTX Filter]])
* Transifex projects (using the [[Transifex Filter]])
* Wordfast Pro TXML files (using the [[TXML Filter]])
* XLIFF 1.2 documents (using the [[XLIFF Filter]])
* XLIFF 2 documents ([[#Support for XLIFF 2|see more information]])
* XML files (using the [[XML Filter]])
* XML files (using the [[XML Stream Filter]])
* YAML files (using the [[YAML Filter]])

{{WarningBox|Note that the TTX filter is set by default to auto-detect pre-segmented files.
* If no segments are detected, the filter extracts all text by creating its own TTX segmentation.
* If one or more segments are detected, '''only the existing segments are passed to OmegaT'''. So if a file is only half segmented, you will not get the unsegmented text in OmegaT. In those cases you can:
** define your own filter settings file for TTX
** use [[Rainbow]] to create an OmegaT project where the TTX filter is forced to extract the non-segmented text.}}

Note that several of the formats supported by the plug-in are also supported by OmegaT native filters. You select which filter to use by enabling or disabling it in the <cite>File Filters</cite> dialog. If several filters are set for a given format, the first one in the list is used by default.

==Filter Parameters==

Starting in m24, you can specify a custom filter parameters file for each filter that supports options.

Use OmegaT's <cite>Options</cite> button in the <cite>File Filters</cite> dialog box to select whether you want to use the default settings, or a custom filter parameters file (<code>.fprm</code> extension) where you have stored your options.

You cannot create or edit the filter parameters file from OmegaT, but you can use [[Rainbow]] to do this (menu <cite>Tools</cite> > <cite>Filter Configurations</cite>).

{{WarningBox|
* All filter parameters files you use in OmegaT must be in the same directory.
* Make sure the parameters files have the extension <code>.fprm</code> and start with the filter identifier. For example: <code>okf_idml@myConfig.fprm</code>, not just <code>myConfig.fprm</code>.}}
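The naming rule above can be checked before copying a file into the parameters directory. The following is a minimal sketch in Python; the function name <code>is_valid_fprm_name</code> is hypothetical, and the <code>@</code> separator is inferred only from the <code>okf_idml@myConfig.fprm</code> example.

```python
def is_valid_fprm_name(file_name, filter_id):
    """Check that file_name looks like '<filter_id>@<name>.fprm'.

    Assumption: the '@' separator is taken from the documented example
    'okf_idml@myConfig.fprm'; this helper is not part of Okapi or OmegaT.
    """
    stem, dot, extension = file_name.rpartition(".")
    if dot != "." or extension != "fprm":
        return False
    prefix = filter_id + "@"
    # The stem must start with the filter identifier and have a non-empty name part.
    return stem.startswith(prefix) and len(stem) > len(prefix)


if __name__ == "__main__":
    print(is_valid_fprm_name("okf_idml@myConfig.fprm", "okf_idml"))
    print(is_valid_fprm_name("myConfig.fprm", "okf_idml"))
```

The first call reports a valid name, while the second is rejected because the filter identifier is missing.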

==Requirements==
* Starting with m24, make sure you are using Java 1.7 or above (OmegaT 3.6 and earlier can run on lower versions of Java).
* Make sure you have [http://www.omegat.org/en/downloads.html OmegaT 2.2.3 or above].
** OmegaT 4.0.0+ requires version 1.0-m30+ of this plugin.

==Download and Installation==

Download the file <code>okapiFiltersForOmegaT-<version>-dist.zip</code> from:

* '''https://bintray.com/okapi/Distribution/OmegaT_Plugin (for the stable release)'''

* or http://okapiframework.org/snapshots (for the development snapshot)

To install the plugin:

* Locate your OmegaT <code>plugins</code> directory (see your platform below).
* Copy the plugin's JAR file to the <code>plugins</code> directory.
* Restart OmegaT.
* Some file formats (like XLIFF, HTML, etc.) already have native OmegaT filters. OmegaT uses the first enabled filter available. So if you want, for example, OmegaT to use the Okapi XLIFF filter rather than the native OmegaT XLIFF filter, make sure to disable (uncheck) the OmegaT native filter in the <cite>File Filters</cite> dialog (menu: <cite>Options > File Filters</cite>).

===Windows===

On Windows you can install the plugin to the <code>plugins</code> directory where OmegaT is installed (e.g. <code>C:\Program Files\OmegaT</code>) or to your Application Data directory:
* Windows XP: <code>C:\Documents and Settings\<username>\Application Data\OmegaT</code>
* Windows Vista or later: <code>C:\Users\<username>\AppData\Roaming\OmegaT</code>

===Mac OS X===

On OS X it is recommended that you install the plugin to <code>/Users/<username>/Library/Preferences/OmegaT/plugins</code>. The <code>Library</code> folder in your home directory may be hidden; to access it from the Finder, select <cite>Go > Go to Folder</cite> from the main menu and enter <code>~/Library/Preferences/OmegaT/plugins</code>.

Okapi requires Java 1.7. The Mac-specific version of OmegaT 3.1.9 or later is bundled with Java 1.8, so you don't need to do anything. If you are running a "without JRE" version or an older version, you will have to install Java 1.7 or later and ensure that OmegaT is launched with it.

===Linux & BSD===

On Linux and BSD you can install the plugin to the <code>plugins</code> directory where OmegaT is installed (alongside <code>OmegaT.jar</code>) or to <code>~/.omegat/plugins</code>.

==Segmentation==

For the file formats that represent segments, such as TTX, be aware that the segmentation created by OmegaT is not carried back into the translated document. For example, an unsegmented paragraph of two sentences may be translated as two separate segments in OmegaT (and produce two TM entries), but it is merged back as a single paragraph (between segment markers because that is the only way to store translation) in the translated TTX file.

You can use the [[Segmentation Step]] in [[Rainbow]], or the [[Tikal - Miscellaneous Commands#Segment_Files|Segmentation command of Tikal]] to create a pre-segmented TTX file before opening it in OmegaT.

Note that any line-break in the source text is considered a segment break by Trados TagEditor, even when it is within an existing segment. Opening a segment that includes a line-break with TagEditor results in a segment within a segment.

==Pre-Translation==

TTX documents may contain segments that are already translated. The translation of such segments is loaded as the current translation in OmegaT.

Note that the target language of the OmegaT project must match the target language specified in the TTX file. The target language of a TTX file is defined in the attribute <code>TargetLanguage</code> of the <code><UserSettings></code> element.
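To check the target language before opening a TTX file, the attribute can be read with a few lines of Python. This is a sketch only: the sample document is a hypothetical, heavily reduced TTX file, and the case-insensitive comparison is an assumption, not a description of how OmegaT compares language codes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced TTX document used only for illustration.
SAMPLE_TTX = (
    '<TRADOStag Version="2.0"><FrontMatter>'
    '<UserSettings SourceLanguage="EN-US" TargetLanguage="FR-FR"/>'
    '</FrontMatter><Body/></TRADOStag>'
)


def ttx_target_language(ttx_text):
    """Return the TargetLanguage attribute of the first <UserSettings> element."""
    root = ET.fromstring(ttx_text)
    for settings in root.iter("UserSettings"):
        return settings.get("TargetLanguage")
    return None


def languages_match(ttx_language, project_language):
    """Compare two language codes, ignoring case (an assumption of this sketch)."""
    return ttx_language is not None and ttx_language.lower() == project_language.lower()


if __name__ == "__main__":
    language = ttx_target_language(SAMPLE_TTX)
    print(language, languages_match(language, "fr-FR"))
```

Real TTX files are often stored as UTF-16; in that case, parse the file with <code>ET.parse(path)</code> so the XML declaration determines the encoding.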

==Testing a Filter==

Some file formats are difficult to extract and merge. If you want to be sure that the translated file merges back properly and is a valid file, one step toward verifying this is to re-extract the merged file and compare the first extraction with the second.

* Open the original file in OmegaT.
* Save it.
* Go to the <code>target</code> directory and copy the file you have saved somewhere else.
* Go back to OmegaT and open the file you have just copied.
* Both files should have exactly the same source content. If they do not, it is likely that the saved file was not generated properly. You should [http://code.google.com/p/okapi/issues/list file a bug report] to make sure the problem is corrected.

==Support for XLIFF 2==

{{NoteBox|This feature is available starting from version 0.25 and is in beta.}}

* Only basic core support is implemented
* The <code>translate='no'</code> attribute in unit or annotations is not supported yet.
* The <code>canResegment='no'</code> flags are not supported yet.
* Annotations are stripped out
* No module is supported yet
* Existing target is set as OmegaT "fuzzy match" if its status is "initial" or "translated", not fuzzy for "reviewed" and "final".
* Inline codes are mapped to <code><gN></code>/<code></gN></code> and <code><xN/></code> for now.
* Notes are not displayed in OmegaT.
* etc.

You can find examples of valid XLIFF 2 documents [http://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/test-suite/valid/?op=dl&rev=0&isdir=1 in the SVN repository of the XLIFF TC].

== FAQ ==

'''With this plugin installed I get an error when opening my OmegaT project'''

If the error is <code>java.lang.NoSuchMethodError: org.omegat.filters2.FilterContext.getProjectProperties()Lorg/omegat/core/data/ProjectProperties;</code> then you are using an older version of this plugin with a newer version of OmegaT (see [[#Requirements]]). Update your plugin.

[[Category:Filters]]

Okapi Filters Plugin for OmegaT

2017-04-20T11:34:43Z

Amake: /* Download and Installation */

__TOC__
==Overview==

[http://www.omegat.org/ OmegaT] is a free and open-source translation tool that offers support for many file formats. It also provides a plugin mechanism to use additional filters.

Several of the [[Filters|Okapi filters]] have been packaged into a plugin that works with OmegaT's plugin interface. This allows you to use the filters directly from within OmegaT.

==Filters Included==

Support for the following formats is currently included:

* Doxygen-commented files (using the [[Doxygen Filter]])
* HTML files (using the [[HTML Filter]])
* InDesign IDML files (using the [[IDML Filter]])
* JSON files (using the [[JSON Filter]])
* Qt TS files (using the [[TS Filter]])
* Trados TagEditor TTX files (using the [[TTX Filter]])
* Transifex projects (using the [[Transifex Filter]])
* Wordfast Pro TXML files (using the [[TXML Filter]])
* XLIFF 1.2 documents (using the [[XLIFF Filter]])
* XLIFF 2 documents ([[#Support for XLIFF 2|see more information]])
* XML files (using the [[XML Filter]])
* XML files (using the [[XML Stream Filter]])
* YAML files (using the [[YAML Filter]])

{{WarningBox|Note that the TTX filter is set by default to auto-detect pre-segmented files.
* If no segments are detected, the filter extracts all text by creating its own TTX segmentation.
* If one or more segments are detected, '''only the existing segments are passed to OmegaT'''. So if a file is only half segmented, you will not get the unsegmented text in OmegaT. In those cases you can:
** define your own filter settings file for TTX
** use [[Rainbow]] to create an OmegaT project where the TTX filter is forced to extract the non-segmented text.}}

Note that several of the formats supported by the plug-in are also supported by OmegaT native filters. You select which filter to use by enabling or disabling it in the <cite>File Filters</cite> dialog. If several filters are set for a given format, the first one in the list is used by default.

==Filter Parameters==

Starting in m24, you can specify a custom filter parameters file for each filter that supports options.

Use OmegaT's <cite>Options</cite> button in the <cite>File Filters</cite> dialog box to select whether you want to use the default settings, or a custom filter parameters file (<code>.fprm</code> extension) where you have stored your options.

You cannot create or edit the filter parameters file from OmegaT, but you can use [[Rainbow]] to do this (menu <cite>Tools</cite> > <cite>Filter Configurations</cite>).

{{WarningBox|
* All filter parameters files you use in OmegaT must be in the same directory.
* Make sure the parameters files have the extension <code>.fprm</code> and start with the filter identifier. For example: <code>okf_idml@myConfig.fprm</code>, not just <code>myConfig.fprm</code>.}}

==Download and Installation==

Download the file <code>okapiFiltersForOmegaT-<version>-dist.zip</code> from:

* '''https://bintray.com/okapi/Distribution/OmegaT_Plugin (for the stable release)'''

* or http://okapiframework.org/snapshots (for the development snapshot)

To install the plugin:

* Starting with m24, make sure you are using Java 1.7 or above (OmegaT can run on lower versions of Java).
* Make sure you have [http://www.omegat.org/en/downloads.html OmegaT 2.2.3 or above].
** Note that OmegaT 4.0.0 or later requires version 1.0-m30 or later of this plugin.
* Locate your OmegaT <code>plugins</code> directory (see your platform below).
* Copy the plugin's JAR file to the <code>plugins</code> directory.
* Restart OmegaT.
* Some file formats (like XLIFF, HTML, etc.) already have native OmegaT filters. OmegaT uses the first enabled filter available. So if you want, for example, OmegaT to use the Okapi XLIFF filter rather than the native OmegaT XLIFF filter, make sure to disable (uncheck) the OmegaT native filter in the <cite>File Filters</cite> dialog (menu: <cite>Options > File Filters</cite>).

===Windows===

On Windows you can install the plugin to the <code>plugins</code> directory where OmegaT is installed (e.g. <code>C:\Program Files\OmegaT</code>) or to your Application Data directory:
* Windows XP: <code>C:\Documents and Settings\<username>\Application Data\OmegaT</code>
* Windows Vista or later: <code>C:\Users\<username>\AppData\Roaming\OmegaT</code>

===Mac OS X===

On OS X it is recommended that you install the plugin to <code>/Users/<username>/Library/Preferences/OmegaT/plugins</code>. The <code>Library</code> folder in your home directory may be hidden; to access it from the Finder, select <cite>Go > Go to Folder</cite> from the main menu and enter <code>~/Library/Preferences/OmegaT/plugins</code>.

Okapi requires Java 1.7. The Mac-specific version of OmegaT 3.1.9 or later is bundled with Java 1.8, so you don't need to do anything. If you are running a "without JRE" version or an older version, you will have to install Java 1.7 or later and ensure that OmegaT is launched with it.

===Linux & BSD===

On Linux and BSD you can install the plugin to the <code>plugins</code> directory where OmegaT is installed (alongside <code>OmegaT.jar</code>) or to <code>~/.omegat/plugins</code>.

==Segmentation==

For the file formats that represent segments, such as TTX, be aware that the segmentation created by OmegaT is not carried back into the translated document. For example, an unsegmented paragraph of two sentences may be translated as two separate segments in OmegaT (and produce two TM entries), but it is merged back as a single paragraph (between segment markers because that is the only way to store translation) in the translated TTX file.

You can use the [[Segmentation Step]] in [[Rainbow]], or the [[Tikal - Miscellaneous Commands#Segment_Files|Segmentation command of Tikal]] to create a pre-segmented TTX file before opening it in OmegaT.

Note that any line-break in the source text is considered a segment break by Trados TagEditor, even when it is within an existing segment. Opening a segment that includes a line-break with TagEditor results in a segment within a segment.

==Pre-Translation==

TTX documents may contain segments that are already translated. The translation of such segments is loaded as the current translation in OmegaT.

Note that the target language of the OmegaT project must match the target language specified in the TTX file. The target language of a TTX file is defined in the attribute <code>TargetLanguage</code> of the <code><UserSettings></code> element.

==Testing a Filter==

Some file formats are difficult to extract and merge. If you want to be sure that the translated file merges back properly and is a valid file, one step toward verifying this is to re-extract the merged file and compare the first extraction with the second.

* Open the original file in OmegaT.
* Save it.
* Go to the <code>target</code> directory and copy the file you have saved somewhere else.
* Go back to OmegaT and open the file you have just copied.
* Both files should have exactly the same source content. If they do not, it is likely that the saved file was not generated properly. You should [http://code.google.com/p/okapi/issues/list file a bug report] to make sure the problem is corrected.

==Support for XLIFF 2==

{{NoteBox|This feature is available starting from version 0.25 and is in beta.}}

* Only basic core support is implemented
* The <code>translate='no'</code> attribute in unit or annotations is not supported yet.
* The <code>canResegment='no'</code> flags are not supported yet.
* Annotations are stripped out
* No module is supported yet
* Existing target is set as OmegaT "fuzzy match" if its status is "initial" or "translated", not fuzzy for "reviewed" and "final".
* Inline codes are mapped to <code><gN></code>/<code></gN></code> and <code><xN/></code> for now.
* Notes are not displayed in OmegaT.
* etc.

You can find examples of valid XLIFF 2 documents [http://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/test-suite/valid/?op=dl&rev=0&isdir=1 in the SVN repository of the XLIFF TC].

[[Category:Filters]]

HTML Filter

2016-11-29T04:28:51Z

Amake: /* Inline Code Finder */ YAML block literal should not have escapes

{{Filters Header}}
==Overview==

The HTML Filter is an Okapi component that implements the IFilter interface for HTML and XHTML documents.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the input file has no declared encoding, the filter tries to add one in the output: a <code><meta></code> tag for HTML files, or a <code><meta /></code> tag for XHTML files. The declaration is added only if there is a <code><head></code> element in the file.
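The Byte-Order-Mark rule for UTF-8 output reduces to a single condition. The sketch below restates it in Python; the function name is hypothetical, and the function covers only the UTF-8 output case described in this section.

```python
def utf8_output_uses_bom(input_encoding, input_had_bom):
    """Decide whether a UTF-8 output document gets a Byte-Order-Mark.

    Restates the documented rule: a BOM is written only when the input
    was also UTF-8 and a BOM was detected in it. The encoding name
    normalization is an assumption of this sketch.
    """
    normalized = input_encoding.strip().lower().replace("_", "-")
    input_is_utf8 = normalized in ("utf-8", "utf8")
    return input_is_utf8 and input_had_bom


if __name__ == "__main__":
    for encoding, had_bom in [("UTF-8", True), ("UTF-8", False), ("windows-1252", False)]:
        print(encoding, had_bom, utf8_output_uses_bom(encoding, had_bom))
```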

===Line-Breaks===

The output uses the same type of line-break as the original input.

===Entities===

Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.

Note that text entity declarations can be processed by the [[DTD Filter]].

==Parameters==

===Built-in Configuration===

The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group will never be created.

There is a predefined maximalist configuration (<code>okf_html-wellFormed</code>) that can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well formed, that is, they must have a start and end tag. Otherwise the filter returns an error.

===HTML Configuration Syntax===

For the truly brave, you can create your own HTML configuration files. These configurations are written in [http://www.yaml.org/ YAML]. See the <code>[https://bitbucket.org/okapiframework/okapi/src/master/okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/wellformedConfiguration.yml wellformedConfiguration.yml]</code> and <code>[https://bitbucket.org/okapiframework/okapi/src/master/okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/nonwellformedConfiguration.yml nonwellformedConfiguration.yml]</code> for examples.

HTML tags are associated with rules. These rules are used by the filter to process the input document.

Notes:

* All attribute and element names should be in '''lowercase''' in the configuration file, regardless of their casing in the document.
* Elements or attributes with a prefix should be declared with the prefix (and between single quotes) in the configuration (e.g. <code>'xml:lang'</code>).

==== Configuring Element Rules ====

The <code>elements</code> section of the configuration consists of a set of key-value pairs. Each key is an element name, and the value is the rules for that element, represented as another set of key-value pairs. An element declaration should include one or more of the available element rules:
{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>ruleTypes</code>
| Basic description of how the filter treats this tag. See [[#Rule Types]].
|-
| <code>idAttributes</code>
| A list containing attributes which may provide the segment ID for text contained within this element.
|-
| <code>conditions</code>
| A condition that further restricts this rule. For example, to indicate that the element should only be handled if it contains an attribute with a certain value. See [[#Condition Syntax]].
|-
| <code>translatableAttributes</code>
| Contains information about translatable attributes in this element. See [[#Configuring Translatable Attributes]].
|-
| <code>elementType</code>
| Indicates the corresponding XLIFF 1.2 <code>type</code> value for this element.
|-
| <code>writableLocalizationAttributes</code>
| Specifies attributes which are writable, but not translatable. (TODO)
|}

==== Rule Types ====
The rule types are the following:

{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>INLINE</code>
| A tag which may occur inside a text run. For example <code><b></code>, <code><i></code>, and <code><u></code>.
|-
| <code>GROUP</code>
| Defines a group of elements that are structurally bound. For example <code><table></code>, <code><div></code> and <code><menu></code>.
|-
| <code>EXCLUDE</code>
| Prevents extraction of any text until the end tag of the same element is found. For example, if the content of a <code><script></code> element should not be extracted, then define <code><script></code> as <code>EXCLUDE</code>.
|-
| <code>INCLUDE</code>
| Overrides any current exclusions. This allows exceptions for children of <code>EXCLUDE</code>d elements.
|-
| <code>TEXTUNIT</code>
| A tag that starts a complex text unit. Examples include <code><p></code>, <code><title></code>, <code><h1></code>. Complex text units carry their surrounding tags along with any extracted text.
|-
| <code>PRESERVE_WHITESPACE</code>
| A tag that must preserve its white spaces as-is. For example <code><pre></code>.
|-
| <code>ATTRIBUTES_ONLY</code>
| A tag that has localizable or translatable attributes but does not have translatable content.
|-
| <code>ATTRIBUTE_TRANS</code>
| A translatable attribute.
|-
| <code>ATTRIBUTE_WRITABLE</code>
| A writable or modifiable attribute, but not translatable.
|-
| <code>ATTRIBUTE_READONLY</code>
| A read-only attribute, extracted but that cannot be modified.
|}

==== Configuring Translatable Attributes ====
Translatable attributes may be specified in two ways, depending on the level of complexity needed.

If all the specified attributes should always be translated, they can be exposed as a simple list. For example, the definition for the <code><area></code> element specifies that <code>accesskey</code>, <code>area</code>, and <code>alt</code> attributes are translatable:
<nowiki> area:
   ruleTypes: [ATTRIBUTES_ONLY]
   translatableAttributes: [accesskey, area, alt]</nowiki>

However, if additional restrictions on translatable attributes are present, the <code>translatableAttributes</code> rule may be specified as a set of key-value pairs, with each key being a translatable attribute and each value being an (optional) list of conditions, using the [[#Condition Syntax]]. For example, this snippet defines the handling of the <code><input></code> element in the built-in configurations:
<nowiki>
input:
  ruleTypes: [INLINE]
  translatableAttributes:
    alt: [type, NOT_EQUALS, [file, hidden, image, password]]
    value: [type, NOT_EQUALS, [file, hidden, image, password]]
    accesskey: [type, NOT_EQUALS, [file, hidden, image, password]]
    title: [type, NOT_EQUALS, [file, hidden, image, password]]</nowiki>

This specifies that there are four attributes (<code>alt</code>, <code>value</code>, <code>accesskey</code>, and <code>title</code>) that are translatable. The translatability of each of these attributes is conditional on the <code><input></code> element not having particular <code>type</code> values.

==== Condition Syntax ====

Rule conditions are expressed as a list of the form
<nowiki>[attribute, operation, value]</nowiki>

{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>attribute</code>
| The name of the attribute which the condition applies to.
|-
| <code>operation</code>
| Available operations are <code>EQUALS</code>, <code>NOT_EQUALS</code>, and <code>MATCHES</code>. <code>EQUALS</code> and <code>NOT_EQUALS</code> test for (case-insensitive) string matches, while <code>MATCHES</code> uses a regular expression.
|-
| <code>value</code>
| The value of the attribute to be compared using the operation.
|}
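To make the condition semantics concrete, here is a small Python sketch of an evaluator for <code>[attribute, operation, value]</code> conditions. It is not the filter's implementation. Three points are assumptions: a list value is treated as "any of these values", <code>MATCHES</code> must match the whole attribute value, and a missing attribute is never equal to anything.

```python
import re


def condition_holds(condition, attributes):
    """Evaluate an [attribute, operation, value] condition against a tag's attributes.

    attributes is a dict of attribute name -> value for one element.
    Assumptions of this sketch: list values mean "any of", MATCHES uses a
    full match, and a missing attribute never equals an expected value.
    """
    attribute_name, operation, expected = condition
    actual = attributes.get(attribute_name)
    candidates = expected if isinstance(expected, list) else [expected]

    if operation == "MATCHES":
        return actual is not None and any(
            re.fullmatch(pattern, actual) for pattern in candidates
        )

    # EQUALS and NOT_EQUALS compare strings case-insensitively.
    is_equal = actual is not None and any(
        actual.lower() == candidate.lower() for candidate in candidates
    )
    if operation == "EQUALS":
        return is_equal
    if operation == "NOT_EQUALS":
        return not is_equal
    raise ValueError("unknown operation: " + operation)


if __name__ == "__main__":
    rule = ["type", "NOT_EQUALS", ["file", "hidden", "image", "password"]]
    print(condition_holds(rule, {"type": "text"}))    # attribute would be translatable
    print(condition_holds(rule, {"type": "HIDDEN"}))  # attribute would be skipped
```

With the <code><input></code> rule shown earlier, an element with <code>type="text"</code> satisfies the condition, while <code>type="HIDDEN"</code> does not because the comparison ignores case.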

===Inline Code Finder===

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables that need to be protected from modification and treated as codes. Use the <code>useCodeFinder</code> and <code>codeFinderRules</code> options for this.

useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"

Note that the regular expression is "<code>\bVAR\d\b</code>" but you must escape the backslash in the YAML notation as well.

You can also use this alternate syntax, which is slightly easier to read:

useCodeFinder: true
codeFinderRules: |-
  #v1
  count.i=1
  rule0=\bVAR\d\b

The options above will set the text "<code>VAR1</code>" as in-line code in the following HTML:

<p>Number of files = VAR1</p>

To facilitate the creation of code finder rules [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].
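A rule can also be tried out on sample text before it goes into the configuration. The following Python sketch applies the <code>\bVAR\d\b</code> expression from the example above. The filter itself uses Java regular expressions, which behave the same way for this simple pattern but can differ from Python for more advanced constructs.

```python
import re

# The rule0 expression from the example configuration.
RULE = re.compile(r"\bVAR\d\b")


def find_inline_codes(text):
    """Return (start, end, matched_text) for every span the rule would protect."""
    return [(match.start(), match.end(), match.group()) for match in RULE.finditer(text)]


if __name__ == "__main__":
    for start, end, code in find_inline_codes("Number of files = VAR1"):
        print(start, end, code)
```

Because of the trailing <code>\b</code>, a string such as <code>VAR12</code> is not matched: the rule covers exactly one digit.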

===Character Entity References in Output===

By default, extended characters are not output as character entity references (e.g. <code>&copy;</code> for the character '©').

You can change this by specifying the <code>escapeCharacters</code> rule with a string of all the characters you wish to see output as character entity reference. Any specified character that is not extended or has no HTML character entity defined is processed like a normal character.

For example, given the following rule:

escapeCharacters: "© €µÆĄ"

The output of <code><p>© €µÆĄ</p></code> (assuming the output encoding is UTF-8) will be:

<p>&copy;&nbsp;&euro;&micro;&AElig;Ą</p>

Only the character <code>Ą</code> (U+0104) is not represented as an entity reference because there is no HTML character entity defined for it.
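The behavior described above can be reproduced with Python's standard table of HTML 4 entity names. This is an illustrative sketch, not the filter's code. It assumes that "extended" means non-ASCII and that the set of defined entities corresponds to <code>html.entities.codepoint2name</code>.

```python
from html.entities import codepoint2name


def escape_selected(text, escape_characters):
    """Replace the selected characters with named entity references.

    A character is replaced only if it is listed in escape_characters, is
    non-ASCII (assumed meaning of "extended"), and has an HTML 4 entity name.
    """
    pieces = []
    for character in text:
        name = codepoint2name.get(ord(character))
        if character in escape_characters and ord(character) > 127 and name:
            pieces.append("&" + name + ";")
        else:
            pieces.append(character)
    return "".join(pieces)


if __name__ == "__main__":
    # (c), no-break space, euro, micro, AE ligature, A with ogonek
    characters = "\u00a9\u00a0\u20ac\u00b5\u00c6\u0104"
    print(escape_selected("<p>" + characters + "</p>", characters))
```

U+0104 is left unchanged because the table has no entity name for it, which matches the example above.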

===Inline CDATA===

For formats that use CDATA in ways that undesirably break the flow of text, you can set the filter to treat CDATA as if it were an inline element, like so:

inlineCdata: true

Then markup such as <code><p>Text with <![CDATA[inline]]> CDATA</p></code> will be extracted as if <code><![CDATA[</code> was a regular inline opening tag and <code>]]></code> was a regular inline closing tag.

===Excluding By Default===

Normally, there is an implicit "default rule" to include elements. If the filter configuration contained no tag information at all, the default behavior of the filter would be to expose all PCDATA for translation. Sometimes it is useful to change this behavior in order to make your configuration more concise. This can be done by setting the <code>exclude_by_default</code> option in your config.

For example, if you wanted a custom configuration that exposed the <code><title></code> element for translation but nothing else, you could specify this as:

exclude_by_default: true
# ... other configuration
elements:
  title:
    ruleTypes: [TEXTUNIT]

===Quote Mode===
Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:

quoteModeDefined: true
quoteMode: 3

'''Current quote modes:'''

* Do not escape single or double quotes: '''UNESCAPED = 0'''
* Escape single and double quotes to a named entity: '''ALL = 1'''
* Escape double quotes to a named entity, and single quotes to a numeric entity: '''NUMERIC_SINGLE_QUOTES = 2'''
* Escape double quotes only: '''DOUBLE_QUOTES_ONLY = 3'''
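The four modes can be summarized as a small Python function. This is a sketch under stated assumptions: the entity spellings <code>&amp;quot;</code>, <code>&amp;apos;</code> and <code>&amp;#39;</code> are the usual HTML forms and are assumed here, and the exact strings written by the filter are not confirmed by this page.

```python
# Mode values as listed above.
UNESCAPED = 0
ALL = 1
NUMERIC_SINGLE_QUOTES = 2
DOUBLE_QUOTES_ONLY = 3


def escape_quotes(text, quote_mode):
    """Escape quote characters according to quote_mode.

    Assumption: '&quot;', '&apos;' and '&#39;' are the entity spellings used;
    the filter's actual output strings are not documented here.
    """
    if quote_mode == UNESCAPED:
        return text
    if quote_mode not in (ALL, NUMERIC_SINGLE_QUOTES, DOUBLE_QUOTES_ONLY):
        raise ValueError("unknown quote mode: %r" % (quote_mode,))
    escaped = text.replace('"', "&quot;")
    if quote_mode == ALL:
        escaped = escaped.replace("'", "&apos;")
    elif quote_mode == NUMERIC_SINGLE_QUOTES:
        escaped = escaped.replace("'", "&#39;")
    return escaped


if __name__ == "__main__":
    sample = "say \"hi\" to 'them'"
    for mode in (UNESCAPED, ALL, NUMERIC_SINGLE_QUOTES, DOUBLE_QUOTES_ONLY):
        print(mode, escape_quotes(sample, mode))
```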

==Limitations==

* In the current version of the filter the content of <code><style></code> and <code><script></code> elements is not extracted.
* Tags from server-side scripts such as PHP, ASPX, JSP, etc. are not formally supported and will be treated as non-translatable.

[[Category:Filters]]

HTML Filter

2016-03-26T01:27:53Z

Amake: /* HTML Configuration Syntax */

{{Filters Header}}
==Overview==

The HTML Filter is an Okapi component that implements the IFilter interface for HTML and XHTML documents.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration it is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the input file has no declared encoding, the filter tries to add one in the output: a <code><meta></code> tag for HTML files, or a <code><meta /></code> tag for XHTML files. The declaration is added only if there is a <code><head></code> element in the file.

===Line-Breaks===

The output uses the same type of line-break as the original input.

===Entities===

Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.

Note that text entity declarations can be processed by the [[DTD Filter]].

==Parameters==

===Built-in Configuration===

The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group will never be created.

There is a predefined maximalist configuration (<code>okf_html-wellFormed</code>) that can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well formed, that is, they must have a start and end tag. Otherwise the filter returns an error.

===HTML Configuration Syntax===

For the truly brave, you can create your own HTML configuration files. These configurations are written in [http://www.yaml.org/ YAML]. See the <code>[https://bitbucket.org/okapiframework/okapi/src/master/okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/wellformedConfiguration.yml wellformedConfiguration.yml]</code> and <code>[https://bitbucket.org/okapiframework/okapi/src/master/okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/nonwellformedConfiguration.yml nonwellformedConfiguration.yml]</code> for examples.

HTML tags are associated with rules. These rules are used by the filter to process the input document.

Notes:

* All attribute and element names should be in '''lowercase''' in the configuration file, regardless of their casing in the document.
* Elements or attributes with a prefix should be declared with the prefix (and between single quotes) in the configuration (e.g. <code>'xml:lang'</code>).

==== Configuring Element Rules ====

The <code>elements</code> section of the configuration consists of a set of key-value pairs. Each key is an element name, and the value is the rules for that element, represented as another set of key-value pairs. An element declaration should include one or more of the available element rules:
{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>ruleTypes</code>
| Basic description of how the filter treats this tag. See [[#Rule Types]].
|-
| <code>idAttributes</code>
| A list containing attributes which may provide the segment ID for text contained within this element.
|-
| <code>conditions</code>
| A condition that further restricts this rule. For example, to indicate that the element should only be handled if it contains an attribute with a certain value. See [[#Condition Syntax]].
|-
| <code>translatableAttributes</code>
| Contains information about translatable attributes in this element. See [[#Configuring Translatable Attributes]].
|-
| <code>elementType</code>
| Indicates the corresponding XLIFF 1.2 <code>type</code> value for this element.
|-
| <code>writableLocalizationAttributes</code>
| Specifies attributes which are writable, but not translatable. (TODO)
|}

==== Rule Types ====
The rule types are the following:

{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>INLINE</code>
| A tag which may occur inside a text run. For example <code><b></code>, <code><i></code>, and <code><u></code>.
|-
| <code>GROUP</code>
| Defines a group of elements that are structurally bound. For example <code><table></code>, <code><div></code> and <code><menu></code>.
|-
| <code>EXCLUDE</code>
| Prevents extraction of any text until the end tag of the same element is found. For example, if the content of a <code><script></code> element should not be extracted then define <code><script></code> as <code>EXCLUDE</code>.
|-
| <code>INCLUDE</code>
| Overrides any current exclusions. This allows exceptions for children of <code>EXCLUDE</code>d elements.
|-
| <code>TEXTUNIT</code>
| A tag that starts a complex text unit. Examples include <code><p></code>, <code><title></code>, <code><h1></code>. Complex text units carry their surrounding tags along with any extracted text.
|-
| <code>PRESERVE_WHITESPACE</code>
| A tag that must preserve its white spaces as-is. For example <code><pre></code>.
|-
| <code>ATTRIBUTES_ONLY</code>
| A tag that has localizable or translatable attributes but does not have translatable content.
|-
| <code>ATTRIBUTE_TRANS</code>
| A translatable attribute.
|-
| <code>ATTRIBUTE_WRITABLE</code>
| A writable or modifiable attribute, but not translatable.
|-
| <code>ATTRIBUTE_READONLY</code>
| A read-only attribute, extracted but that cannot be modified.
|}

==== Configuring Translatable Attributes ====
Translatable attributes may be specified in two ways, depending on the level of complexity needed.

If all the specified attributes should always be translated, they can be exposed as a simple list. For example, the definition for the <code><area></code> element specifies that <code>accesskey</code>, <code>area</code>, and <code>alt</code> attributes are translatable:
 <nowiki>area:
   ruleTypes: [ATTRIBUTES_ONLY]
   translatableAttributes: [accesskey, area, alt]</nowiki>

However, if additional restrictions on translatable attributes are present, the <code>translatableAttributes</code> rule may be specified as a set of key-value pairs, with each key being a translatable attribute and each value being an (optional) list of conditions, using the [[#Condition Syntax]]. For example, this snippet defines the handling of the <code><input></code> element in the built-in configurations:
 <nowiki>
 input:
   ruleTypes: [INLINE]
   translatableAttributes:
     alt: [type, NOT_EQUALS, [file, hidden, image, password]]
     value: [type, NOT_EQUALS, [file, hidden, image, password]]
     accesskey: [type, NOT_EQUALS, [file, hidden, image, password]]
     title: [type, NOT_EQUALS, [file, hidden, image, password]]</nowiki>

This specifies that there are four attributes (<code>alt</code>, <code>value</code>, <code>accesskey</code>, and <code>title</code>) that are translatable. The translatability of each of these attributes is conditional on the <code><input></code> element not having particular <code>type</code> values.

==== Condition Syntax ====

Rule conditions are expressed as a list of the form
<nowiki>[attribute, operation, value]</nowiki>

{| border="1" cellpadding="5" cellspacing="0"
|-
| <code>attribute</code>
| The name of the attribute which the condition applies to.
|-
| <code>operation</code>
| Available operations are <code>EQUALS</code>, <code>NOT_EQUALS</code>, and <code>MATCHES</code>. <code>EQUALS</code> and <code>NOT_EQUALS</code> test for (case-insensitive) string matches, while <code>MATCHES</code> uses a regular expression.
|-
| <code>value</code>
| The value of the attribute to be compared using the operation.
|}
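
As a rough illustration of how such a condition could be evaluated, here is a minimal Python sketch. It is not the filter's actual code: the function name <code>condition_holds</code> is hypothetical, and three behaviors are assumptions made for the example: a list value is read as "any of these values", <code>MATCHES</code> must match the whole attribute value, and a missing attribute satisfies <code>NOT_EQUALS</code>.

```python
import re


def condition_holds(condition, attributes):
    """Evaluate [attribute, operation, value] against a tag's attributes (a dict)."""
    name, operation, value = condition
    actual = attributes.get(name)
    # Assumption: a list value means "any of these values".
    candidates = value if isinstance(value, list) else [value]
    if operation == "MATCHES":
        # Assumption: the pattern must match the whole attribute value.
        return actual is not None and any(
            re.fullmatch(pattern, actual) for pattern in candidates)
    # EQUALS and NOT_EQUALS compare strings case-insensitively.
    equal = actual is not None and any(
        actual.lower() == candidate.lower() for candidate in candidates)
    if operation == "EQUALS":
        return equal
    if operation == "NOT_EQUALS":
        return not equal
    raise ValueError(f"unknown operation: {operation}")


# The condition used for the <input> element in the built-in configurations.
not_hidden = ["type", "NOT_EQUALS", ["file", "hidden", "image", "password"]]
print(condition_holds(not_hidden, {"type": "text"}))
print(condition_holds(not_hidden, {"type": "HIDDEN"}))
```

With these assumptions, an <code><input type="text"></code> element passes the condition and an <code><input type="HIDDEN"></code> element does not.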

===Inline Code Finder===

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables that need to be protected from modification and treated as codes. Use the <code>useCodeFinder</code> and <code>codeFinderRules</code> options for this.

useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"

The options above will set the text "<code>VAR1</code>" as in-line code in the following HTML:

<p>Number of files = VAR1</p>

Note that the regular expression is "<code>\bVAR\d\b</code>" but you must escape the backslash in the YAML notation as well.
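
The double escaping can be checked with a small Python sketch. It only illustrates the example above and is not the filter's implementation: Python string literals happen to treat <code>\n</code> and <code>\\</code> the same way as a YAML double-quoted string, the <code>parse_rules</code> helper is hypothetical, and the rule-string layout (one <code>ruleN=pattern</code> entry per line) is inferred from the example only. Python's <code>re</code> module stands in for the filter's regular expression engine, which behaves the same for this simple pattern.

```python
import re

# Same escape sequences as the YAML double-quoted value in the example above.
code_finder_rules = "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"


def parse_rules(rules):
    """Collect the patterns of the ruleN=... entries (layout assumed from the example)."""
    patterns = []
    for line in rules.splitlines():
        key, sep, value = line.partition("=")
        if sep and re.fullmatch(r"rule\d+", key):
            patterns.append(value)
    return patterns


patterns = parse_rules(code_finder_rules)
text = "Number of files = VAR1"
# Every match is a span that would be protected as an inline code.
spans = [m.group(0) for p in patterns for m in re.finditer(p, text)]
print(spans)  # ['VAR1']
```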

To facilitate the creation of code finder rules [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].

===Character Entity References in Output===

By default, extended characters are not written as character entity references in the output (e.g. <code>&copy;</code> for the character '©').

You can change this by specifying the <code>escapeCharacters</code> rule with a string of all the characters you wish to see output as character entity references. Any specified character that is not extended or has no HTML character entity defined is processed like a normal character.

For example, given the following rule:

escapeCharacters: "© €µÆĄ"

The output of <code><p>© €µÆĄ</p></code> (assuming the output encoding is UTF-8, and where the character after <code>©</code> is a no-break space, U+00A0) will be:

<p>&copy;&nbsp;&euro;&micro;&AElig;Ą</p>

Only the character <code>Ą</code> (U+0104) is not represented as an entity reference because there is no HTML character entity defined for it.
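
The behavior described above can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the filter's code: the function name <code>escape_characters</code> is hypothetical, "extended" is taken to mean non-ASCII, and Python's HTML 4 entity table (<code>html.entities.codepoint2name</code>) stands in for whatever entity list the filter uses.

```python
from html.entities import codepoint2name


def escape_characters(text, escape_chars):
    """Replace listed non-ASCII characters with named entities when one exists."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Assumption: "extended" means non-ASCII (code point above 127).
        if ch in escape_chars and ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)
    return "".join(out)


# Copyright sign, no-break space, euro sign, micro sign, AE ligature, A with ogonek.
rule = "\u00a9\u00a0\u20ac\u00b5\u00c6\u0104"
print(escape_characters("<p>" + rule + "</p>", rule))
```

In this sketch U+0104 is left untouched because the HTML 4 table has no entry for it, which mirrors the result shown above.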

===Inline CDATA===

For formats that use CDATA in ways that undesirably break the flow of text, you can set the filter to treat CDATA as if it were an inline element, like so:

inlineCdata: true

Then markup such as <code><p>Text with <![CDATA[inline]]> CDATA</p></code> will be extracted as if <code><![CDATA[</code> was a regular inline opening tag and <code>]]></code> was a regular inline closing tag.

===Excluding By Default===

Normally, there is an implicit "default rule" to include elements. If the filter configuration contained no tag information at all, the default behavior of the filter would be to expose all PCDATA for translation. Sometimes it is useful to change this behavior in order to make your configuration more concise. This can be done by setting the <code>exclude_by_default</code> option in your config.

For example, if you wanted a custom configuration that exposed only the <code><title></code> element for translation, you could specify it as:

 exclude_by_default: true
 # ... other configuration
 elements:
   title:
     ruleTypes: [TEXTUNIT]

===Quote Mode===
Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:

quoteModeDefined: true
quoteMode: 3

'''Current quote modes:'''

* Do not escape single or double quotes: '''UNESCAPED = 0'''
* Escape single and double quotes to a named entity: '''ALL = 1'''
* Escape double quotes to a named entity, and single quotes to a numeric entity: '''NUMERIC_SINGLE_QUOTES = 2'''
* Escape double quotes only: '''DOUBLE_QUOTES_ONLY = 3'''
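
To make the four modes concrete, here is a small Python sketch of the escaping each one describes. It is only an illustration: the exact entity strings (<code>&amp;quot;</code>, <code>&amp;apos;</code> and <code>&amp;#39;</code>) are assumptions about what "named" and "numeric" entities mean here, and <code>escape_quotes</code> is a hypothetical helper, not part of the filter.

```python
# Quote mode number -> replacement table (entity strings are assumptions).
QUOTE_MODES = {
    0: {},                               # UNESCAPED
    1: {'"': "&quot;", "'": "&apos;"},   # ALL
    2: {'"': "&quot;", "'": "&#39;"},    # NUMERIC_SINGLE_QUOTES
    3: {'"': "&quot;"},                  # DOUBLE_QUOTES_ONLY
}


def escape_quotes(text, quote_mode):
    """Escape quote characters in text according to the given quote mode."""
    return text.translate(str.maketrans(QUOTE_MODES[quote_mode]))


for mode in sorted(QUOTE_MODES):
    print(mode, escape_quotes('say "it\'s"', mode))
```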

==Limitations==

* In the current version of the filter the content of <code><style></code> and <code><script></code> elements is not extracted.
* Tags from server-side scripts such as PHP, ASPX, JSP, etc. are not formally supported and will be treated as non-translatable.

[[Category:Filters]]

Full-Width Conversion Step

2015-10-07T12:21:42Z

Amake: /* Parameters */ Note availability of new options

{{Steps Header}}
__TOC__
==Overview==

This step converts characters in text units from or to full-width form.

Takes: Filter events. Sends: Filter events.

For historical reasons, some Asian character sets have two display forms for some characters: half-width and full-width. This step allows you to convert from one form to the other. The modification is done in the text of the text units for the specified target locale. If there is no text for the specified target, the source text is copied to the target and processed.

==Parameters==

<cite>Convert full width characters to half-width or ASCII equivalents</cite> — Select this option to convert all full-width characters to their half-width or ASCII equivalent. For example, the character 'Ｑ' (U+FF31) is converted to 'Q' (U+0051) and the character 'サ' (U+30B5) is converted to 'ｻ' (U+FF7B).

Additional non-full-width characters can also be converted:

<cite>Include Squared Latin Abbreviations of the CJK Compatibility block</cite> — Set this option to also convert the Squared Latin Abbreviations of the CJK Compatibility block into sequences of non-CJK characters. For example '㏀' (U+33C0) to "kΩ" (U+006B, U+03A9).

<cite>Include special characters of the Letter-Like Symbols block</cite> — Set this option to also convert several characters of the Letter-Like Symbols block to character sequences. The conversions are shown in the following table:

{| border="1" cellpadding="4" cellspacing="0"
|+
| '''Letter-Like Symbol'''
| '''Character sequence'''
|-
|U+2100||a/c
|-
|U+2101||a/s
|-
|U+2105||c/o
|-
|U+2103||°C
|-
|U+2109||°F
|-
|U+2116||No
|-
|U+212A||K
|-
|U+212B||Å
|}
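
The sequences in this table, and the squared abbreviation example above, coincide with the Unicode compatibility decompositions of these characters. The Python sketch below checks this with the standard <code>unicodedata</code> module. It is offered only as a way to explore the mappings; whether the step itself relies on Unicode normalization or on its own lookup table is not stated here, and <code>to_sequence</code> is a hypothetical helper.

```python
import unicodedata

# Code point -> character sequence, as listed in the table above.
LETTER_LIKE = {
    "\u2100": "a/c",
    "\u2101": "a/s",
    "\u2105": "c/o",
    "\u2103": "\u00b0C",
    "\u2109": "\u00b0F",
    "\u2116": "No",
    "\u212a": "K",
    "\u212b": "\u00c5",
}


def to_sequence(symbol):
    """Return the NFKC (compatibility, then composed) form of a symbol."""
    return unicodedata.normalize("NFKC", symbol)


for symbol, sequence in LETTER_LIKE.items():
    print(f"U+{ord(symbol):04X}", to_sequence(symbol) == sequence)

# Squared Latin abbreviation from the CJK Compatibility block (U+33C0).
print(to_sequence("\u33c0") == "k\u03a9")
```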

<cite>Include Japanese Katakana and associated punctuation</cite> — Set this option to convert Japanese Katakana and associated punctuation (。、「」, etc.) into their half-width forms. This is a separate option (and off by default) in order to facilitate normalizing modern Japanese text: Japanese text may contain full-width alphanumeric characters that should be normalized to half-width, while Katakana should remain full-width. (Available in 0.29-SNAPSHOT and later)

<cite>Convert half-width and ASCII characters to full width equivalents</cite> — Select this option to convert all half-width and ASCII characters to their full-width equivalent. For example, the character 'Q' (U+0051) is converted to 'Ｑ' (U+FF31) and the character 'ｻ' (U+FF7B) is converted to 'サ' (U+30B5).

<cite>Convert only the ASCII characters</cite> — Set this option to convert only the ASCII characters to full-width. When this option is set, only ASCII characters are affected; half-width characters are left half-width.
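
For the ASCII range, the two conversions are simple code point shifts: the full-width forms U+FF01 to U+FF5E mirror the ASCII characters U+0021 to U+007E at a fixed offset of 0xFEE0. The Python sketch below shows this under stated assumptions: it is not the step's code, the function names are hypothetical, and mapping the space (U+0020) to the ideographic space (U+3000) is an assumption of the example.

```python
OFFSET = 0xFEE0  # U+FF01..U+FF5E mirror U+0021..U+007E


def to_half_width(text):
    """Convert full-width ASCII variants to plain ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - OFFSET))
        elif cp == 0x3000:  # ideographic space (assumed to be converted)
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


def to_full_width(text):
    """Convert ASCII characters to their full-width variants."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x21 <= cp <= 0x7E:
            out.append(chr(cp + OFFSET))
        elif cp == 0x20:  # space (assumed to become the ideographic space)
            out.append("\u3000")
        else:
            out.append(ch)  # half-width Katakana and others are left alone
    return "".join(out)


print(to_half_width("\uff31"))          # Q
print(to_full_width("Q") == "\uff31")   # True
```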

<cite>Convert only Japanese Katakana and associated punctuation</cite> — Set this option to convert Japanese Katakana and associated punctuation (｡､｢｣, etc.) into their full-width forms. This is a separate option in order to facilitate normalizing modern Japanese text: Japanese text may contain half-width Katakana that should be converted to full-width, while alphanumeric characters should remain half-width. (Available in 0.29-SNAPSHOT and later)

<cite>Normalize output</cite> — Apply Unicode NFC normalization to the output text (if any conversions are made). Converting half-width forms to full-width can result in decomposed forms, for instance ﾌﾟ (U+FF8C U+FF9F) → プ (U+30D5 U+309A). Normalization ensures that the standard representation is used: プ (U+30D7). (Available in 0.29-SNAPSHOT and later)
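
The effect of NFC normalization on this example can be reproduced with Python's standard <code>unicodedata</code> module. This is only a demonstration of the Unicode behavior described above, not of the step's implementation.

```python
import unicodedata

# Full-width FU followed by the combining semi-voiced sound mark.
decomposed = "\u30d5\u309a"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u30d7")            # True
```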

==Limitations==

None known.

[[Category:Steps]]

Open Standards

2015-09-11T12:53:09Z

Amake: /* GMX - Global information management Metrics eXchange */ Update to GMX-V 2.0

__TOC__
The localization and translation industry uses several standards to exchange data between tools. It is very important for tools to support such standards.

* They prevent your data from being locked into proprietary formats.
* Using standards also allows you to approach the translation process with a broader choice of options and more flexibility.

The applications and components of the Okapi Framework support standards when possible.

==XLIFF - XML Localisation Interchange File Format==

Maintained by the XLIFF Technical Committee at OASIS, XLIFF provides a common markup language for extracted localizable text.

* [http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html XLIFF 1.2 specification]
* [http://www.oasis-open.org/committees/xliff/ The OASIS XLIFF Technical Committee home page]
* [[XLIFF|An overview of XLIFF]]

Many components of the framework use XLIFF. The framework also includes an [[XLIFF Filter]].

==TMX - Translation Memory eXchange==

TMX covers the exchange of translation memory data.

TMX was originally maintained by the OSCAR Committee at LISA. In March 2011 LISA was closed. The OSCAR standards have been put under Creative Commons license and the specifications moved to new hosts.

* [http://www.gala-global.org/oscarStandards/tmx/tmx14b.html TMX 1.4b specification]
* [[TMX|An overview of TMX]]

Many components of the framework use TMX. The framework also includes a [[TMX Filter]].

==SRX - Segmentation Rules eXchange==

SRX addresses the exchange of segmentation rules between tools. Version 1.0 of SRX has been implemented in different ways by different tools and has limited usage for exchange. Version 2.0 of SRX has been implemented with better consistency.

SRX was originally maintained by the OSCAR Committee at LISA. In March 2011 LISA was closed. The OSCAR standards have been put under Creative Commons license and the specifications moved to new hosts.

* [http://www.gala-global.org/oscarStandards/srx/srx20.html SRX 2.0 specification]
* [[SRX|An overview of SRX]]

The segmentation engine provided in the framework implements SRX 2.0. You can see it in action in [[Ratel|Ratel, the framework's editor to create and maintain SRX documents]].

==TBX - Term Base eXchange==

TBX is designed to allow the exchange of terminology databases between tools. TBX is the same as '''ISO 30042'''. Because TBX is quite complex, its adoption has been slow and OSCAR has come up with '''TBX-Basic''', a subset of the more general TBX.

TBX was originally maintained by the OSCAR Committee at LISA. In March 2011 LISA was closed. The OSCAR standards have been put under Creative Commons license and the specifications moved to new hosts.

* [http://www.gala-global.org/oscarStandards/tbx/tbx_oscar.pdf TBX Specification]
* [http://www.gala-global.org/oscarStandards/tbx/tbx-basic.html Information for TBX-Basic]

The [[Quality Check Step]], which is also used in [[CheckMate]], supports TBX as one of its glossary formats.

==ITS - Internationalization Tag Set==

ITS is a W3C namespace that provides internationalization information and support in XML documents.

* [http://www.w3.org/TR/its/ ITS 1.0 specification]
* [http://www.w3.org/International/its/ig/ The W3C ITS Interest Group home page]
* [http://www.w3.org/International/its/ig/simple-example.html Examples of XML documents with ITS markup]
* [[ITS|An overview of ITS]]

Several components of the framework support and use ITS. See the [[ITS Components]] page for details.

Related to ITS, the [http://www.w3.org/TR/xml-i18n-bp/ Best Practices for XML Internationalization W3C Note] can help you design and author XML documents so that they are easier to localize.

==GMX - Global information management Metrics eXchange==

GMX is a family of standards of globalization and localization-related metrics. The three components of GMX are:

* Volume (V). Global Information Management Metrics Volume addresses the issue of quantifying the workload for a given localization or translation task. GMX-V provides a standard and more precise definition of the statistics necessary to assess the quantity of text (and costs) associated with language-related globalization tasks.
* Complexity (C) (proposed). GMX-C will provide a standard metric for the assessment of textual complexity with regard to globalization tasks. This format has not yet been defined.
* Quality (Q) (proposed). GMX-Q will provide a standard format for the specification of quality requirements for globalization tasks, thus allowing quality expectations to be specified in contracts and other agreements and verified. This format has not yet been defined.

GMX was originally maintained by the OSCAR Committee at LISA. In March 2011 LISA was closed. The OSCAR standards have been put under Creative Commons license and the specifications moved to new hosts.

* [http://www.xtm-intl.com/manuals/gmx-v/GMX-V-2.0.html GMX-V 2.0 specification]

Steps such as the [[Word Count Step]], [[Character Count Step]], and the [[Scoping Report Step]] provided in the framework use GMX-V 2.0.

==OAXAL - Open Architecture for XML Authoring and Localization==

Maintained by OASIS, OAXAL is a reference architecture that describes a processing model for authoring and localizing XML documents using open standards.

* [http://www.oasis-open.org/committees/download.php/35736/OASIS%20Open%20Architecture%20for%20XML%20Authoring%20and%20Localization%20Reference%20Model%20%28OAXAL%29.pdf OAXAL 1.0 specification]

===OAXAL 1.0 Conformance Statement===

This statement confirms that the Okapi Framework is an OAXAL 1.0 Level 2 compliant application as per the [http://wiki.oasis-open.org/oaxal/#A4ConformanceGuidelines OAXAL Reference Architecture 1.0 Specification conformance requirements], implementing the following constituent standards:

* W3C ITS 1.0
* OASIS XLIFF 1.2
* LISA TMX 1.4b
* LISA SRX 2.0
* LISA TBX 1.0
* LISA GMX/V 1.0

[[Category:GMX]] [[Category:ITS]] [[Category:SRX]] [[Category:TMX]] [[Category:XLIFF]]

Character Count Step

2015-09-11T12:49:33Z

Amake: Add GMX category

{{Steps Header}}
__TOC__
==Overview==

This step performs character counts on the different parts of a set of documents.

Takes: Filter events. Sends: Filter events.

The character counts are saved as annotations that can be used in other steps. For example, the [[Scoping Report Step]] can generate a report from the character counts.

The character count generated follows the [http://www.xtm-intl.com/manuals/gmx-v/GMX-V-2.0.html GMX-V 2.0 standard].

==Parameters==

A character count annotation is always set on each text unit. In addition, you can set annotations for other resources:

<cite>Batches</cite> — Set this option to get an annotation of the character count per batch.

<cite>Batch items</cite> — Set this option to get an annotation of the character count per batch item.

<cite>Documents</cite> — Set this option to get an annotation of the character count per document.

<cite>Sub-documents</cite> — Set this option to get an annotation of the character count per sub-document.

<cite>Groups</cite> — Set this option to get an annotation of the character count per group.

==Limitations==

None known.

[[Category:Steps]] [[Category:GMX]]

Word Count Step

2015-09-11T12:49:17Z

Amake: Add GMX category

{{Steps Header}}
__TOC__
==Overview==

This step performs word counts on the different parts of a set of documents.

Takes: Filter events. Sends: Filter events.

The word counts are saved as annotations that can be used in other steps. For example, the [[Scoping Report Step]] can generate a report from the word counts.

The word count generated follows the [http://www.xtm-intl.com/manuals/gmx-v/GMX-V-2.0.html GMX-V 2.0 standard].

==Parameters==

A word count annotation is always set on each text unit. In addition, you can set annotations for other resources:

<cite>Batches</cite> — Set this option to get an annotation of the word count per batch.

<cite>Batch items</cite> — Set this option to get an annotation of the word count per batch item.

<cite>Documents</cite> — Set this option to get an annotation of the word count per document.

<cite>Sub-documents</cite> — Set this option to get an annotation of the word count per sub-document.

<cite>Groups</cite> — Set this option to get an annotation of the word count per group.

==Limitations==

None known.

[[Category:Steps]] [[Category:GMX]]

Scoping Report Step

2015-09-11T12:47:04Z

Amake: Document character count fields

{{Steps Header}}
__TOC__
==Overview==

This step creates a template-based report on various counts (word count, character count, etc.) and optionally leveraged data.

Takes: Filter events. Sends: Filter events.

In order to have leveraging statistics with this step, your pipeline needs to include, prior to this step, one or more steps that leverage translations, such as the [[Leveraging Step]]. Some filters, such as the [[XLIFF Filter]], may also generate resources with leveraged data. To generate only word- or character-count annotations, without a report, use the [[Word Count Step]] or [[Character Count Step]].

For a list of the types of matches possible in the counts, see the "[[Match Types]]" page.

==Parameters==

<cite>Project name</cite> — Enter the name that is placed in the title of the report.

<cite>Custom template</cite> — Enter the URI or the full path of the custom template to be used to generate the report. If the custom template field is left empty, or if the specified URI is not found, the default template is used.

<cite>Output path</cite> — Enter the full path of the report file to generate. You can use the <code>${rootDir}</code> variable, as well as any of the [[Template:Locales Variables|source or target locale variables]] (<code>${srcLoc}</code>, <code>${trgLoc}</code>, etc.).

==Templates==

Templates let the Scoping Report Step generate reports laid out the way you want. Templates can currently be written in plain text or HTML.
The Scoping Report Step includes a default HTML report that displays general information about the project and its items. You can specify your own custom report with the step parameter <cite>Custom template</cite>.

Templates contain text and report fields. Report fields are enclosed in brackets. Table rows are enclosed in brackets around a row of column fields. A template can look like this:

<pre>
Project Name: [PROJECT_NAME]
Creation Date: [PROJECT_DATE]
Target Locale: [PROJECT_TARGET_LOCALE]

File,Exact Previous Version Matches,Exact Local Context Matches,100% Matches,Fuzzy Matches,Repetitions,Total,
[[ITEM_NAME],[ITEM_EXACT_PREVIOUS_VERSION],[ITEM_EXACT_LOCAL_CONTEXT],[ITEM_EXACT],[ITEM_FUZZY],[ITEM_GMX_REPETITION_MATCHED_WORD_COUNT],[ITEM_TOTAL_WORD_COUNT],]
Total,[PROJECT_EXACT_PREVIOUS_VERSION],[PROJECT_EXACT_LOCAL_CONTEXT],[PROJECT_EXACT],[PROJECT_FUZZY],[PROJECT_GMX_REPETITION_MATCHED_WORD_COUNT],[PROJECT_TOTAL_WORD_COUNT]
</pre>

This template will produce something similar to this:

<pre>
Project Name: Community website
Creation Date: 17.03.2011 23:21:23 CET
Target Locale: fr-ca

File,Exact Previous Version Matches,Exact Local Context Matches,100% Matches,Fuzzy Matches,Repetitions,Total,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\aa324.html,10,23,12,57,132,23,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\form.html,31,22,13,17,19,17,
D:\SVN\OKAPI\steps\scopingreport\target\test-classes\net\sf\okapi\steps\scopingreport\W3CHTMHLTest1.html,10,23,12,57,12,54,
Total,210,323,512,357,312,154
</pre>
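
The expansion shown above can be mimicked with a few lines of Python. This is a simplified sketch of the idea, not the step's template engine: the <code>fill</code> and <code>render</code> functions are hypothetical, a table row is assumed to occupy a whole line wrapped in one extra pair of brackets, and fields with no value are assumed to be reported as zero.

```python
import re

FIELD = re.compile(r"\[([A-Z_]+)\]")


def fill(text, values):
    """Replace each [FIELD] with its value; unknown fields become 0 (assumption)."""
    return FIELD.sub(lambda m: str(values.get(m.group(1), 0)), text)


def render(template, project, items):
    """Fill project fields once and repeat each table row for every item."""
    out = []
    for line in template.splitlines():
        # Assumption: a table row is a whole line wrapped in extra brackets.
        if line.startswith("[[") and line.endswith("]"):
            row = line[1:-1]
            out.extend(fill(row, item) for item in items)
        else:
            out.append(fill(line, project))
    return "\n".join(out)


template = (
    "Project Name: [PROJECT_NAME]\n"
    "File,Total,\n"
    "[[ITEM_NAME],[ITEM_TOTAL_WORD_COUNT],]\n"
    "Total,[PROJECT_TOTAL_WORD_COUNT]"
)
project = {"PROJECT_NAME": "Community website", "PROJECT_TOTAL_WORD_COUNT": 40}
items = [
    {"ITEM_NAME": "a.html", "ITEM_TOTAL_WORD_COUNT": 23},
    {"ITEM_NAME": "b.html", "ITEM_TOTAL_WORD_COUNT": 17},
]
print(render(template, project, items))
```

The row line is emitted once per item, while every other line is filled once from the project-level values.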

==Report fields==

Templates should contain placeholders for calculable report data. Those placeholders are called report fields and are filled in automatically by the Scoping Report Step.

Note that most field values are calculated by separate steps, e.g. the [[Word Count Step]], [[Character Count Step]], or [[Leveraging Step]]. The Scoping Report Step is essentially a presentation layer that displays information provided by other steps. If you forget to include a required step in your pipeline, you will see zeros in the generated report.

Report fields can contain word or character counts for the entire project or for an individual item in the project. These fields are prefixed with <code>PROJECT_</code> and <code>ITEM_</code> respectively.

The tables below show how report fields are related to count categories, and list example steps that provide information for related word or character counts.

====General project fields====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''Description'''
|- valign="top"
| PROJECT_NAME || || Name of the project as set in the step parameters.
|- valign="top"
| PROJECT_DATE || || Date and time when the report was generated.
|- valign="top"
| PROJECT_SOURCE_LOCALE || || Source locale, obtained automatically.
|- valign="top"
| PROJECT_TARGET_LOCALE || || Target locale, obtained automatically.
|- valign="top"
| PROJECT_TOTAL_WORD_COUNT || Word Count Step || Total number of words, both translatable and non-translatable, in all items of the project.
|- valign="top"
| PROJECT_TOTAL_CHARACTER_COUNT || Character Count Step || Total number of characters, excluding whitespace and punctuation, both translatable and non-translatable, in all items of the project.
|- valign="top"
| PROJECT_WHITESPACE_CHARACTER_COUNT || Character Count Step || Total number of whitespace characters, both translatable and non-translatable, in all items of the project.
|- valign="top"
| PROJECT_PUNCTUATION_CHARACTER_COUNT || Character Count Step || Total number of punctuation characters, both translatable and non-translatable, in all items of the project.
|- valign="top"
| PROJECT_OVERALL_CHARACTER_COUNT || Character Count Step || Total number of characters, including whitespace and punctuation, both translatable and non-translatable, in all items of the project.
|}

====General item fields====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''Description'''
|- valign="top"
| ITEM_NAME || || Name of the item (full file name).
|- valign="top"
| ITEM_SOURCE_LOCALE || || Source locale, obtained automatically.
|- valign="top"
| ITEM_TARGET_LOCALE || || Target locale, obtained automatically.
|- valign="top"
| ITEM_TOTAL_WORD_COUNT || Word Count Step || Total number of words, both translatable and non-translatable, in the current item.
|- valign="top"
| ITEM_TOTAL_CHARACTER_COUNT || Character Count Step || Total number of characters, excluding whitespace and punctuation, both translatable and non-translatable, in the current item.
|- valign="top"
| ITEM_WHITESPACE_CHARACTER_COUNT || Character Count Step || Total number of whitespace characters, both translatable and non-translatable, in the current item.
|- valign="top"
| ITEM_PUNCTUATION_CHARACTER_COUNT || Character Count Step || Total number of punctuation characters, both translatable and non-translatable, in the current item.
|- valign="top"
| ITEM_OVERALL_CHARACTER_COUNT || Character Count Step || Total number of characters, including whitespace and punctuation, both translatable and non-translatable, in the current item.
|}

====Project fields for Okapi count categories====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''Okapi word count category''' || '''Description'''
|- valign="top"
| PROJECT_EXACT_UNIQUE_ID || Leveraging Step || EXACT_UNIQUE_ID || Matches EXACT and matches a unique id.
|- valign="top"
| PROJECT_EXACT_PREVIOUS_VERSION || Leveraging Step || EXACT_PREVIOUS_VERSION || Matches EXACT and comes from the preceding version of the same document (i.e., if v4 is leveraged this match must come from v3, not v2 or v1).
|- valign="top"
| PROJECT_EXACT_LOCAL_CONTEXT || Leveraging Step || EXACT_LOCAL_CONTEXT || Matches EXACT and a small number of segments before and/or after.
|- valign="top"
| PROJECT_EXACT_DOCUMENT_CONTEXT || Repetition Analysis Step || EXACT_DOCUMENT_CONTEXT || Matches EXACT and comes from the same document.
|- valign="top"
| PROJECT_EXACT_STRUCTURAL || Leveraging Step || EXACT_STRUCTURAL || Matches EXACT and the structural type of the segment (title, paragraph, list element etc.)
|- valign="top"
| PROJECT_EXACT || Leveraging Step || EXACT || Matches text and codes exactly.
|- valign="top"
| PROJECT_EXACT_TEXT_ONLY_UNIQUE_ID || Leveraging Step || EXACT_TEXT_ONLY_UNIQUE_ID || Matches EXACT_TEXT_ONLY and matches a unique id.
|- valign="top"
| PROJECT_EXACT_TEXT_ONLY_PREVIOUS_VERSION || Leveraging Step || EXACT_TEXT_ONLY_PREVIOUS_VERSION || Matches EXACT_TEXT_ONLY and comes from a previous version of the same document.
|- valign="top"
| PROJECT_EXACT_TEXT_ONLY || Leveraging Step || EXACT_TEXT_ONLY || Matches text exactly, but there is a difference in one or more codes.
|- valign="top"
| PROJECT_EXACT_REPAIRED || Leveraging Step || EXACT_REPAIRED || Matches text and codes exactly, but only after the result of some automated repair (e.g. number replacement, code repair, capitalization, punctuation etc.)
|- valign="top"
| PROJECT_FUZZY_UNIQUE_ID || Leveraging Step || FUZZY_UNIQUE_ID || Matches FUZZY and matches a unique id.
|- valign="top"
| PROJECT_FUZZY_PREVIOUS_VERSION || Leveraging Step || FUZZY_PREVIOUS_VERSION || Matches FUZZY and comes from a previous version of the same document.
|- valign="top"
| PROJECT_FUZZY || Leveraging Step || FUZZY || Matches both text and/or codes partially.
|- valign="top"
| PROJECT_FUZZY_REPAIRED || Leveraging Step || FUZZY_REPAIRED || Matches both text and/or codes partially and some automated repair (e.g. number replacement, code repair, capitalization, punctuation etc.) was applied to the target.
|- valign="top"
| PROJECT_PHRASE_ASSEMBLED || - || PHRASE_ASSEMBLED || Matches assembled from phrases in the TM or other resources (different algorithms could be used).
|- valign="top"
| PROJECT_MT || Leveraging Step || MT || Indicates a translation coming from an MT engine.
|- valign="top"
| PROJECT_CONCORDANCE || - || CONCORDANCE || TM concordance or phrase match (usually a word or term only)
|- valign="top"
| PROJECT_NOCATEGORY || || n/a || Does not match any of the Okapi word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count.
|- valign="top"
| PROJECT_NONTRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match any of the non-translatable Okapi word count categories.
|- valign="top"
| PROJECT_TRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match none of the non-translatable Okapi word count categories, and thus need translation.
|}
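
The <code>NOCATEGORY</code> value described in the table is a simple subtraction. The Python sketch below spells it out with made-up numbers; the function name and the sample counts are purely illustrative.

```python
def no_category(total_word_count, category_counts):
    """Words left over once every categorized word has been subtracted."""
    return total_word_count - sum(category_counts.values())


# Hypothetical counts for one project.
counts = {"EXACT": 12, "FUZZY": 57, "MT": 5}
print(no_category(154, counts))  # 80
```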

Character count categories are also available; replace <code>WORD</code> with <code>CHARACTER</code> or add the suffix <code>_CHARACTER</code> to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.

====Item fields for Okapi count categories====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''Okapi word count category''' || '''Description'''
|- valign="top"
| ITEM_EXACT_UNIQUE_ID || Leveraging Step || EXACT_UNIQUE_ID || Matches EXACT and matches a unique id.
|- valign="top"
| ITEM_EXACT_PREVIOUS_VERSION || Leveraging Step || EXACT_PREVIOUS_VERSION || Matches EXACT and comes from the preceding version of the same document (i.e., if v4 is leveraged this match must come from v3, not v2 or v1).
|- valign="top"
| ITEM_EXACT_LOCAL_CONTEXT || Leveraging Step || EXACT_LOCAL_CONTEXT || Matches EXACT and a small number of segments before and/or after.
|- valign="top"
| ITEM_EXACT_DOCUMENT_CONTEXT || Repetition Analysis Step || EXACT_DOCUMENT_CONTEXT || Matches EXACT and comes from the same document.
|- valign="top"
| ITEM_EXACT_STRUCTURAL || Leveraging Step || EXACT_STRUCTURAL || Matches EXACT and the structural type of the segment (title, paragraph, list element etc.)
|- valign="top"
| ITEM_EXACT || Leveraging Step || EXACT || Matches text and codes exactly.
|- valign="top"
| ITEM_EXACT_TEXT_ONLY_UNIQUE_ID || Leveraging Step || EXACT_TEXT_ONLY_UNIQUE_ID || Matches EXACT_TEXT_ONLY and matches a unique id.
|- valign="top"
| ITEM_EXACT_TEXT_ONLY_PREVIOUS_VERSION || Leveraging Step || EXACT_TEXT_ONLY_PREVIOUS_VERSION || Matches EXACT_TEXT_ONLY and comes from a previous version of the same document.
|- valign="top"
| ITEM_EXACT_TEXT_ONLY || Leveraging Step || EXACT_TEXT_ONLY || Matches text exactly, but there is a difference in one or more codes.
|- valign="top"
| ITEM_EXACT_REPAIRED || Leveraging Step || EXACT_REPAIRED || Matches text and codes exactly, but only after the result of some automated repair (e.g. number replacement, code repair, capitalization, punctuation etc.)
|- valign="top"
| ITEM_FUZZY_UNIQUE_ID || Leveraging Step || FUZZY_UNIQUE_ID || Matches FUZZY and matches a unique id.
|- valign="top"
| ITEM_FUZZY_PREVIOUS_VERSION || Leveraging Step || FUZZY_PREVIOUS_VERSION || Matches FUZZY and comes from a previous version of the same document.
|- valign="top"
| ITEM_FUZZY || Leveraging Step || FUZZY || Matches text and/or codes partially.
|- valign="top"
| ITEM_FUZZY_REPAIRED || Leveraging Step || FUZZY_REPAIRED || Matches text and/or codes partially, and some automated repair (e.g. number replacement, code repair, capitalization, punctuation, etc.) was applied to the target.
|- valign="top"
| ITEM_PHRASE_ASSEMBLED || - || PHRASE_ASSEMBLED || Matches assembled from phrases in the TM or other resources (different algorithms could be used).
|- valign="top"
| ITEM_MT || Leveraging Step || MT || Indicates a translation coming from an MT engine.
|- valign="top"
| ITEM_CONCORDANCE || - || CONCORDANCE || TM concordance or phrase match (usually a word or term only)
|- valign="top"
| ITEM_NOCATEGORY || || n/a || Does not match any of the Okapi word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count.
|- valign="top"
| ITEM_NONTRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match any of the non-translatable Okapi word count categories.
|- valign="top"
| ITEM_TRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match none of the non-translatable Okapi word count categories, and thus need translation.
|}

Character count categories are also available; replace <code>WORD</code> with <code>CHARACTER</code> or add the suffix <code>_CHARACTER</code> to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
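The naming rule above can be sketched in a few lines of Python. This is only one reading of the rule (replace <code>WORD</code> when the field name contains it, otherwise append <code>_CHARACTER</code>); the helper name <code>character_field</code> is hypothetical and not part of Okapi, so check the resulting names against the fields your report actually provides.

```python
def character_field(word_field):
    """Derive a character-count field name from a word-count field name.

    Assumption: names containing WORD get WORD replaced by CHARACTER,
    all other names get the _CHARACTER suffix.
    """
    if "WORD" in word_field:
        return word_field.replace("WORD", "CHARACTER")
    return word_field + "_CHARACTER"


print(character_field("ITEM_TRANSLATABLE_WORD_COUNT"))
print(character_field("ITEM_EXACT"))
```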

====Project fields for GMX count categories====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''GMX word count category''' || '''Description'''
|- valign="top"
| PROJECT_GMX_PROTECTED_WORD_COUNT || || ProtectedWordCount || An accumulation of the word count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <code><mrk mtype="protected"></code> elements).
|- valign="top"
| PROJECT_GMX_EXACT_MATCHED_WORD_COUNT || Leveraging Step || ExactMatchedWordCount || An accumulation of the word count for text units that have been matched unambiguously with a prior translation and thus require no translator input.
|- valign="top"
| PROJECT_GMX_LEVERAGED_MATCHED_WORD_COUNT || Leveraging Step || LeveragedMatchedWordCount || An accumulation of the word count for text units that have been matched against a leveraged translation memory database.
|- valign="top"
| PROJECT_GMX_REPETITION_MATCHED_WORD_COUNT || Repetition Analysis Step || RepetitionMatchedWordCount || An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
|- valign="top"
| PROJECT_GMX_FUZZY_MATCHED_WORD_COUNT || Leveraging Step || FuzzyMatchedWordCount || An accumulation of the word count for text units that have been fuzzy matched against a leveraged translation memory database.
|- valign="top"
| PROJECT_GMX_ALPHANUMERIC_ONLY_TEXT_UNIT_WORD_COUNT || || AlphanumericOnlyTextUnitWordCount || An accumulation of the word count for text units that have been identified as containing only alphanumeric words.
|- valign="top"
| PROJECT_GMX_NUMERIC_ONLY_TEXT_UNIT_WORD_COUNT || || NumericOnlyTextUnitWordCount || An accumulation of the word count for text units that have been identified as containing only numeric words.
|- valign="top"
| PROJECT_GMX_MEASUREMENT_ONLY_TEXT_UNIT_WORD_COUNT || || MeasurementOnlyTextUnitWordCount || An accumulation of the word count from measurement-only text units.
|- valign="top"
| PROJECT_GMX_NOCATEGORY || || n/a || Does not match any of the GMX word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count.
|- valign="top"
| PROJECT_GMX_NONTRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match any of the non-translatable GMX word count categories.
|- valign="top"
| PROJECT_GMX_TRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match none of the non-translatable GMX word count categories, and thus need translation.
|}

Character count categories are also available; replace <code>WORD</code> with <code>CHARACTER</code> or add the suffix <code>_CHARACTER</code> to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.

====Item fields for GMX count categories====

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Report field''' || '''Example of provider''' || '''GMX word count category''' || '''Description'''
|- valign="top"
| ITEM_GMX_PROTECTED_WORD_COUNT || || ProtectedWordCount || An accumulation of the word count for text that has been marked as 'protected', or otherwise not translatable (XLIFF text enclosed in <code><mrk mtype="protected"></code> elements).
|- valign="top"
| ITEM_GMX_EXACT_MATCHED_WORD_COUNT || Leveraging Step || ExactMatchedWordCount || An accumulation of the word count for text units that have been matched unambiguously with a prior translation and thus require no translator input.
|- valign="top"
| ITEM_GMX_LEVERAGED_MATCHED_WORD_COUNT || Leveraging Step || LeveragedMatchedWordCount || An accumulation of the word count for text units that have been matched against a leveraged translation memory database.
|- valign="top"
| ITEM_GMX_REPETITION_MATCHED_WORD_COUNT || Repetition Analysis Step || RepetitionMatchedWordCount || An accumulation of the word count for repeating text units that have not been matched in any other form. Repetition matching is deemed to take precedence over fuzzy matching.
|- valign="top"
| ITEM_GMX_FUZZY_MATCHED_WORD_COUNT || Leveraging Step || FuzzyMatchedWordCount || An accumulation of the word count for text units that have been fuzzy matched against a leveraged translation memory database.
|- valign="top"
| ITEM_GMX_ALPHANUMERIC_ONLY_TEXT_UNIT_WORD_COUNT || || AlphanumericOnlyTextUnitWordCount || An accumulation of the word count for text units that have been identified as containing only alphanumeric words.
|- valign="top"
| ITEM_GMX_NUMERIC_ONLY_TEXT_UNIT_WORD_COUNT || || NumericOnlyTextUnitWordCount || An accumulation of the word count for text units that have been identified as containing only numeric words.
|- valign="top"
| ITEM_GMX_MEASUREMENT_ONLY_TEXT_UNIT_WORD_COUNT || || MeasurementOnlyTextUnitWordCount || An accumulation of the word count from measurement-only text units.
|- valign="top"
| ITEM_GMX_NOCATEGORY || || n/a || Does not match any of the GMX word count categories. This field is calculated by subtracting the sum of all words in all categories above from the total word count.
|- valign="top"
| ITEM_GMX_NONTRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match any of the non-translatable GMX word count categories.
|- valign="top"
| ITEM_GMX_TRANSLATABLE_WORD_COUNT || Word Count Step || n/a || Number of words that match none of the non-translatable GMX word count categories, and thus need translation.
|}

Character count categories are also available; replace <code>WORD</code> with <code>CHARACTER</code> or add the suffix <code>_CHARACTER</code> to the fields above to yield the character equivalent. Character counts exclude whitespace and punctuation characters.
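As a worked example of how the <code>NOCATEGORY</code> fields are derived, the following Python sketch subtracts the sum of the category counts from the total word count. The function name and all the numbers are made up for illustration; they do not come from a real report.

```python
def gmx_nocategory(total_word_count, category_counts):
    """Words that fall into none of the GMX categories.

    category_counts maps a GMX category name to its word count.
    """
    return total_word_count - sum(category_counts.values())


# Illustrative numbers only.
counts = {
    "ProtectedWordCount": 50,
    "ExactMatchedWordCount": 300,
    "LeveragedMatchedWordCount": 0,
    "RepetitionMatchedWordCount": 100,
    "FuzzyMatchedWordCount": 150,
    "AlphanumericOnlyTextUnitWordCount": 20,
    "NumericOnlyTextUnitWordCount": 30,
    "MeasurementOnlyTextUnitWordCount": 10,
}
print(gmx_nocategory(1000, counts))  # 1000 - 660 = 340
```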

==Limitations==

None known.

[[Category:Steps]] [[Category:GMX]]

Table Filter

2015-05-21T02:00:47Z

Amake: /* CSV Actions */ Document "add qualifiers" option

{{Filters Header}}
==Overview==

The Table Filter is an Okapi component that implements the IFilter interface for plain text documents. The filter is implemented in the class <code>net.sf.okapi.filters.table.TableFilter</code> of the library.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input file using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
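The decision above can be sketched in Python as follows. This is an illustration of the logic, not the filter's actual code; the function name <code>choose_input_encoding</code> and the list of recognized BOMs are assumptions.

```python
import codecs

# Longer BOMs are tested first: the UTF-32 LE BOM starts with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def choose_input_encoding(head, default_encoding):
    """Return the encoding implied by a BOM at the start of `head`
    (the first bytes of the file), or the default encoding if no BOM
    is present."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default_encoding
```

For example, a file starting with the bytes <code>EF BB BF</code> is read as UTF-8 whatever the default encoding is, while a file without a BOM falls back to the default set in the filter options.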

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
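Expressed as a small Python function, the rule for a UTF-8 output looks like this. The function name and the way the encoding name is compared are assumptions made for the sketch.

```python
def write_utf8_bom(input_encoding, input_had_bom):
    """Return True if a UTF-8 output document should start with a BOM.

    A BOM is written only when the input was also UTF-8 and had one.
    """
    normalized = input_encoding.replace("-", "").replace("_", "").lower()
    return normalized == "utf8" and input_had_bom
```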

===Line-Breaks===

The output uses the same type of line-break as the original input.

==Parameters==

===Table Tab===

====Table Type====

<cite>CSV</cite> — Select this option to work with formats where the columns are separated by a single character such as a comma, a semi-colon, a tab, etc.

<cite>TSV</cite> — Select this option to work with formats where the columns are separated by one or more tabs (i.e. two consecutive tabs do not mark an empty column). Note that for formats where the column separator is a single tab you should select <cite>CSV</cite> with a tab as the separator.

<cite>Fixed-width columns</cite> — Select this option to work with formats where each column has a fixed width.
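The difference between <cite>TSV</cite> and <cite>CSV</cite> with a tab separator can be illustrated with plain Python string handling. This sketch only mimics the splitting behavior described above; it is not the filter's code and the function names are hypothetical.

```python
import re


def split_csv_tab(line):
    # CSV with a tab as the separator: every tab marks a column boundary,
    # so two consecutive tabs enclose an empty column.
    return line.split("\t")


def split_tsv(line):
    # TSV: a run of one or more tabs counts as a single separator.
    return re.split(r"\t+", line)


line = "id\t\tHello"
print(split_csv_tab(line))  # ['id', '', 'Hello']
print(split_tsv(line))      # ['id', 'Hello']
```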

====Table Properties====

When the table file contains a header with column names and optionally other information, you can specify which line contains the column names, and at which line the actual table data start.

<cite>Values start at line</cite> — Specify the line number of the first table row (default 1: the data start at the beginning of the file and no header is present).

<cite>Line with column names</cite> — Specify the number of the line containing column names (default 0, i.e. no line with column names).

Lines are numbered from 1. The default settings (1 and 0) describe a table that has no header line with column names and whose data start at the beginning of the file.

If the first line of your table contains column names and the following lines (from line 2 on) contain the table data, as in most CSV files, specify 2 and 1 for these properties.
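The two properties can be pictured with the following Python sketch, which separates the line with column names from the data lines. The function and parameter names are hypothetical; only the 1-based numbering and the defaults (1 and 0) come from the description above.

```python
def split_table(lines, values_start_line=1, column_names_line=0):
    """Return (column names line or None, list of data lines).

    Lines are numbered from 1; column_names_line=0 means the table
    has no line with column names.
    """
    names = lines[column_names_line - 1] if column_names_line > 0 else None
    data = lines[values_start_line - 1:]
    return names, data


lines = ["ID,Text", "1,Hello", "2,World"]
print(split_table(lines, values_start_line=2, column_names_line=1))
# ('ID,Text', ['1,Hello', '2,World'])
```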

====CSV Options====

<cite>Field delimiter</cite> — Character separating fields in a row. Default is comma (,).

<cite>Text qualifier</cite> — Character before and after field value to allow field delimiters inside the field. For instance, this field will not be broken into parts though comma is a field delimiter: ["Field, containing comma"]. Default is the quotation mark (").

====CSV Escaping Mode====

If a field contains the active text qualifier (e.g. quotation mark), then all occurrences of that qualifier should be escaped. For instance, ["Text, ""quoted text"""] or ["Text, \"quoted text\""].

<cite>Duplicate qualifier</cite> — Escaping is performed by duplication of the active qualifier set in the CSV options group, e.g. ["Text, ""quoted text"""].

<cite>Backslash</cite> — Escaping is performed by prefixing all occurrences of the active qualifier with the backslash character (\), e.g. ["Text, \"quoted text\""].
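Both escaping modes can be tried out with Python's standard <code>csv</code> module, which offers equivalent settings (<code>doublequote</code> and <code>escapechar</code>). This illustrates the two conventions only; the Table Filter does not use this module, and the function name <code>parse_row</code> is hypothetical.

```python
import csv
import io


def parse_row(line, escaping="duplicate"):
    """Parse one CSV line using comma as delimiter and " as qualifier."""
    if escaping == "duplicate":
        # Qualifier escaped by doubling it: ""
        reader = csv.reader(io.StringIO(line), delimiter=",",
                            quotechar='"', doublequote=True)
    else:
        # Qualifier escaped with a backslash: \"
        reader = csv.reader(io.StringIO(line), delimiter=",",
                            quotechar='"', doublequote=False,
                            escapechar="\\")
    return next(reader)


print(parse_row('"Text, ""quoted text"""'))
print(parse_row('"Text, \\"quoted text\\""', escaping="backslash"))
```

In both cases the row contains the single field <code>Text, "quoted text"</code>.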

====CSV Actions====

<cite>Exclude qualifiers from extracted text</cite> — If selected, qualifiers are removed from the text and go to the TU skeleton.

<cite>Exclude leading/trailing white spaces from extracted text</cite> — if selected, then trimming of leading/trailing white spaces is performed based on the trimming mode:

* Only entries without qualifiers — only non-qualified field values are trimmed, leading and trailing spaces remain in qualified fields (e.g. [" text "] becomes [ text ], and [ non-qualified ] becomes [non-qualified] ).

* All — both non-qualified and qualified field values are trimmed of leading and trailing spaces (e.g. [" text "] becomes [text], and [ non-qualified ] becomes [non-qualified] ).

<cite>Add qualifiers to output when appropriate</cite> — If selected, upon output qualifiers will be added (if not already present) to any value that contains a field delimiter or line break as part of its textual content.
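The two trimming modes can be sketched as follows, with qualifiers excluded from the extracted text. The function name, the mode names and the assumption of a single-character qualifier are made up for this illustration.

```python
def extract_field(raw, qualifier='"', trim_mode="unqualified"):
    """Return the text extracted from one raw CSV field.

    trim_mode "unqualified": only non-qualified values are trimmed.
    trim_mode "all": qualified and non-qualified values are trimmed.
    """
    qualified = (len(raw) >= 2 and raw.startswith(qualifier)
                 and raw.endswith(qualifier))
    text = raw[1:-1] if qualified else raw
    if trim_mode == "all" or not qualified:
        text = text.strip()
    return text


print(repr(extract_field('" text "')))                   # ' text '
print(repr(extract_field(' non-qualified ')))            # 'non-qualified'
print(repr(extract_field('" text "', trim_mode="all")))  # 'text'
```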

====Extraction Mode====

If the table contains a header (i.e. one or more lines in the beginning of the file, containing description of the data, names of fields etc.), you can specify whether you want to extract the header data and/or data from the table body.

<cite>Extract header lines</cite> — When selected, you can choose among these options:

* Column names only — only column names will be sent as separate TextUnits, one for every column name.

* All — all header lines will be sent as TUs (the column names line will be sent as a series of TUs for every column name, other lines will be sent as one TU for every line).

<cite>Extract table data</cite> — When selected, TUs will be created for the table data (values in the table body), one TU for every row/column value.

===Columns Tab===

====Extraction Mode====

The extraction mode tells the filter which columns contain translatable text to be extracted and placed in text units. Text in the other columns is placed in the skeleton.

<cite>Extract from all columns</cite> — All columns contain translatable text.

<cite>Extract by column definitions</cite> — The filter detects the translatable text based on column definitions provided in the Column definitions table (see below).

====Number of Columns====

This group tells the filter how to detect the number of columns in a table.

<cite>Defined by values</cite> — The number of columns is detected for every individual row, not for the whole table. If rows contain different numbers of values, different numbers of TUs are sent for those rows.

<cite>Defined by column names</cite> — Number of columns in the table is determined by the number of column names. If the number of actual values in a row exceeds the number of column names, values in extra columns are dropped. If some expected data are missing in some rows, empty TUs are created for the missing columns data.

<cite>Fixed number of columns</cite> — Number of columns is explicitly specified by the spinner value (1-100, default 2). Extra columns are dropped, empty TUs are created for missing columns.
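When the number of columns is known (from the column names or from the fixed value), each row is adjusted to it. A minimal Python sketch of that adjustment, using a hypothetical <code>fit_row</code> helper and empty strings to stand for the empty TUs:

```python
def fit_row(values, column_count):
    """Drop values in extra columns and add empty values for missing ones."""
    return (values + [""] * column_count)[:column_count]


print(fit_row(["a", "b", "c"], 2))  # ['a', 'b']
print(fit_row(["a"], 3))            # ['a', '', '']
```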

====Column Definitions====

You can add or modify definitions for columns of your table.

Every column has a 1-based index and a type:

* Source — the column contains text in a source language.

* Source ID — the column provides a unique ID for a source column. This ID becomes the name of the created text unit resource.

* Target — the column contains text in target language for a given source column.

* Comment — the column contains a comment for a specified source column.

* Record ID — the column provides an ID for the current record (row).

Every row in the table is considered a record (a row can span several lines by means of text qualifiers). Every record can have a record ID (e.g. a database table primary key). A table can have not only several target columns for one source, but also several source columns. To tell the source columns apart, you can specify an ID suffix for each of them. If a given source column has no source ID attached (in a source ID column), the filter appends the ID suffix for that source column to the record ID to create the name of the text unit.
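The naming rule for text units can be summarized in a short Python sketch. The function name is hypothetical, and returning <code>None</code> when neither a source ID nor a record ID is available is an assumption: this page does not say what the filter does in that case.

```python
def text_unit_name(source_id, record_id, id_suffix=""):
    """Name of the text unit created for one source column value."""
    if source_id:
        # A source ID column provides the name directly.
        return source_id
    if record_id:
        # Otherwise the record ID plus the ID suffix of the source column.
        return record_id + id_suffix
    return None  # assumption: no name when no ID is available


print(text_unit_name(None, "42", "_title"))   # 42_title
print(text_unit_name("msg.title", "42", "_title"))  # msg.title
```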

===Options Tab===

====Text Unit Processing====

<cite>Allow trimming</cite> — For CSV table type this option works together with <cite>CSV actions</cite> - <cite>Exclude leading/trailing white spaces from extracted text</cite>.

* Trim leading spaces and tabs - if selected, extracted text is trimmed left.

* Trim trailing spaces and tabs - if selected, extracted text is trimmed right.

<cite>Convert \t \n \\ \uXXXX into characters</cite> — If selected, escape sequences are converted to regular characters.
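The conversion of escape sequences can be sketched in Python as below. It handles exactly the four sequences named in the option label; the function name is hypothetical and the filter's own implementation may differ in details such as invalid sequences.

```python
import re

_SIMPLE = {"t": "\t", "n": "\n", "\\": "\\"}
# One pass over the text, so that "\\t" (escaped backslash followed by t)
# is not mistaken for a tab.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|[tn\\])")


def unescape(text):
    """Convert \\t, \\n, \\\\ and \\uXXXX sequences into characters."""
    def repl(match):
        token = match.group(1)
        if token[0] == "u":
            return chr(int(token[1:], 16))
        return _SIMPLE[token]
    return _ESCAPE.sub(repl, text)
```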

====Multi-Line Text Units====

When extracted text is multi-line, this group controls the way of combining multiple lines in a single text unit:

<cite>Separate lines with line feeds</cite> — multiple lines are extracted like a text run with \n separating the original lines.

<cite>Unwrap lines</cite> — multiple lines are merged in a single text run, a space is inserted in-between the original lines.

<cite>Create inline codes for line breaks</cite> — multiple lines are extracted like a single text run with an inline code containing the original line break and separating the lines.

====Inline Codes====

<cite>Has inline codes as defined below</cite> — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code.

{{CodeFinder Help}}

==Limitations==

None known.

[[Category:Filters]]

Tikal - Miscellaneous Commands

2014-09-09T07:45:56Z

Amake: /* Output Scoping Report */

{{Tikal Common Menu}}
__TOC__
==Segment Files==

This command applies [[SRX|SRX segmentation rules]] to the input files. If the file format supports segmented output (e.g. [[XLIFF Filter|XLIFF]], [[TTX Filter|TTX]]) the result of the segmentation is written in the output files.

You can use the <code>-seg</code> option to specify that the extracted text should be segmented. Use <code>-seg</code> without file name to use the default segmentation rules, use "<code>-seg myRules.srx</code>" to specify your own rules. The rules file must be in SRX format.

The output files have a <code>.out</code> extension prepended to the original extension. For example, if your original file is <code>myFile.html</code>, the output document is <code>myFile.out.html</code>.
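If you need to predict the output name in a script, the naming rule can be reproduced with a few lines of Python. This helper is not part of Tikal, and the behavior for files without an extension is an assumption.

```python
import os.path


def output_name(input_name):
    """Insert .out before the extension of the input file name."""
    root, ext = os.path.splitext(input_name)
    return root + ".out" + ext


print(output_name("myFile.html"))  # myFile.out.html
```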

The syntax of this command is:

-s [options] inputFile [inputFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language for the output (also used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
|nowrap="nowrap"| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
|nowrap="nowrap"|<code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|}

For example:

tikal -s myFile.xlf

Creates an output document named <code>myFile.out.xlf</code> from the input document <code>myFile.xlf</code>. The entries in the output have been segmented according to the default segmentation rules.

==List Filter Configurations==

This command lists all the filter configurations available for Tikal. These are the configurations you can specify for the input files with the <code>-fc</code> option. A filter configuration indicates how to extract the document.

The syntax of this command is:

-lfc | -listconf

For example:

tikal -listconf

Lists all the filter configurations currently available.

==Edit Filter Configurations==

This command edits or views filter configurations.

{{NoteBox|This command requires access to UI editors that are available only if you have one of the okapi-apps platform-specific distributions. If you run this command from the okapi-lib cross-platform distribution you will get an error. To edit filter configurations in the okapi-lib distribution, open the <code>.fprm</code> files directly. Make sure to always save your modifications in UTF-8.}}

The syntax of this command is:

-e [[-fc] configId]

For example:

tikal -e okf_regex@myConfig

Edits the filter configuration <code>okf_regex@myConfig</code>. This is a user configuration for the [[Regex Filter]].

tikal -e

Opens the <cite>[[Filter Configurations]]</cite> dialog box, where all the available configurations are listed and can be viewed or edited, and from where you can create new configurations.

==Output Scoping Report==

This command allows you to output a scoping report including word count, matching statistics, etc.

The report will be output to stdout. The content is the same as the [[Scoping Report Step]]'s default template:
* Date
* File list
* Total word count
* If leveraging is used:
** Exact Local Context match word count
** 100% Match word count
** Fuzzy Match word count
** Repetition word count

The syntax of this command is:

-sr [options] inputFile [inputFile2...]

Available options:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language for the output (also used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
|nowrap="nowrap"| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
|nowrap="nowrap"| <code>-pen tmDirectory|<br/>-tt hostname[:port]|<br/>-gs configFile|<br/>-mm [key]|<br/>-gg configFile|<br/>-apertium [configFile]|<br/>-ms configFile|<br/>-tda configFile</code>
| A translation resource connector to use to translate the document: <code>-pen</code> for the [[Pensieve TM Connector]], <code>-tt</code> for the [[Translate Toolkit TM Connector]], <code>-gs</code> for the [[GlobalSight TM Connector]], <code>-mm</code> for [[MyMemory TM Connector]], <code>-gg</code> for the [[Google MT v2 Connector]], <code>-apertium</code> for the [[Apertium MT Connector]], <code>-ms</code> for the [[Microsoft Translator Connector]], and <code>-tda</code> for the [[TDA Translation Repository Connector]].
|- valign="top"
| <code>-opt threshold</code> || TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.
|- valign="top"
| <code>-maketmx [tmxFile]</code> || Generates a TMX document with all the entries leveraged. You can specify the name of the document; if you do not, it is named <code>pretrans.tmx</code>.
|}

[[Category:Tikal]] [[Category:Filters]] [[Category:Segmentation]]

Tikal

2014-09-09T07:30:57Z

Amake: Document Tikal scoping report

[[Image:Tikal1.png|thumb|Tikal on Macintosh]]
{| cellpadding="8"
|- valign="top"
|
[[Image:TikalIcon.png]]
|
Help Topics:
|
* [[Tikal - Usage|Usage]]
* [[Tikal - Extraction Commands#Extract Files|Extract Files]]
* [[Tikal - Extraction Commands#Merge Files|Merge Files]]
* [[Tikal - Extraction Commands#Extract Files to Moses|Extract Files to Moses]]
* [[Tikal - Extraction Commands#Leverage Files from Moses|Leverage Files from Moses]]
* [[Tikal - Translation Commands#Add Translation to a Resource|Add Translation to a Resource]]
|
* [[Tikal - Translation Commands#Translate Files|Translate Files]]
* [[Tikal - Translation Commands#Query Translation Resources|Query Translation Resources]]
* [[Tikal - Conversion Commands#Convert to PO Format|Convert to PO Format]]
* [[Tikal - Conversion Commands#Convert to TMX Format|Convert to TMX Format]]
* [[Tikal - Conversion Commands#Convert to Table Format|Convert to Table Format]]
* [[Tikal - Conversion Commands#Import into Pensieve TM|Import into Pensieve TM]]
|
* [[Tikal - Conversion Commands#Export TMX from Pensieve TM|Export TMX from Pensieve TM]]
* [[Tikal - Miscellaneous Commands#Segment Files|Segment Files]]
* [[Tikal - Miscellaneous Commands#List Filter Configurations|List Filter Configurations]]
* [[Tikal - Miscellaneous Commands#Edit Filter Configurations|Edit Filter Configurations]]
* [[Tikal - Miscellaneous Commands#Output Scoping Report|Output Scoping Report]]
|}

==Overview==

Tikal is a cross-platform command-line tool that performs some simple localization-related tasks, such as:

* Extract and merge [[XLIFF|XLIFF documents]].
* Query different MT and TM systems.
* Perform format conversions.
* Translate files in various formats directly using TM or MT systems.
* Segment a source text.
* And more...

Like other Okapi applications, Tikal uses the components provided by the Okapi libraries, for example [[Connectors|the connectors to different translation resource engines]] such as [[Apertium MT Connector|Apertium MT]], [[Translate Toolkit TM Connector|Translate Toolkit TM]], [[Google MT v2 Connector|Google MT]], [[Microsoft Translator Connector|Microsoft Translator]], [[MyMemory TM Connector|MyMemory TM]], etc.

Tikal also uses the [[Filters|Okapi filters]], allowing you to process files in many different formats, such as: [[HTML Filter|HTML]], [[OpenOffice Filter|ODT]], [[OpenXML Filter|DOCX]], [[PO Filter|PO]], [[XLIFF Filter|XLIFF]], [[TMX Filter|TMX]], [[XML Filter|XML]], and many more. Its extraction and merge functions let you easily create [[XLIFF|XLIFF documents]].

==Download==

Tikal is available from both the okapi-apps and the okapi-lib distributions:

* Stable release (okapi-apps): http://bintray.com/okapi/Distribution/Okapi_Applications
* Stable release (okapi-lib): http://bintray.com/okapi/Distribution/Okapi_Lib
* Development release (snapshot): http://okapi.opentag.com/snapshots

[[Category:Tikal]]

Text Modification Step

2013-11-22T06:37:17Z

Amake: /* Parameters */

{{Steps Header}}
__TOC__
==Overview==

This step modifies the content of the text units.

Takes: Filter events. Sends: Filter events.

Text units set as non-translatable are not modified.

Text units with existing translations are modified only if requested.

==Parameters==

<cite>Type of change to perform</cite> — Select a kind of change to apply. Several are available:

* Keep the original text
* Replace letters with Xs and digits with Ns.
* Remove text but keep inline codes.
* Replace selected ASCII characters with Extended Latin characters.
* Replace selected ASCII characters with Cyrillic characters.
* Replace selected ASCII characters with Arabic characters.
* Replace selected ASCII characters with Chinese characters.

Note that the result of the character substitution is not meant to have any specific meaning beyond being a set of characters in a given script. This function does not perform a translation, a transliteration, or any other meaningful linguistic operation.
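As an illustration of the simplest substitution, replacing letters with Xs and digits with Ns, here is a Python sketch. It works on plain text only; how the step treats letter case and inline codes is not described on this page, so using an uppercase X for every letter is an assumption, and the function name is hypothetical.

```python
def mask_text(text):
    """Replace every letter with X and every digit with N."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("N")
        elif ch.isalpha():
            out.append("X")
        else:
            out.append(ch)  # spaces and punctuation are kept as they are
    return "".join(out)


print(mask_text("Page 12 of 30."))  # XXXX NN XX NN.
```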

<cite>Add the following prefix</cite> — Set this option to add a prefix at the start of each text unit. Enter the text of the prefix.

<cite>Add the following suffix</cite> — Set this option to add a suffix at the end of each text unit. Enter the text of the suffix.

<cite>Append the name of the item</cite> — Set this option to add the name of each text unit at the end of its value. If the text unit has no name associated, the extraction ID is added instead.

<cite>Append the extraction ID of the item</cite> — Set this option to add the extraction ID of each text unit at the end of its value. Extraction IDs are filter-specific.

<cite>Marks segments with '[' and ']' delimiters</cite> — Set this option to add delimiters to the segments in each text unit. The delimiters are just around the text (after the prefix if one is added, and before the item name, extraction ID, and suffix if they are added). If the text unit is not segmented, the delimiters are added at the front and back of the full content of the text unit.
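The order in which the additions are applied can be pictured with the following Python sketch for a text unit that is not segmented. It assumes the item name, extraction ID and suffix are appended in that order and without separators, which this page does not state explicitly; all names in the code are hypothetical.

```python
def decorate(text, prefix="", suffix="", name=None, extraction_id=None,
             append_name=False, append_id=False, mark_segments=False):
    """Apply prefix, delimiters, name, extraction ID and suffix to a text."""
    body = "[" + text + "]" if mark_segments else text
    parts = [prefix, body]
    if append_name:
        # Without a name, the extraction ID is used instead.
        parts.append(name if name else (extraction_id or ""))
    if append_id:
        parts.append(extraction_id or "")
    parts.append(suffix)
    return "".join(parts)


print(decorate("Hello", prefix="{", suffix="}", name="title",
               append_name=True, mark_segments=True))  # {[Hello]title}
```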

<cite>Expand the text</cite> — Set this option to expand the text. If the content is less than 31 characters it is expanded by 50% or at least one character. If it is longer than 30 characters, it is expanded by 100%. Empty strings are not expanded.

<cite>Modify also the items without text</cite> — Set this option to apply the changes also to text units that have no text (i.e. are empty, or contain only white spaces or codes).

<cite>Modify also the items with an existing translation</cite> — Set this option to apply the changes also to text units that already have a translation (e.g. for multilingual files).

==Limitations==

None known.

[[Category:Steps]]

Cleanup Step

2013-11-22T06:00:43Z

Amake: /* Overview */

{{Steps Header}}
__TOC__
==Overview==

This step cleans strings by normalizing quotes, punctuation, etc., so that they are ready for further processing.

Takes: Filter events. Sends: Filter events.

By default, all whitespace is normalized before any further processing is performed: each run of multiple space, tab, or similar characters is replaced with a single instance.
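A comparable normalization can be written with a regular expression in Python. This sketch assumes that every run of whitespace is collapsed to a single space character; the step's exact replacement character is not specified here, and the function name is hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace characters to one space."""
    return re.sub(r"\s+", " ", text)


print(normalize_whitespace("Hello \t  world"))  # Hello world
```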

==Parameters==

<cite>Normalize quotation marks</cite> — Set this option to replace all quotation marks with straight double quotes (") and all apostrophes with single straight quotes (').
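The effect of this option can be approximated in Python with a translation table. The table below covers only the curly double quotes and the typographic apostrophe; which characters the step actually treats as quotation marks or apostrophes is not listed on this page, so the set is an assumption, as is the function name.

```python
# Assumed character set: curly double quotes and typographic apostrophe.
_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2019": "'",  # typographic apostrophe
})


def normalize_quotes(text):
    """Replace curly quotes and apostrophes with their straight forms."""
    return text.translate(_QUOTES)


print(normalize_quotes("\u201cIt\u2019s here\u201d"))  # "It's here"
```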

<cite>Mark segments matching default regular expressions for removal</cite> — This option is not currently used.

<cite>Mark segments matching user defined regular expressions for removal</cite> — Set this option to remove text units that contain text that matches the user defined regular expression.

<cite>Check for corrupt or unexpected characters</cite> — Set this option to detect and remove text units that contain common corrupt character strings.

<cite>Remove unnecessary segments from text unit</cite> — Set this option to remove text units that have been marked for removal or have no target text.

==Limitations==

Does not work with Asian or bi-directional languages.

[[Category:Steps]]

Tikal - Translation Commands

2013-11-13T05:50:11Z

Amake: /* Translate Files */

{{Tikal Common Menu}}
__TOC__
==Translate Files==

This command creates a pre-translated version of the input files. It is basically the same thing as running an [[Tikal - Extraction Commands#Extract Files|Extract Files command]] (with pre-translation) immediately followed by a [[Tikal - Extraction Commands#Merge Files|Merge Files command]].

By default, some extensions are mapped to a specific filter configuration (for example: <code>.docx</code> to <code>okf_openxml</code>, <code>.odt</code> to <code>okf_openoffice</code>, <code>.po</code> to <code>okf_po</code>, etc.). You can also define your own configuration and specify it using the <code>-fc</code> option. To get a list of all available filter configurations use the [[Tikal - Miscellaneous Commands#List Filter Configurations|List Filter Configurations command]]. For more details on the available filters and their configurations, see each [[Filters|filter's documentation]].

You can use the <code>-seg</code> option to specify that the extracted text should be segmented. Use <code>-seg</code> without file name to use the default segmentation rules, use "<code>-seg myRules.srx</code>" to specify your own rules. The rules file must be in SRX format.

The output files have a <code>.out</code> extension prepended to the original extension. For example, if your original file is <code>myFile.html</code>, the translated document should be <code>myFile.out.html</code>.

The syntax of this command is:

-t [options] inputFile [inputFile2...]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-fc configId</code> || The identifier of the filter configuration to use for the extraction.
|- valign="top"
| <code>-ie encoding</code> || The encoding name of the input files. This is used only if the filter cannot detect the encoding from the input file itself.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language of the input files. [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language for the output (also used in the input if the input documents are multilingual). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-seg [srxFile]</code> || The segmentation rules to utilize. To specify the default rules that come with the installation, use <code>-seg</code> without filename. The default rules are in <code>config/defaultSegmentation.srx</code> in your Okapi main directory.
|- valign="top"
|nowrap="nowrap"| <code>-pen tmDirectory|<br/>-tt [hostname[:port]]|<br/>-gs configFile|<br/>-mm [key]|<br/>-gg configFile|<br/>-apertium [configFile]|<br/>-ms configFile|<br/>-tda configFile|<br/>-bi bilingualFile</code>
| A translation resource connector to use to translate the document: <code>-pen</code> for the [[Pensieve TM Connector]], <code>-tt</code> for the [[Translate Toolkit TM Connector]], <code>-gs</code> for the [[GlobalSight TM Connector]], <code>-mm</code> for [[MyMemory TM Connector]], <code>-gg</code> for the [[Google MT v2 Connector]], <code>-apertium</code> for the [[Apertium MT Connector]], <code>-ms</code> for the [[Microsoft Translator Connector]], <code>-tda</code> for the [[TDA Translation Repository Connector]], and <code>-bi</code> for the [[Bilingual File Connector]].

The leveraging occurs after segmentation, if you have specified segmentation rules.

Note that some Internet-based resources may be slow and result in lengthy processing times. Also be aware that some translation resources may not handle inline codes well.
|- valign="top"
| <code>-opt threshold</code> || TM query option: The threshold is a number between 0 and 100. If this option is not set the default is 95. Note that this option may be limited for some search engines because of the way they are configured.
|- valign="top"
| <code>-maketmx [tmxFile]</code> || Generates a TMX document with all the entries leveraged. You can specify the name of the document; if you do not, it is named <code>pretrans.tmx</code>.
|- valign="top"
|nowrap="nowrap"|<code>-rd rootDirectory</code> || The root directory (by default the user's home directory).
|}

For example:

tikal -t *.html -sl en -tl eo -apertium

Translates all <code>.html</code> files in the current directory from English to Esperanto, using the default Apertium MT server. No segmentation is used.

==Query Translation Resources==

This command queries one or more translation resources for a given text.

You can query all resources at once. When querying several resources, the results are shown per resource, not sorted by best score as a whole.

The syntax of this command is:

-q "text" [options]

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language (language of the text queried). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language (language of the requested translation). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-pen directory</code> || Queries a [[Pensieve TM Connector|Pensieve TM]] stored in a given directory.
|- valign="top"
| <code>-opentran</code> || Queries the [[OpenTran Translation Repository Connector|Open-Tran translation repository]]. This requires Internet access.
|- valign="top"
| <code>-gs configFile</code> || Queries a [[GlobalSight TM Connector|GlobalSight TM]] server. This requires Internet access.
|- valign="top"
| <code>-tt [hostname[:port]]</code> || Queries the specified [[Translate Toolkit TM Connector|Translate Toolkit TM]] server. The server can be local or remote.
|- valign="top"
| <code>-mm [key]</code> || Queries the [[MyMemory TM Connector|MyMemory TM]] with an optional access key (use <code>mmDemo123</code> for demo). The key is for backward compatibility. This requires Internet access.
|- valign="top"
| <code>-gg configFile</code> || Queries the [[Google MT v2 Connector|Google MT paid service]]. This requires Internet access. The <code>-google</code> parameter also works like <code>-gg</code> (it used to invoke v1 of Google Translate, which has been discontinued).
|- valign="top"
| <code>-apertium [configFile]</code> || Queries the specified [[Apertium MT Connector|Apertium MT]] server (local or remote). A default remote server is provided.
|- valign="top"
| <code>-ms configFile</code> || Queries the [[Microsoft Translator Connector|Microsoft Translator service]]. This requires Internet access.
|- valign="top"
| <code>-tda configFile</code> || Queries the [[TDA Translation Repository Connector|TDA translation repository]]. This requires Internet access.
|- valign="top"
| <code>-bi bilingualFile</code> || Queries a [[Bilingual File Connector|bilingual file]].
|- valign="top"
|nowrap="nowrap"| <code>-opt threshold[:maxhits]</code> || TM query options: The threshold is a number between 0 and 100. The maximum number of hits is a number above 0. If this option is not set each TM engine uses its own defaults. If this option is set, all TM engines are set to use the specified options. Note that parameters of some engines may be limited by their server-side configuration.
|}

Note: Because the text of the query cannot be associated with a given file format, there is no support for format-specific inline codes. However, when querying a resource that is inline-code aware, you can use HTML-like tags to represent codes. For example, in "<code><nowiki>Open the <x>window</x><x/>.</nowiki></code>" the tags "<code><x></code>", "<code></x></code>" and "<code><x/></code>" are interpreted as opening, closing and placeholder inline codes, and the query is processed as such. When querying resources that are not inline-code aware, the tags are treated as plain text. You can use any well-formed XML syntax, not necessarily an element <code>x</code>.
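The classification of tags described in the note above can be illustrated with a short Python sketch. This is not how Tikal parses queries; the helper name <code>classify_tags</code> and the regular expression are assumptions made only for this example.

```python
import re

# Hypothetical helper, not part of Okapi: classify XML-like tags in a query
# string as opening, closing, or placeholder inline codes.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*(/?)>")

def classify_tags(query):
    kinds = []
    for m in TAG.finditer(query):
        closing, _name, empty = m.groups()
        if empty:
            kinds.append((m.group(0), "placeholder"))  # e.g. <x/>
        elif closing:
            kinds.append((m.group(0), "closing"))      # e.g. </x>
        else:
            kinds.append((m.group(0), "opening"))      # e.g. <x>
    return kinds

for tag, kind in classify_tags("Open the <x>window</x><x/>."):
    print(tag, kind)
```

Any element name works in this sketch, which mirrors the statement that the element does not have to be <code>x</code>.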

Examples:

tikal -q "open file" -sl en

Queries the default translation resource ([[OpenTran Translation Repository Connector|Open-Tran]]) for the text "<code>open file</code>" in English. The target language by default is French. Note: You could omit the <code>-sl</code> option if you are running from an English system.

tikal -q "open <x>file</x>" -sl en -pen mytm -opt 60:20

Queries the Pensieve TM located in <code>mytm</code> for the text "<code>open <x>file</x></code>" in English. The target language by default is French. Because Pensieve TM can work with inline codes, the tags "<code><x></code>" and "<code></x></code>" are processed as inline codes. The threshold is set to 60 and the maximum hits is set to 20.

tikal -q "open file" -ms appid.key

Queries the [[Microsoft Translator Connector|Microsoft Translator engine]] for the English text "<code>open file</code>" in French. The file <code>appid.key</code> contains your Microsoft AppId value. Those keys are free, but you need to register to have one. See http://www.bing.com/developers/appids.aspx for details.

tikal -q "open file" -opentran -sl en -tl zu

Queries the Open-Tran translation repository for the English text "<code>open file</code>" in Zulu.

tikal -q "open file" -tt localhost:8080 -sl en -tl af

Queries a local Translate Toolkit TM server located on <code><nowiki>http://localhost:8080</nowiki></code>. The source is English and the requested translation is Afrikaans.

tikal -q "Data type" -tda myTDAInfo.cfg -sl en-us -tl fr-fr

Queries the TDA translation repository to get the French translation of the US English text "<code>Data type</code>". The file <code>myTDAInfo.cfg</code> holds the options and credentials to access the repository.

==Add Translation to a Resource==

This command adds a translation (source and target text) to a given translation resource.

For now this command is implemented only for the [[Microsoft Translator Connector|Microsoft Translator]] resource.

The syntax of this command is:

-a "source text" "target text" [options] -ms configFile

Where the options are:

{| border="1" cellpadding="5" cellspacing="0"
|- valign="top"
| <code>N</code> || The rating to associate with the translation. The value must be between 1 and 10 (inclusive). By default it is set to 6. MT results generally have a rating of 5.
|- valign="top"
| <code>-sl srcLang</code> || The code of the source language (language of the source text). [[Tikal - Usage#Source and Target Languages|See more details...]]
|- valign="top"
| <code>-tl trgLang</code> || The code of the target language (language of the target text). [[Tikal - Usage#Source and Target Languages|See more details...]]
|}

Note: Because the provided text cannot be associated with a given file format, you should use HTML-like tags to represent codes. For example, in "<code><nowiki>Open the <x>window</x><x/>.</nowiki></code>" the tags "<code><x></code>", "<code></x></code>" and "<code><x/></code>" are interpreted as opening, closing and placeholder inline codes, and the text is processed as such.

Examples:

tikal -a "Text to add" "Texte à ajouter" -sl en -tl fr -ms myConfig.cfg

Adds the pair "Text to add" + "Texte à ajouter" to the [[Microsoft Translator Connector|Microsoft Translator]]'s repository. The source language is English and the target language is French. The file <code>myConfig.cfg</code> contains the parameters to access the engine.

[[Category:Tikal]] [[Category:Connectors]] [[Category:Filters]]

Bilingual File Connector

2013-11-13T05:48:41Z

Amake: /* Parameters */

{{Connectors Header}}
__TOC__
==Overview==

This connector allows one to directly query any Okapi-supported bilingual file format.

Under the hood it imports the specified file to a temporary [[Pensieve TM]]; all queries are actually serviced by the [[Pensieve TM Connector]].

==Parameters==

<cite>Bilingual file</cite> — The path of the file to use. The file must be in a bilingual format such as TMX or PO.

<cite>Input encoding</cite> — The encoding of the bilingual file. This is only used if the filter cannot auto-detect the encoding. The default depends on your system.

==Limitations==

* Okapi must be able to auto-detect the appropriate file filter for the bilingual file. In practice this means that the file must have the correct extension (e.g. <code>.tmx</code> for TMX files, etc.).

* Custom filter configurations are currently not supported.

[[Category:Connectors]]

Wiki Filter

2013-05-24T07:57:14Z

Amake: /* Parameters */

{{Filters Header}}
==Overview==

The Wiki Filter is an Okapi component for extracting translatable text from wiki markup. Currently the only supported style of markup is [https://www.dokuwiki.org/dokuwiki Dokuwiki].

=== <span class="hi">Header</span> ===
<span class="hi">Paragraph</span>
* <span class="hi">List item</span>
* <span class="hi">List item</span>

<nowiki>{{image.jpg|</nowiki><span class="hi">Image caption</span>}}

^ <span class="hi">Table header 1</span> ^ <span class="hi">Table header 2</span> |
| <span class="hi">Table cell 1</span> | <span class="hi">Table cell 2</span> |

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when opening the document.
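The decision logic above can be sketched in a few lines of Python. This is only an illustration of BOM-based detection, not the filter's actual code; the function name <code>choose_encoding</code> and the set of BOMs checked are assumptions.

```python
import codecs

# BOMs are checked longest first so that UTF-32 LE is not mistaken for
# UTF-16 LE (both start with the bytes FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def choose_encoding(data, default_encoding):
    """Return the encoding announced by a BOM, else the caller's default."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default_encoding

print(choose_encoding(codecs.BOM_UTF8 + b"== Header ==", "windows-1252"))
print(choose_encoding(b"== Header ==", "windows-1252"))
```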

===Inline Codes===
All Dokuwiki syntax described [https://www.dokuwiki.org/syntax here] is supported.

==Parameters==

Prevent the filter from collapsing whitespace by setting <code>preserve_whitespace: true</code>.

You may define custom inline codes as follows:

custom_codes:
- pattern: "'''REGEX_PATTERN'''"
- {start_pattern: "'''REGEX_PATTERN'''", end_pattern: "'''REGEX_PATTERN'''"}

Specify just <code>pattern</code> for a placeholder tag; specify a <code>start_pattern</code> and <code>end_pattern</code> for opening/closing tag pairs.

{| class="wikitable"
!Item
!Description
!Example value
|-
|<code>REGEX_PATTERN</code>
|Any valid regex that matches non-zero-width runs of text within the document. Matches are turned into inline codes as described above.
|<code><nowiki>\[(path|menu)[^\]]*\]</nowiki></code>
|}
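To check what a custom pattern would capture before putting it in the configuration, you can try it on sample text. The following Python sketch uses the example value from the table; the sample sentence is invented, and the sketch assumes this simple pattern behaves the same in Python as in the regex engine used by the filter.

```python
import re

# Example value from the table above: matches bracketed tags that start
# with "path" or "menu".
pattern = re.compile(r"\[(path|menu)[^\]]*\]")

# Invented sample text for illustration only.
text = "Open [menu File > Save] and pick [path /tmp/out.txt] as the target."
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Each match would become a single placeholder code if the pattern were listed under <code>pattern</code>.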

==Limitations==

* Attributes of inline codes (link and image URLs, etc.) are not exposed for translation or special processing.
* Embedded HTML, PHP, etc., is not extracted.

[[Category:Filters]]

Template:Steps Header

2013-03-05T08:16:02Z

Amake:

<div style="float: right; border: 1px solid; background: white; margin:1em; margin-top:1px; padding:0.5em">
<div style="padding: 0.2em; padding-right: 0.5em;">
* [[Steps|Steps List and Overview]]
* [[Knowledge Base#Pipelines and Steps|Articles on Pipelines and Steps in the Knowledge Base]]
* [[Rainbow - Utilities|All Rainbow Utilities]]
</div>
</div>

Rainbow - Main Window

2013-01-31T03:18:30Z

Amake: /* Root */

{{Rainbow Common Menu}}
__TOC__
===Input List Tabs===

====Root====

Each input list is associated with a root. Each list can have a different root. This root is displayed above the tabs, allowing you to see it even when another type of tab is selected.

To change a root:

* Select <cite>Edit Root</cite> in the <cite>Input</cite> menu (or the context menu),
* or press <tt>F2</tt>,
* or click on the browse button near the root.

To use the directory where the project is saved as the root, leave the root field empty. If the project is not saved yet, the root will point to your home directory.

When a new project has not been saved yet, the default root is the user's home directory. The first time you add a document to one of the input lists, Rainbow will automatically change the root of that list as well as the <cite>Filter Parameters</cite> option to the directory where the added document is located.

If there is already an input document in the list and you add one that cannot use the root currently defined, the tool will try to adjust the root automatically (and all the relative paths of the documents already in the list). If the documents are on different shares or drives you may not be able to adjust the root.

Expansion of system environment variables such as <code>${HOME}</code>, as well as the Okapi-specific <code>${rootDir}</code> (the location of the .rnb file when saved; the user's home directory when unsaved), is supported.
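As a rough illustration of this kind of variable expansion, here is a Python sketch. It is not Rainbow's implementation; the function name <code>expand_root</code> is hypothetical, and leaving unknown variables untouched is an assumption, since the behavior for undefined variables is not described here.

```python
from string import Template

def expand_root(value, env, root_dir):
    """Expand ${NAME} references using env plus the Okapi-specific rootDir."""
    variables = dict(env)
    variables["rootDir"] = root_dir
    # safe_substitute leaves unknown variables as-is instead of raising
    # (an assumption made for this sketch).
    return Template(value).safe_substitute(variables)

env = {"HOME": "/home/ana"}
print(expand_root("${rootDir}/input", env, "/projects/demo"))
print(expand_root("${HOME}/docs", env, "/projects/demo"))
```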

====Input Documents====

To add a document in a list:

* Select <cite>Add Documents</cite> in the <cite>Input</cite> menu (or the context menu),
* or press <tt>Ctrl+Insert</tt>,
* or drag the documents and drop them on the list.

You can move the documents up and down in the list (for example to align them with another list) by using the commands <cite>Move Up</cite> and <cite>Move Down</cite> in the <cite>Input</cite> menu (or the context menu), or by pressing <tt>Alt+Up</tt> and <tt>Alt+Down</tt>.

To select all documents in the list press <tt>Ctrl+A</tt>.

To remove documents from a list: Select the documents you want to remove and select <cite>Remove Documents</cite> in the <cite>Input</cite> menu (or the context menu), or press <tt>Delete</tt>. You will be prompted to confirm the removal. Removing a document from the list does not affect the real document in any way.

To associate a document with a specific filter configuration: Select the document and select <cite>Edit Document Properties</cite> in the <cite>Input</cite> menu (or context menu), or press <tt>Alt+Enter</tt>, or double-click on the <cite>Filter Configuration</cite> column. This will open the <cite>Input Document Properties</cite> dialog box where you can make changes.

===Languages and Encodings Tab===

====Source====

<cite>Language</cite> — Enter or select the code of the source language.

The language codes can be any type of BCP-47 code, or POSIX locales.

Note: Locales and languages. Nowadays there is not much difference between a language code and a locale code, as the language tags of BCP-47 include sub-tags that represent various regional or special variants, as well as script differences. For example, <code>es-419</code> stands for Latin American Spanish, <code>zh-Hant-TW</code> for Traditional Chinese used in Taiwan, etc. For more information about BCP-47 see http://tools.ietf.org/html/rfc4647, and this overview: http://www.w3.org/International/articles/language-tags/. See also the "[[How to Add Languages to Rainbow]]" article.

The terms locale and language are used interchangeably in Rainbow's interface.

<cite>Encoding</cite> — Enter or select the name of the encoding for the source language. This is the default source encoding. You can override this value for each input document in the <cite>Input Document Properties</cite> dialog.

You can obtain a list of all encodings supported by your system with the <cite>Tools</cite> > <cite>List Available Encodings</cite> command.

When used as an input encoding, the encoding defined here and in the <cite>Input Document Properties</cite> dialog may be superseded by the encoding automatically detected in the input file.

====Target====

<cite>Language</cite> — Enter or select the code of the target language.

<cite>Encoding</cite> — Enter or select the name of the encoding for the target language. This is the default target encoding. You can override this value for each input document in the <cite>Input Document Properties</cite> dialog.

===Other Settings Tab===

====Output====

Some utilities generate one output document for each input document (e.g. Encoding Conversion). You can choose the location and the name of these output files here.

<cite>Use this root</cite> — Set this option to replace the input root by the root entered in this field.

<cite>Custom sub-folder</cite> — Set this option to add an extra sub-directory between the root and the relative path of the output file. For example:

<pre><nowiki>
Original input root = C:\myProject
Relative input path = mySubDir\myFile.ext
Use this root = D:\myOutput
Custom sub-folder = ExtraSubDir

Before = C:\myProject\mySubDir\myFile.ext
After = D:\myOutput\ExtraSubDir\mySubDir\myFile.ext
</nowiki></pre>
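The way the output path is assembled in this example (output root, then the optional custom sub-folder, then the relative input path) can be sketched in Python. The function name <code>output_path</code> is hypothetical and the code only illustrates the composition; it is not taken from Rainbow.

```python
from pathlib import PureWindowsPath

def output_path(relative_input, output_root, sub_folder=""):
    """Join the output root, an optional extra sub-directory, and the relative path."""
    parts = [output_root]
    if sub_folder:
        parts.append(sub_folder)
    parts.append(relative_input)
    return PureWindowsPath(*parts)

print(output_path(r"mySubDir\myFile.ext", r"D:\myOutput", "ExtraSubDir"))
```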

<cite>Use an extension</cite> — Set this option to use a specific extension in the output file. Enter the extension (with its period) in the field.

<cite>Replace</cite> — Set this option to replace the extension of the input filename by the one defined here. (For example: "<code>file.old</code>" becomes "<code>file.new</code>"). The old extension is the last part of the text preceded by a period. (For example: "<code>file.old1.old2</code>" becomes "<code>file.old1.new</code>").

<cite>Append</cite> — Set this option to add the extension defined here at the end of the existing one. (For example: "<code>file.old</code>" becomes "<code>file.old.new</code>").

<cite>Prepend</cite> — Set this option to place the extension defined here just before the existing one. (For example: "<code>file.old</code>" becomes "<code>file.new.old</code>").
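The three extension modes can be compared side by side with a small Python sketch. The function name <code>apply_extension</code> is hypothetical, and the handling of file names without any extension is an assumption that is not described here.

```python
def apply_extension(filename, new_ext, mode):
    """Apply new_ext (with its leading period, e.g. ".new") to filename."""
    stem, dot, old = filename.rpartition(".")
    if not dot:
        # Assumption: a name without an extension simply gets the new one.
        return filename + new_ext
    if mode == "replace":
        return stem + new_ext              # file.old -> file.new
    if mode == "append":
        return filename + new_ext          # file.old -> file.old.new
    if mode == "prepend":
        return stem + new_ext + "." + old  # file.old -> file.new.old
    raise ValueError("unknown mode: " + mode)

for mode in ("replace", "append", "prepend"):
    print(mode, apply_extension("file.old", ".new", mode))
```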

<cite>Add the following prefix</cite> — Set this option to add a specific prefix text at the beginning of the filename. Enter the prefix string.

<cite>Add the following suffix</cite> — Set the option to add a specific suffix at the end of the filename (before any '.' character).

<cite>Replace this text</cite> — Set this option to replace the text given in this field by the one given in the <cite>By this text</cite> field.

<cite>By this text</cite> — Enter the text to replace the one provided in the <cite>Replace this text</cite> field.

Note that replacements are done on the full path (i.e. in the directory part as well as the filename part).

Anywhere in the output fields you can use a set of variables based on the source and target locales. They will be replaced by their corresponding runtime equivalents:

{{Locales Variables}}

====Filter Parameters====

<cite>Use custom parameters folder</cite> — Set this option to define an absolute location for the directory where the filter parameters files reside. If this option is not set, the filter parameters files are expected to be in the project directory.

[[Category:Rainbow]]

Doxygen Filter

2012-10-10T08:17:51Z

Amake: /* Whitespace */

{{Filters Header}}
==Overview==

The Doxygen Filter is an Okapi component for extracting [http://www.stack.nl/~dimitri/doxygen/ Doxygen]-style comments from source code. An example:

/*! <span class="hi">A test class</span> */
class Test
{
public:
/** <span class="hi">An enum type.</span>
* <span class="hi">The documentation block cannot be put after the enum!</span>
*/
enum EnumType
{
EVal1, /**< <span class="hi">enum value 1</span> */
EVal2 /**< <span class="hi">enum value 2</span> */
};
void member(); //!< <span class="hi">a member function.</span>
protected:
int value; /*!< <span class="hi">an integer value</span> */
};

C++-style (<code>///</code>), Javadoc-style (<code>/**</code>), Qt-style (<code>/*!</code>), and Python-style (<code><nowiki>'''</nowiki></code> or <code>"""</code>) comment blocks are supported.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when opening the document.

===Inline Codes===

The full set of Doxygen [http://www.stack.nl/~dimitri/doxygen/commands.html special commands], [http://www.stack.nl/~dimitri/doxygen/htmlcmds.html HTML commands], and [http://www.stack.nl/~dimitri/doxygen/xmlcmds.html XML commands] is recognized and interpreted. For instance,

/*! \class Test class.h "inc/class.h"
* \brief This is a test class.
*
* Some details about the Test class
*/

will be extracted to the following Text Units:

# <code><1/><2/> This is a test class.</code>
# <code>Some details about the Test class</code>

===Line Numbers===

The filter preserves line numbers so that a one-to-one correspondence between source line number and translated line number is maintained.

==Parameters==

Supported Doxygen commands are listed in one of three categories:

* <code>custom_commands</code>
* <code>doxygen_commands</code>
* <code>html_commands</code>

You can customize the behavior of the filter by editing existing entries or adding new ones. An example <code>doxygen_commands</code> entry:

doxygen_commands:
'''COMMAND_NAME''':
type: '''TYPE'''
inline: '''INLINE'''
pair: '''PAIR_CMD_NAME'''
translatable: '''CMD_TRANSLATABLE'''
parameters:
- name: '''PARAM_NAME'''
length: '''LENGTH'''
required: '''REQUIRED'''
translatable: '''PARAM_TRANSLATABLE'''
- ...

Replace '''bold''' items above with custom data conforming to the following.

{| class="wikitable"
!Item
!Description
!Example value
|-
|<code>COMMAND_NAME</code>
|The name of the command as it will appear in the Doxygen comment, without any prefix or suffix bits. E.g. <code>\code{.py}</code> should be <code>code</code>. Case-sensitive.
|<code>code</code>
|-
|<code>TYPE</code>
|The "type" of the command, specifically one of <code>PLACEHOLDER</code>, <code>OPENING</code>, or <code>CLOSING</code>.
|<code>PLACEHOLDER</code>
|-
|<code>INLINE</code>
|Whether the command should be considered an inline item (<code>true</code>) or a block-level element (<code>false</code>). Default: <code>false</code>.
|<code>true</code>
|-
|<code>PAIR_CMD_NAME</code>
|For <code>OPENING</code>- and <code>CLOSING</code>-type commands, this identifies the paired command. E.g. <code>\code</code> is paired with <code>\endcode</code>, so for <code>code</code> we have <code>pair: endcode</code>. Not required for <code>PLACEHOLDER</code> commands.
|<code>endcode</code>
|-
|<code>CMD_TRANSLATABLE</code>
|Indicates whether the entire content of the command is translatable or not. This is intended for block-level <code>OPENING</code> commands that delimit entire blocks such as <code>\code</code>. Default: <code>true</code>.
|<code>true</code>
|-
|<code>PARAM_NAME</code>
|The name of a parameter. This is for organizational purposes only, and is not used by the filter.
|<code>name</code>
|-
|<code>LENGTH</code>
|The length of the parameter, specifically one of <code>WORD</code>, <code>LINE</code>, <code>PHRASE</code> or <code>PARAGRAPH</code>. These map to the designations described at the top of the [http://www.stack.nl/~dimitri/doxygen/commands.html special commands page], except for <code>PHRASE</code> which indicates a string bounded by double quotes like <code>"image caption"</code>.
|<code>WORD</code>
|-
|<code>REQUIRED</code>
|Whether the parameter is required (<code>true</code>) or optional (<code>false</code>). This affects how aggressively the filter tries to interpret the text that follows as a parameter. Default: <code>true</code>.
|<code>true</code>
|-
|<code>PARAM_TRANSLATABLE</code>
|Indicates whether the parameter is translatable (<code>true</code>) or not (<code>false</code>). Each parameter may be set independently, though untranslatable parameters following translatable ones will be recorded as separate inline codes. Default: <code>true</code>.
|<code>true</code>
|}

Note:
* The <code>parameters</code> listing is optional.
* When present, parameters should be listed in the order in which they are written following the command.
* Parameters with non-whitespace delimiters (e.g. <code>'''.py'''</code> in <code>\code{'''.py'''}</code>) are not currently supported.

You may also define custom commands as follows (all of the above options except <code>COMMAND_NAME</code> are supported; the following is a minimal case):

custom_commands:
- pattern: "'''REGEX_PATTERN'''"
type: '''TYPE'''
...

{| class="wikitable"
!Item
!Description
!Example value
|-
|<code>REGEX_PATTERN</code>
|Any valid regex that matches non-zero-width runs of text within the comment. Matches will be turned into codes according to the parameters as described above.
|<code>###ACCESS_CHECKS###.*?;</code>
|}
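The example pattern relies on a non-greedy quantifier, so the match stops at the first semicolon. The Python sketch below shows this on an invented comment; it assumes the pattern behaves the same in Python as in the regex engine used by the filter.

```python
import re

# Example value from the table above.
pattern = re.compile(r"###ACCESS_CHECKS###.*?;")

# Invented comment text for illustration only.
comment = "Checks rights. ###ACCESS_CHECKS### user, admin; Returns true; otherwise false."
match = pattern.search(comment)
print(match.group(0))
```

Only the text up to and including the first <code>;</code> is part of the match; the rest of the comment stays available as text.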

===Whitespace===
Prevent the filter from collapsing whitespace by setting <code>preserve_whitespace: true</code>.

==Limitations==

* Single linebreaks in a text run that are not part of a Doxygen command are collapsed. No effort is made to enforce a maximum line width upon output, so essentially each translatable paragraph will be collapsed to a single (potentially very long) line.
* Command parameters with non-whitespace delimiters (e.g. <code>'''.py'''</code> in <code>\code{'''.py'''}</code>) are not currently supported.
* Non-translatable command parameters are not exposed for any special processing.

[[Category:Filters]]

Regex Filter

2012-10-10T06:11:12Z

Amake: /* Overview */ Link fix

{{Filters Header}}
==Overview==

The Regex Filter is an Okapi component that implements the <code>IFilter</code> interface for any text-based format where the text can be captured using [[Regular Expressions|regular expressions]]. The filter is implemented in the class <code>net.sf.okapi.filters.regex.RegexFilter</code> of the library.

The filter can work with any text-based document. You define rules with regular expressions that indicate what part of the document to process. Each rule is associated with an action telling the filter what to do with the different capturing groups of its regular expression.

For example, if you have the following input document:

[ID1]=Text for ID1
[ID2]:Text for ID2

...and a rule with the following regular expression:

^\[(.*?)](=|:)(.*?)$

...and that rule is set to the action <cite>Extract the content</cite> and has the capturing group 3 assigned to the source group and the capturing group 1 assigned to the identifier group.

...then:

* Each line in the input document will match the rule.
* A new text unit will be created for each match, with its name set to the content of the capturing group 1, and its source text set to the content of the capturing group 3.

[<span class="green">ID1</span>]=<span class="hi">Text for ID1</span>
[<span class="green">ID2</span>]:<span class="hi">Text for ID2</span>

^\[<span class="green">(.*?)</span>](=|:)<span class="hi">(.*?)</span>$

And if you were to represent the parsed information in XLIFF, it would look something like this:

...
<body>
<trans-unit id="1" resname="<span class="green">ID1</span>" xml:space="preserve">
<source xml:lang="en"><span class="hi">Text for ID1</span></source>
</trans-unit>
<trans-unit id="2" resname="<span class="green">ID2</span>" xml:space="preserve">
<source xml:lang="en"><span class="hi">Text for ID2</span></source>
</trans-unit>
</body>
...
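The same rule can be tried outside the filter. The following Python sketch applies the regular expression from the example to the two input lines and builds one record per match, with group 1 as the name and group 3 as the source text. The dictionary layout is an assumption chosen for readability; it is not an Okapi data structure.

```python
import re

# The rule from the example; MULTILINE makes ^ and $ work on each line.
RULE = re.compile(r"^\[(.*?)](=|:)(.*?)$", re.MULTILINE)

document = "[ID1]=Text for ID1\n[ID2]:Text for ID2\n"

# Group 1 is the identifier group, group 3 is the source group.
units = [{"name": m.group(1), "source": m.group(3)} for m in RULE.finditer(document)]

for unit in units:
    print(unit["name"], "->", unit["source"])
```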

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the file has a Unicode Byte-Order-Mark:
** Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
* Otherwise, the input encoding used is the default encoding that was specified when opening the document.

===Output Encoding===

The filter does not recognize any encoding declarations in the document, and therefore cannot update them.

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

===Line-Breaks===

The type of line-breaks of the output is the same as the one of the original input.

===Parsing===

Here is how an input document is parsed:

# The filter sets the current search position at the top of the document.
# It searches for the first possible rule that has a match from the current position.
# It takes the match and applies whatever action is associated with the rule.
# It moves the current search position to the end of the match.
# Steps 2, 3, and 4 are repeated until no more matches are found or the search position reaches the end of the document.
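The parsing steps above can be sketched as a simple loop. This Python version is only an approximation of the behavior described: the function name <code>parse</code>, the representation of rules as (regex, action) pairs, and the choice of the earliest-starting match (with ties going to the rule listed first) are all assumptions, not the filter's actual code.

```python
import re

def parse(document, rules):
    """rules: list of (compiled_regex, action) pairs, in rule-list order."""
    events = []
    pos = 0  # Step 1: start at the top of the document.
    while pos < len(document):
        # Step 2: find the match that starts earliest from pos
        # (assumption: ties go to the rule listed first).
        best = None
        for regex, action in rules:
            m = regex.search(document, pos)
            if m and (best is None or m.start() < best[0].start()):
                best = (m, action)
        if best is None:
            break  # No more matches.
        m, action = best
        events.append(action(m))  # Step 3: apply the rule's action.
        # Step 4: continue after the match; skip ahead on empty matches.
        pos = m.end() if m.end() > pos else pos + 1
    return events

rules = [
    (re.compile(r"^#.*$", re.M), lambda m: ("skip", m.group(0))),
    (re.compile(r"^(\w+)=(.*)$", re.M), lambda m: ("text_unit", m.group(1), m.group(2))),
]
events = parse("# comment\nkey=value\n", rules)
print(events)
```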

===Actions===

Each rule is associated with one of several possible actions. Depending on the action, you can associate different parts of the text that matches the rule with a specific role. This is done with capturing groups: the source group, the target group, the identifier group and the note group.

A capturing group is a part of the regular expression between parentheses. The capturing group 0 is the whole match, then other capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression (A)(B(C)) there are three groups:

# (A)
# (B(C))
# (C)
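The numbering can be verified quickly in Python, whose <code>re</code> module numbers groups in the same left-to-right order of opening parentheses. The input string <code>ABC</code> is just a sample.

```python
import re

m = re.fullmatch(r"(A)(B(C))", "ABC")

# Group 0 is the whole match; groups 1 to 3 follow the opening parentheses.
groups = [m.group(i) for i in range(m.re.groups + 1)]
print(groups)  # ['ABC', 'A', 'BC', 'C']
```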

The following table summarizes what each action does, and which groups it may use:

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Action''' || '''Effect''' || '''Source''' || '''Target''' || '''Identifier''' || '''Note'''
|- valign="top"
| <cite>Extract the strings in the source group</cite>
| Sends a <code>TEXT_UNIT</code> event for each string found in the source capturing group.
| Must be defined. It is where the string or strings to extract are taken from.
| Not used.
| If defined: It is the name for the first text unit. If there is more than one string to extract, a sequential number (starting at 2) is appended to it, and used as the name of the other text units.
| If defined: It is the ''note'' property associated to each text unit corresponding to each extracted string.
|- valign="top"
| <cite>Extract the content of the source group</cite>
| Sends a single <code>TEXT_UNIT</code> event based on the different capturing groups.
| Must be defined. It is the source text of the text unit.
| If defined: It is the target text of the text unit.
| If defined: It is the name of the text unit.
| If defined: It is the note property associated to the text unit.
|- valign="top"
| <cite>Treat the source group as comment</cite>
| Processes the source capturing group for localization directives (if requested) and leaves the content of the whole expression's match untouched.
| Must be defined. It is processed for localization directives if that option is set.
| Not used.
| Not used.
| Not used.
|- valign="top"
| <cite>Do not extract</cite>
| Leaves the content of the whole expression's match untouched.
| Not used.
| Not used.
| Not used.
| Not used.
|- valign="top"
| <cite>Start a section</cite>
| Sends a <code>START_GROUP</code> event. If the option <cite>Auto-close previous section when a new one starts</cite> is set, you '''must not''' define a corresponding end section. If that option is not set, you '''must''' define a rule to close this section.
| Not used.
| Not used.
| If defined: It is the name of the section being opened. A section corresponds to a <code><group></code> in XLIFF.
| If defined: It is the note property associated to the section being opened.
|- valign="top"
| <cite>End a section</cite>
| Sends an <code>END_GROUP</code> event.
| Not used.
| Not used.
| Not used.
| Not used.
|}
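
As an illustration of the naming scheme used by <cite>Extract the strings in the source group</cite>, the Python sketch below pairs each extracted string with a text unit name. It is not the filter's code: the helper <code>name_text_units</code>, the sample rule, and the quoted-string pattern are hypothetical, and the sketch assumes the sequential number is appended directly to the identifier with no separator.

```python
import re

def name_text_units(identifier, strings):
    """Pair each extracted string with a text unit name."""
    units = []
    for index, text in enumerate(strings, start=1):
        # The first unit keeps the identifier; the others get 2, 3, ...
        # (assumption: the number is appended without a separator)
        name = identifier if index == 1 else f"{identifier}{index}"
        units.append((name, text))
    return units

# Hypothetical rule: group 1 is the identifier, group 2 is the source group.
rule = re.compile(r"^(\w+)\s*=\s*(.*)$", re.MULTILINE)
m = rule.search('colors = "red", "green", "blue"')

# Simplified string detection: double-quoted strings without escapes.
strings = re.findall(r'"([^"]*)"', m.group(2))
units = name_text_units(m.group(1), strings)
```

With this input, the three strings are named <code>colors</code>, <code>colors2</code>, and <code>colors3</code>.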

==Parameters==

===Rules Tab===

<cite>Add</cite> — Click this button to add a new rule to the list. This opens the <cite>Edit Rule</cite> dialog box with the new rule.

<cite>Rename</cite> — Click this button to rename the rule currently selected. Note that two rules can have the same name, but this is not recommended.

<cite>Remove</cite> — Click this button to delete the rule currently selected from the list. No confirmation is requested.

<cite>Edit</cite> — Click this button to edit the rule currently selected. This opens the <cite>Edit Rule</cite> dialog box.

<cite>Move Up</cite> — Click this button to move the rule currently selected up in the list. Rules are evaluated in the order of the list.

<cite>Move Down</cite> — Click this button to move the rule currently selected down in the list. Rules are evaluated in the order of the list.

====Rule properties====

<cite>Preserve white spaces</cite> — Set this option to preserve all white spaces of the extracted text. If this option is not set, the extracted content is unwrapped: any sequence of consecutive white spaces is replaced by a single space character, and any white space at the start or the end of the content is trimmed. White spaces here are: spaces, tabs, carriage returns, and line-feeds.
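
The unwrapping described above can be sketched as follows. This is a minimal Python illustration of the behavior, not the filter's implementation; the function name <code>unwrap</code> is hypothetical.

```python
import re

def unwrap(text):
    """Collapse runs of white space and trim both ends."""
    # White spaces are spaces, tabs, carriage returns, and line-feeds.
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    # After collapsing, at most one space remains at each end.
    return collapsed.strip(" ")

unwrapped = unwrap("  Hello\r\n\tworld  \n")
```

In this example <code>unwrapped</code> is <code>Hello world</code>.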

<cite>Has inline codes</cite> — Set this option to enable the conversion of some part of the extracted text into inline codes.

<cite>Edit Inline Codes Patterns</cite> — Click this button to open the <cite>Inline Codes Patterns</cite> dialog box where you can define rules for converting parts of text into inline codes.

<cite>Auto-close previous section when a new one starts</cite> — Set this option to automatically close any open section when a new one starts. Sections are defined with the <cite>Start a section</cite> action. This option allows you to define only the start of sections. If this option is not set, each <cite>Start a section</cite> action must have a corresponding <cite>End a section</cite> action.

====Regular expressions options====

These options apply to all rules defined in the list. If you need to override an option for a given rule, use the <code>(?idmsux-idmsux)</code> construct in the pattern for that rule.

<cite>Dot also matches line-feed</cite> — Set this option to enable the dot operator to match line-feeds.

<cite>Multi-line</cite> — Set this option so the expressions <code>^</code> and <code>$</code> match just after or just before, respectively, a line terminator or the end of the input sequence. If this option is not set, these expressions only match at the beginning and the end of the entire input sequence.

<cite>Ignore case differences</cite> — Set this option to ignore differences between letter cases. If this option is set, "<code>abc</code>" is treated as identical to "<code>Abc</code>". If this option is not set, the two strings are treated as different.
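
The effect of these three options can be tried out with the equivalent flags of Python's <code>re</code> module. This is an illustration only: the filter uses Java regular expressions, and the way an option is overridden inside a pattern differs slightly between the two engines (Python requires the scoped form <code>(?-i:...)</code> to turn an option off).

```python
import re

text = "first line\nSecond line"

# Dot also matches line-feed
dot_default = re.search(r"first.+Second", text)          # no match
dot_all = re.search(r"first.+Second", text, re.DOTALL)   # match

# Multi-line: ^ also matches just after a line terminator
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Ignore case differences
case_ignored = re.fullmatch(r"abc", "Abc", re.IGNORECASE)

# Overriding the option for one pattern (Python's scoped syntax)
case_restored = re.fullmatch(r"(?-i:abc)", "Abc", re.IGNORECASE)
```

Without the multi-line option only <code>first</code> is found at a line start; with it, both <code>first</code> and <code>Second</code> are found. The last expression does not match because the inline construct switches case-insensitivity off again for that pattern.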

===Options Tab===

====Localization directives====

<cite>Use localization directives when they are present</cite> — Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored.

<cite>Extract items outside the scope of localization directives</cite> — Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract outside localization directives allows you to mark up fewer parts of the source document. This option is enabled only when the <cite>Use localization directives when they are present</cite> option is set.

====Strings====

<cite>Beginning of string</cite> — Enter the character specifying the start of a string. Entering several characters defines several ways to start a string.

<cite>End of string</cite> — Enter the character specifying the end of a string. If you have defined several beginning characters, you must define an equal number of end characters, and the position of each end character must correspond to the position of its corresponding beginning character.

<cite>Escaped characters use back-slash prefix</cite> — Set this option if the way to escape a character is to have a back-slash prefix (e.g. <code>\"</code>).

<cite>Escaped characters are doubled</cite> — Set this option if the way to escape a character is to double it (e.g. <code>""</code>).
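
The Python sketch below shows one way these four settings could interact when strings are located inside a source group. It is a simplified illustration under stated assumptions, not the filter's code: the function <code>extract_strings</code> and its parameters are hypothetical, escape sequences are kept as-is in the result, and an unterminated string is assumed to run to the end of the text.

```python
def extract_strings(text, starts='"', ends='"', backslash=True, doubled=False):
    """Return the strings found in text, without their delimiters."""
    if len(starts) != len(ends):
        raise ValueError("starts and ends must have the same length")
    found = []
    i = 0
    while i < len(text):
        k = starts.find(text[i])
        if k < 0:
            i += 1
            continue
        end_char = ends[k]  # the end character at the same position
        j = i + 1
        while j < len(text):
            if backslash and text[j] == "\\":
                j += 2  # skip the back-slash and the escaped character
            elif text[j] == end_char:
                if doubled and text[j + 1:j + 2] == end_char:
                    j += 2  # a doubled delimiter stays inside the string
                else:
                    break  # real end of the string
            else:
                j += 1
        # Assumption: an unterminated string runs to the end of the text.
        found.append(text[i + 1:j])
        i = j + 1
    return found

# Two ways to delimit a string: "..." and <...>, with back-slash escapes.
mixed = extract_strings('say "a \\"b\\" c" and <d>', starts='"<', ends='">')

# Doubled-character escapes instead of back-slash escapes.
sql_like = extract_strings('x = "He said ""hi"""', backslash=False, doubled=True)
```

The first call pairs <code>"</code> with <code>"</code> and <code><</code> with <code>></code> because the characters occupy the same positions in the two settings.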

====Content type====

<cite>MIME type of the document</cite> — Enter the MIME type value to use when extracting content with these parameters. The value is used to identify the type of document. It may also change the way the text is written back into the original format. Most of the time <code>text/plain</code> should be fine.

==Limitations==

* The whole document is loaded in memory to apply the regular expressions. This may cause issues with very large documents.
* The option <cite>Extract strings outside the rules</cite> is not yet implemented.

[[Category:Filters]]