Okapi Framework - User contributions [en]

Filters

2025-11-14T09:42:57Z

Dkonovalyenko: /* Code Simplification Rules */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[EPUB Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[Message Format Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[WSXZ Package Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly.
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| EPUB || .epub || <code>okf_epub</code> || [[EPUB Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| TeX files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| WSXZ Package Filter || .wsxz || <code>okf_wsxzpackage</code> || [[WSXZ Package Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|- valign="top"
| Message Format (ICU Message Format Filter) || Any container format that supports subfilters || <code>okf_messageformat</code> || [[Message Format Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

Code simplification can happen at two levels: in the filter, or in a step (the [[Inline Codes Simplifier Step]] or the [[Post-segmentation Inline Codes Removal Step]]). The extraction pipeline can be set up in three ways.

Firstly, the extraction pipeline can contain just:
: - [[Raw Document to Filter Events Step]]

At the moment, only the [[IDML Filter]], the [[XML Filter]] and the [[Simplification Filter]] support this. Note that the last one acts as a wrapper around another filter.

Secondly, the extraction pipeline can look like this:
: - [[Raw Document to Filter Events Step]]
: - [[Inline Codes Simplifier Step]]

This is the only option for filters that do not support their own code simplification. Use it with care, because the final merge may not always handle the result correctly. The [[IDML Filter]] and [[XML Filter]] mentioned above can perform their own simplification, so adding the [[Inline Codes Simplifier Step]] should not affect the events they produce.

Thirdly, the extraction pipeline can consist of:
: - [[Raw Document to Filter Events Step]]
: - [[Segmentation Step]]
: - [[Post-segmentation Inline Codes Removal Step]]

Here, the [[Post-segmentation Inline Codes Removal Step]] performs code simplification after segmentation rules are applied, and it may be useful for skipping extra codes between segments.

By default, the [[Inline Codes Simplifier Step]] and [[Post-segmentation Inline Codes Removal Step]] maximise the trimming and merging (aka simplification) of inline codes. This can be tuned via the following string parameters:
: - <code>removeLeadingTrailingCodes</code> - <code>true</code> by default
: - <code>mergeCodes</code> - <code>true</code> by default
: - <code>rules</code> - empty by default

Only the configuration of the [[Inline Codes Simplifier Step]] can be overridden at the filter level, via the following optional filter parameters:
: - <code>moveLeadingAndTrailingCodesToSkeleton</code> - maps to <code>removeLeadingTrailingCodes</code>
: - <code>mergeAdjacentCodes</code> - maps to <code>mergeCodes</code>
: - <code>simplifierRules</code> - maps to <code>rules</code>
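
The override described above can be pictured with a small Python sketch. It is only an illustration, not Okapi code: the function name <code>effective_step_config</code> and the dictionary layout are hypothetical, and the precedence shown (a filter parameter wins only when it is present) is an assumption based on the description above.

```python
# Hypothetical model of the parameter mapping; not part of the Okapi API.
FILTER_TO_STEP = {
    "moveLeadingAndTrailingCodesToSkeleton": "removeLeadingTrailingCodes",
    "mergeAdjacentCodes": "mergeCodes",
    "simplifierRules": "rules",
}

# Step defaults as listed in this section.
STEP_DEFAULTS = {"removeLeadingTrailingCodes": True, "mergeCodes": True, "rules": ""}


def effective_step_config(step_params, filter_params):
    """Return the step configuration after applying optional filter overrides."""
    config = {**STEP_DEFAULTS, **step_params}
    for filter_name, step_name in FILTER_TO_STEP.items():
        # Assumption: only filter parameters that are actually present override.
        if filter_name in filter_params:
            config[step_name] = filter_params[filter_name]
    return config
```

For example, <code>effective_step_config({}, {"mergeAdjacentCodes": False})</code> keeps the trimming default but turns merging off.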

The simplification rules let you prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details, see the JavaCC grammar: <code>../okapi/core/src/main/javacc/SimplifierRules.jj</code>

===Rule Examples===

If a code has any of these flags, do not simplify it:

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

"=" is a string match. TAGTYPE matches one of the basic tag types (opening, closing or standalone):

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

"~" is regex match

<pre>if DATA ~ "a.*";</pre>

You can negate any of the match operators. Here, do not simplify if the DATA does not match the regex:

<pre>if DATA !~ "a.*";</pre>

Match on type, linebreak in this case, don't simplify

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be recursive (supports embedded parens)

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>
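
To make the "rules are OR'ed" behaviour concrete, here is a minimal Python model of three of the examples above. It is a sketch only: the <code>Code</code> class and <code>is_protected</code> function are hypothetical stand-ins, not the Okapi classes or the real rule parser, and regex rules are left out because their exact matching semantics are not stated here.

```python
from dataclasses import dataclass, field


@dataclass
class Code:
    """Hypothetical stand-in for an inline code (not the Okapi class)."""
    data: str
    type: str = ""
    tag_type: str = "STANDALONE"  # OPENING, CLOSING or STANDALONE
    flags: set = field(default_factory=set)  # e.g. {"DELETABLE"}


# Each rule is modelled as a predicate over a code.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c.flags & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if DATA = "a" and TAGTYPE = OPENING;
    lambda c: c.data == "a" and c.tag_type == "OPENING",
    # if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";
    lambda c: c.type in ("bold", "italic", "underline"),
]


def is_protected(code, rules=RULES):
    """A code is not simplified as soon as any single rule matches."""
    return any(rule(code) for rule in rules)
```

A code with <code>type="bold"</code> is protected by the third rule alone, while a code that matches none of the rules remains eligible for trimming and merging.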

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules moveLeadingAndTrailingCodesToSkeleton="yes" mergeAdjacentCodes="yes">
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. At the moment it is supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.

The following font mapping configuration options are available:
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale.
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale.
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order they are stated, and the final target font is determined by sequentially substituting the source font values. For example, given two mappings:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
the first mapping replaces <code>Arial</code> with <code>Times New Roman</code>, and the second is then applied to this new value, so the final font is <code>Sans Serif</code>.
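
The sequential substitution can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the filters' implementation: <code>map_font</code> is a hypothetical helper, locale patterns are left out for brevity, and the source font pattern is assumed to match the whole font name.

```python
import re

# (source font pattern, target font), in the order they are configured.
MAPPINGS = [
    ("Arial", "Times New Roman"),
    ("Times New Roman", "Sans Serif"),
]


def map_font(font, mappings=MAPPINGS):
    """Apply each mapping in order, feeding the result into the next one."""
    for source_pattern, target_font in mappings:
        if not target_font:
            continue  # a mapping with an empty target font is ignored
        # Assumption: the pattern has to match the whole font name.
        if re.fullmatch(source_pattern, font):
            font = target_font
    return font
```

With these two mappings, <code>map_font("Arial")</code> passes through <code>Times New Roman</code> and ends up as <code>Sans Serif</code>, while a font that matches neither pattern is returned unchanged.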

The parameters serialisation format can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>

When source locale, target locale and source font are omitted:

<pre>
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>

This is equivalent to:

<pre>
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>
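
The equivalence of the two forms can be checked with a small Python sketch that reads the serialised parameters and fills in <code>.*</code> for omitted patterns. The parser below is hypothetical and much simpler than a real parameters reader; only the key names are taken from the examples above.

```python
def parse_font_mappings(text):
    """Parse 'fontMappings.N.key=value' lines into a list of mapping dicts."""
    props = dict(
        line.split("=", 1) for line in text.splitlines() if "=" in line
    )
    count = int(props.get("fontMappings.number.i", 0))
    mappings = []
    for i in range(count):
        prefix = f"fontMappings.{i}."
        mappings.append({
            # Omitted patterns fall back to ".*", i.e. "match anything".
            "sourceLocalePattern": props.get(prefix + "sourceLocalePattern", ".*"),
            "targetLocalePattern": props.get(prefix + "targetLocalePattern", ".*"),
            "sourceFontPattern": props.get(prefix + "sourceFontPattern", ".*"),
            "targetFont": props.get(prefix + "targetFont", ""),
        })
    return mappings
```

Parsing the short form and the fully spelled-out form with this helper yields the same list of mappings.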

[[Category:Filters]]

Post-segmentation Inline Codes Removal Step

2025-11-14T09:37:32Z

Dkonovalyenko: /* Parameters */

{{Steps Header}}
__TOC__
==Overview==

This step attempts to simplify (trim and merge) as many inline codes as possible by looking at each linguistically distinct segment in a TextUnit.

'''The step must be run after segmentation.''' It joins adjacent inline codes inside segments, and optionally moves leading and trailing codes from the segment to an inter-segment Textpart. Original (un-merged) codes are saved as okp:merged attributes inside the generated XLIFF file. Trimmed codes are simply written outside the "mrk" elements.

Takes: Filter Events. Sends: Filter Events.

==Parameters==

<cite>Remove leading and trailing codes</cite> — Set this option to remove leading and trailing inline codes from the text units and place them outside the segment.

<cite>Merge codes</cite> — Set this option to merge adjacent inline codes in the text units.

==Limitations==

Currently, bilingual formats such as XLIFF, TMX and TTX do not have their codes simplified, because the codes may differ between source and target: codes must align by ID across source and target.

[[Category:Steps]]

Inline Codes Simplifier Step

2025-11-14T09:35:53Z

Dkonovalyenko: /* Parameters */

{{Steps Header}}
__TOC__
==Overview==

This step joins adjacent inline codes in text units, and optionally moves leading and trailing codes from the text unit to the skeleton.

Takes: Filter Events. Sends: Filter Events.

Only the source codes are affected by this step, so it should be used before any target is created in the text unit.
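
The two operations can be illustrated with a deliberately simplified Python model. It is not the step's implementation: the token representation and the <code>simplify</code> function are hypothetical, the parameter names only echo the two options of this step, and the sketch ignores the pairing of opening and closing codes as well as any simplifier rules.

```python
def simplify(tokens, merge_codes=True, remove_leading_trailing=True):
    """Return (leading_codes, tokens, trailing_codes) after simplification.

    Each token is a (kind, value) pair where kind is "text" or "code".
    """
    tokens = list(tokens)
    leading, trailing = [], []
    if remove_leading_trailing:
        # Move codes at both ends out of the content (to the skeleton).
        while tokens and tokens[0][0] == "code":
            leading.append(tokens.pop(0))
        while tokens and tokens[-1][0] == "code":
            trailing.insert(0, tokens.pop())
    if merge_codes:
        merged = []
        for kind, value in tokens:
            if kind == "code" and merged and merged[-1][0] == "code":
                # Join this code with the adjacent code before it.
                merged[-1] = ("code", merged[-1][1] + value)
            else:
                merged.append((kind, value))
        tokens = merged
    return leading, tokens, trailing
```

Given a code, some text, two adjacent codes, more text and a final code, the sketch returns the first and last codes separately and leaves a single merged code between the two text runs.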

==Parameters==

<cite>Remove leading and trailing codes</cite> — Set this option to remove leading and trailing inline codes from the text units and place them into the skeleton.

<cite>Merge codes</cite> — Set this option to merge adjacent inline codes in the text units.

==Limitations==

None known.

[[Category:Steps]]

Filters

2025-09-23T13:48:35Z

Dkonovalyenko: /* Code Simplification Rules */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[EPUB Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[Message Format Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[WSXZ Package Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly.
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| EPUB || .epub || <code>okf_epub</code> || [[EPUB Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| TeX files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| WSXZ Package Filter || .wsxz || <code>okf_wsxzpackage</code> || [[WSXZ Package Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|- valign="top"
| Message Format (ICU Message Format Filter) || Any container format that supports subfilters || <code>okf_messageformat</code> || [[Message Format Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

Code simplification is processed at two levels: filter and step. Where possible, prefer the filter level, because the final merge may not always correctly handle code simplification applied at the step level.

By default, the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this is not desired: setting the boolean parameter <code>moveLeadingAndTrailingCodesToSkeleton</code> to <code>false</code> turns off the trimming, and setting <code>mergeAdjacentCodes</code> to <code>false</code> turns off the merging. Both parameters are optional for filters: if they are not specified at the filter level, the step handles them.

Besides that, the simplification rules allow you to prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi/core/src/main/javacc/SimplifierRules.jj</code>

===Rule Examples===

If Code has any of these flags then don't simplify

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

"=" is string match
Match basic TAGTYPE opening, closing or standalone

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

"~" is regex match

<pre>if DATA ~ "a.*";</pre>

You can negate any of the match operators
Don't simplify if the DATA does not match the regex

<pre>if DATA !~ "a.*";</pre>

Match on type, linebreak in this case, don't simplify

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be recursive (supports embedded parens)

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules moveLeadingAndTrailingCodesToSkeleton="yes" mergeAdjacentCodes="yes">
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.

The following font mapping configuration options are available:
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale.
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale.
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order they are stated, and the final target font value is determined by sequential substitution of the source font values. For example, if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping produces <code>Times New Roman</code>, and the second one is applied to this new value, ending up with <code>Sans Serif</code>.
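The sequential substitution can be sketched as follows. This is a minimal Python illustration, not the filter's code: the <code>FontMapping</code> class and <code>resolve_font</code> function are hypothetical names, and the sketch assumes that each pattern has to match the whole locale or font name.

```python
import re
from dataclasses import dataclass


@dataclass
class FontMapping:
    """Hypothetical container for one mapping; omitted patterns default to '.*'."""
    target_font: str
    source_locale_pattern: str = ".*"
    target_locale_pattern: str = ".*"
    source_font_pattern: str = ".*"


def resolve_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous one."""
    current = font
    for m in mappings:
        if not m.target_font:
            continue  # an empty target font means the mapping is ignored
        if (re.fullmatch(m.source_locale_pattern, source_locale)
                and re.fullmatch(m.target_locale_pattern, target_locale)
                and re.fullmatch(m.source_font_pattern, current)):
            current = m.target_font
    return current


mappings = [
    FontMapping(source_font_pattern="Arial", target_font="Times New Roman"),
    FontMapping(source_font_pattern="Times New Roman", target_font="Sans Serif"),
]
print(resolve_font("Arial", "en-US", "ru-RU", mappings))  # Sans Serif
```

Because the second mapping is tested against the already substituted value, the order of the mappings matters: swapping the two entries would leave <code>Arial</code> mapped to <code>Times New Roman</code>.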

The serialised parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>

When source locale, target locale and source font are omitted:

<pre>
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>

This is equivalent to the following explicit form:

<pre>
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>
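The equivalence of the short and the explicit form can be checked with a small Python sketch. The parser below is a hypothetical helper written for this illustration (it is not the Okapi <code>ParametersString</code> implementation); it only assumes the <code>key=value</code> layout shown above and fills omitted patterns with <code>.*</code>.

```python
def parse_font_mappings(text):
    """Read 'fontMappings.*' entries from key=value lines (hypothetical helper)."""
    params = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines such as '#v1'
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()

    count = int(params.get("fontMappings.number.i", "0"))
    mappings = []
    for i in range(count):
        prefix = f"fontMappings.{i}."
        mappings.append({
            # Omitted patterns apply to any locale or font name.
            "sourceLocalePattern": params.get(prefix + "sourceLocalePattern", ".*"),
            "targetLocalePattern": params.get(prefix + "targetLocalePattern", ".*"),
            "sourceFontPattern": params.get(prefix + "sourceFontPattern", ".*"),
            "targetFont": params.get(prefix + "targetFont", ""),
        })
    return mappings


SHORT = """
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

EXPLICIT = """
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

print(parse_font_mappings(SHORT) == parse_font_mappings(EXPLICIT))  # True
```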

[[Category:Filters]]

OpenXML Filter

2025-09-19T18:06:54Z

Dkonovalyenko: /* PowerPoint Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, it exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, it exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked (available under "Clean Tags Aggressively"), the whitespace character styles (formatting) are ignored and considered equal to the adjacent ones. Default: off.
; Preserve ASCII and HighAnsi Font Categories On Detection
: When checked, the mentioned run font categories are preserved when consecutive runs are merged. Default: off.
; Remove Embedded Excel Package
: When checked and either cached chart strings or numbers are also set for extraction, the embedded Excel package is removed, and any references to it in chart parts and related relationships are removed as well. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
* If the switch is set to "Include", only text in the specified colors will be extracted for translation.
* If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.

Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.

Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.

Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.
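The effect of the Include/Exclude switch can be summarised with a short Python sketch. The function name and the colour names are hypothetical and only mirror the behaviour described above; the sketch does not model the interaction with the hidden-text options.

```python
def is_extracted(highlight, mode="Exclude", colors=frozenset()):
    """Decide whether a run of text is extracted for translation.

    highlight: the highlight colour of the text, or None if it has none.
    mode:      "Include" or "Exclude" (the radio switch).
    colors:    the set of selected highlight colours.
    """
    if mode == "Include":
        # Only text highlighted in one of the selected colours is extracted.
        return highlight in colors
    # "Exclude": everything except text in the selected colours is extracted.
    return highlight not in colors


# Default settings: "Exclude" with no colours selected, so everything is extracted.
print(is_extracted("yellow"), is_extracted(None))
# "Include" with yellow selected: only yellow text is extracted.
print(is_extracted("yellow", "Include", {"yellow"}), is_extracted(None, "Include", {"yellow"}))
```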

; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
:* The named colors available in the UI correspond to the standard color palette of Excel 2010.
:* The configuration itself also supports colors specified as RGB in the format <code>RRGGBB</code>, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the <code>tsExcelExcludedColors</code> section of the configuration file like this:
<pre>
tsExcelExcludedColors.i=1
ccc0=69b3e7
</pre>
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Extract Worksheets Explicitly Specified
: When checked, only worksheets whose names match a name pattern in the Worksheet Configurations are exposed for extraction. Default: off.
; Extract Cells Explicitly Specified
: When checked, only cells specified in the Worksheet Configurations are exposed for extraction. The explicitly mentioned source and target columns are eligible for such handling. Default: off.
; Worksheet Configurations
: The list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration, it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 !! Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 !! Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
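The first configuration can be traced with a small Python sketch. It is an illustration only: the dictionary layout and the <code>extract</code> function are hypothetical, the name pattern is assumed to match the whole worksheet name, and the sketch does not model worksheets without a matching configuration.

```python
import re

# The example table, keyed by row number and column letter.
TABLE = {
    1: {"A": "Metadata Header A1", "C": "Metadata Header C1"},
    2: {"A": "Metadata Header A2", "B": "Metadata Header B2",
        "C": "Metadata Header C2", "D": "Metadata Header D2"},
    3: {"A": "A3", "B": "B3", "C": "C3", "D": "Metadata D3"},
    4: {"A": "A4", "B": "B4", "C": "C4", "D": "Metadata D4"},
    5: {"A": "A5", "B": "B5", "C": "C5", "D": "Metadata D5"},
}

# Mirrors the worksheetConfigurations.0.* parameters shown above.
CONFIG = {
    "namePattern": "Sheet1",
    "sourceColumns": ["A"],
    "targetColumns": ["B"],
    "excludedRows": {1, 2},
    "excludedColumns": {"C", "D"},
}


def extract(sheet_name, table, config):
    """Return {target cell reference: source text} for the configured columns."""
    if not re.fullmatch(config["namePattern"], sheet_name):
        return {}  # not modelled: sheets that no configuration applies to
    units = {}
    for row in sorted(table):
        if row in config["excludedRows"]:
            continue
        for src, tgt in zip(config["sourceColumns"], config["targetColumns"]):
            if src in config["excludedColumns"] or src not in table[row]:
                continue
            # The source text comes from column A, the unit is named after column B.
            units[f"{sheet_name}!{tgt}{row}"] = table[row][src]
    return units


print(extract("Sheet1", TABLE, CONFIG))
```

The keys of the result (<code>Sheet1!B3</code>, <code>Sheet1!B4</code>, <code>Sheet1!B5</code>) correspond to the <code>resname</code> values of the translation units in the XLIFF above, and the values to their source text.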

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
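The column lists in the Worksheet Configurations above are given as ALPHA-26 numbers. Assuming this means the usual spreadsheet column letters (A to Z, then AA, AB and so on), the conversion to and from 1-based column indexes can be sketched in Python as follows; the function names are hypothetical.

```python
def column_to_index(name):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in name.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not a column name: {name!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    if index == 0:
        raise ValueError("empty column name")
    return index


def index_to_column(index):
    """Convert a 1-based index back to column letters: 27 -> AA."""
    if index < 1:
        raise ValueError("column index must be 1 or greater")
    letters = []
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


def parse_columns(value):
    """Parse a comma-separated list such as 'C,D' into column indexes."""
    return [column_to_index(part.strip()) for part in value.split(",") if part.strip()]


print(parse_columns("C,D"))   # [3, 4]
print(index_to_column(27))    # AA
```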

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, exposes slide masters and notes masters for translation. This also exposes for translation the content of layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Translate Cached Chart Strings
: When checked, the cached chart strings are exposed for translation. Default: off.
; Translate Cached Chart Numbers
: When checked, the cached chart numbers and format codes are exposed for translation. Default: off.
; Excluded/Included Highlight Colors
: Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the docs in [[#Word Options]].

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-09-19T18:04:43Z

Dkonovalyenko: /* General Options */

OpenXML Filter

2025-09-19T18:04:01Z

Dkonovalyenko: /* General Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, it exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, it exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Sub-option of "Clean Tags Aggressively". When checked, the styles (formatting) of whitespace characters are ignored and treated as equal to the styles of the adjacent runs. Default: off.
; Preserve ASCII and HighAnsi Font Categories On Detection
: When checked, the ASCII and HighAnsi run font categories are preserved when consecutive runs are merged. Default: off.
; Remove Embedded Excel Package
: When checked and either cached chart strings or numbers are also set for extraction, the embedded Excel package is removed, and any references to it in chart parts and related relationships are removed as well. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours are ignored starting from the specified value. The field can be left empty (treated as white by default) or contain a preset colour name or an RGB hex string; for example, <code>black</code>, <code>Black</code> and <code>000000</code> all set the threshold to black. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours are ignored up to the specified value. The field can be left empty (treated as white by default) or contain a preset colour name or an RGB hex string; for example, <code>white</code>, <code>White</code> and <code>FFFFFF</code> all set the threshold to white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
* If the switch is set to "Include", only text in the specified colors will be extracted for translation.
* If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.

Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.

Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.

Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.

; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
:* The named colors available in the UI correspond to the standard color palette of Excel 2010.
:* The configuration itself also supports colors specified as RGB in the format <code>RRGGBB</code>, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the <code>tsExcelExcludedColors</code> section of the configuration file like this:
<pre>
tsExcelExcludedColors.i=1
ccc0=69b3e7
</pre>
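When several colors have to be excluded, the fragment can be generated rather than typed by hand. The following is only a minimal sketch: the function name <code>excluded_colors_fragment</code> is hypothetical, and it assumes that additional entries continue the <code>ccc&lt;index&gt;</code> key pattern of the single-entry example above, which should be checked against a configuration saved from the UI.

```python
import re


def excluded_colors_fragment(colors):
    """Render RRGGBB colors as a tsExcelExcludedColors fragment.

    Assumption: entries after the first one follow the same ccc<index>
    key pattern as the single-entry example in the documentation.
    """
    values = []
    for color in colors:
        value = color.lstrip("#")
        # Accept only six hexadecimal digits (the documented RRGGBB format).
        if not re.fullmatch(r"[0-9A-Fa-f]{6}", value):
            raise ValueError(f"not an RRGGBB color: {color!r}")
        values.append(value)
    lines = [f"tsExcelExcludedColors.i={len(values)}"]
    lines += [f"ccc{index}={value}" for index, value in enumerate(values)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(excluded_colors_fragment(["#69b3e7"]))
```

Called with <code>["#69b3e7"]</code>, the sketch reproduces the two lines of the example above.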
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Extract Worksheets Explicitly Specified
: When checked, only worksheets whose names match a name pattern in the Worksheet Configurations are extracted. Default: off.
; Extract Cells Explicitly Specified
: When checked, only cells specified in the Worksheet Configurations are exposed for extraction. The explicitly mentioned source and target columns are eligible for such handling. Default: off.
; Worksheet Configurations
: A list of configurations that, for each worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: Each configuration can specify:
:* Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration apply to the matching worksheets. For the pattern syntax, refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
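
The effect of the first example can be modelled in a few lines. This is only an illustrative sketch of the documented behaviour, not the filter's implementation: the helper names <code>column_index</code> and <code>merge_translations</code> are hypothetical, the worksheet is represented as a plain list of rows, and the "translation" is simulated by appending <code>-tr</code>.

```python
def column_index(letters):
    """Convert an ALPHA-26 column name (A, B, ..., Z, AA, ...) to a 0-based index."""
    if not letters:
        raise ValueError("empty column name")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column name: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def merge_translations(rows, source, target, excluded_rows, translate):
    """Write the translation of each source cell into the target column.

    Row numbers in excluded_rows are 1-based, as in the configuration.
    """
    src, tgt = column_index(source), column_index(target)
    merged = [list(row) for row in rows]  # leave the input untouched
    for number, row in enumerate(merged, start=1):
        if number in excluded_rows:
            continue
        row[tgt] = translate(row[src])
    return merged


sheet = [
    ["Metadata Header A1", "", "Metadata Header C1", ""],
    ["Metadata Header A2", "Metadata Header B2", "Metadata Header C2", "Metadata Header D2"],
    ["A3", "B3", "C3", "Metadata D3"],
    ["A4", "B4", "C4", "Metadata D4"],
    ["A5", "B5", "C5", "Metadata D5"],
]

merged = merge_translations(sheet, "A", "B", {1, 2}, lambda text: text + "-tr")

if __name__ == "__main__":
    for row in merged:
        print(row)
```

Rows 1 and 2 are left as they are, and column B of rows 3 to 5 receives the simulated translation of column A, which matches the merged table above.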

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in the columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
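
A downstream tool can pair each text unit with the metadata of its row by reading the <code>context-group</code> elements. The sketch below uses Python's standard <code>xml.etree.ElementTree</code> on a shortened copy of the fragment above. The function name <code>units_with_row_metadata</code> is hypothetical, and the fragment carries no namespace; a complete XLIFF file normally declares one, in which case the tag names in the paths would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the extraction shown above (row 3 only).
FRAGMENT = """
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
</group>
"""


def units_with_row_metadata(xml_text):
    """Return (resname, source text, row metadata) for every trans-unit."""
    sheet = ET.fromstring(xml_text)
    result = []
    for row in sheet.findall("group"):
        # Map each context-type to its value for the current row group.
        metadata = {
            context.get("context-type"): context.text
            for context in row.findall("context-group[@name='row-metadata']/context")
        }
        for unit in row.findall("trans-unit"):
            result.append((unit.get("resname"), unit.findtext("source"), metadata))
    return result


if __name__ == "__main__":
    for resname, source, metadata in units_with_row_metadata(FRAGMENT):
        print(resname, source, metadata)
```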

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is also checked under '''General Options''' (''the two will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is also checked under '''General Options''' (''the two will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, slide masters and notes masters are exposed for translation. Content from layouts that are currently in use by at least one slide is exposed as well. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Translate Cached Chart Strings
: When checked, the cached chart strings are exposed for translation. Default: off.
; Excluded/Included Highlight Colors
: Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the docs in [[#Word Options]].

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-09-18T03:05:50Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Sub-option of "Clean Tags Aggressively". When checked, the styles (formatting) of whitespace characters are ignored and treated as equal to the styles of the adjacent runs. Default: off.
; Preserve ASCII and HighAnsi Font Categories On Detection
: When checked, the ASCII and HighAnsi run font categories are preserved when consecutive runs are merged. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours are ignored starting from the specified value. The field can be left empty (treated as white by default) or contain a preset colour name or an RGB hex string; for example, <code>black</code>, <code>Black</code> and <code>000000</code> all set the threshold to black. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours are ignored up to the specified value. The field can be left empty (treated as white by default) or contain a preset colour name or an RGB hex string; for example, <code>white</code>, <code>White</code> and <code>FFFFFF</code> all set the threshold to white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
* If the switch is set to "Include", only text in the specified colors will be extracted for translation.
* If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.

Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.

Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.

Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.

; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
:* The named colors available in the UI correspond to the standard color palette of Excel 2010.
:* The configuration itself also supports colors specified as RGB in the format <code>RRGGBB</code>, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the <code>tsExcelExcludedColors</code> section of the configuration file like this:
<pre>
tsExcelExcludedColors.i=1
ccc0=69b3e7
</pre>
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Extract Worksheets Explicitly Specified
: When checked, only worksheets whose names match a name pattern in the Worksheet Configurations are extracted. Default: off.
; Extract Cells Explicitly Specified
: When checked, only cells specified in the Worksheet Configurations are exposed for extraction. The explicitly mentioned source and target columns are eligible for such handling. Default: off.
; Worksheet Configurations
: A list of configurations that, for each worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: Each configuration can specify:
:* Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration apply to the matching worksheets. For the pattern syntax, refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in the columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is also checked under '''General Options''' (''the two will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is also checked under '''General Options''' (''the two will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, slide masters and notes masters are exposed for translation. Content from layouts that are currently in use by at least one slide is exposed as well. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Translate Cached Chart Strings
: When checked, the cached chart strings are exposed for translation. Default: off.
; Excluded/Included Highlight Colors
: Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the docs in [[#Word Options]].

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-09-18T03:04:21Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Sub-option of "Clean Tags Aggressively". When checked, the styles (formatting) of whitespace characters are ignored and treated as equal to the styles of the adjacent runs. Default: off.
; Preserve ASCII and HighAnsi Font Categories On Detection
: When checked, the ASCII and HighAnsi run font categories are preserved when consecutive runs are merged. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string, e.g. black, Black or 000000, which all set the threshold to black. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string, e.g. white, White or FFFFFF, which all set the threshold to white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
:* If the switch is set to "Include", only text in the specified colors will be extracted for translation.
:* If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.
: Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.
: Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.
: Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.

; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
:* The named colors available in the UI correspond to the standard color palette of Excel 2010.
:* The configuration itself also supports colors specified as RGB in the format <code>RRGGBB</code>, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the <code>tsExcelExcludedColors</code> section of the configuration file like this:
<pre>
tsExcelExcludedColors.i=1
ccc0=69b3e7
</pre>
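When editing the .fprm file by hand, it can help to generate the section from a list of colours. The following is only a minimal sketch and not part of Okapi: the helper names are hypothetical, and the assumption that further entries continue as <code>ccc1</code>, <code>ccc2</code> and so on is extrapolated from the single-entry example above.

```python
import re

# Hypothetical helpers, not part of Okapi.
_RRGGBB = re.compile(r"[0-9A-Fa-f]{6}")


def to_rrggbb(colour):
    """Return the six hex digits of a colour, without a leading '#'."""
    value = colour.strip().lstrip("#")
    if not _RRGGBB.fullmatch(value):
        raise ValueError(f"not an RRGGBB colour: {colour!r}")
    return value


def excluded_colours_section(colours):
    """Build the lines of a tsExcelExcludedColors section.

    Assumption: entries are keyed ccc0, ccc1, ... in list order.
    """
    values = [to_rrggbb(c) for c in colours]
    lines = [f"tsExcelExcludedColors.i={len(values)}"]
    lines += [f"ccc{i}={v}" for i, v in enumerate(values)]
    return lines


if __name__ == "__main__":
    print("\n".join(excluded_colours_section(["#69b3e7"])))
```

Always check the generated lines against a configuration saved from the UI before relying on them.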
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Extract Worksheets Explicitly Specified
: When checked, only worksheets whose names match a Name Pattern in the Worksheet Configurations are exposed for extraction. Default: off.
; Extract Cells Explicitly Specified
: When checked, only cells specified in the Worksheet Configurations are exposed for extraction. The explicitly mentioned source and target columns are eligible for such handling. Default: off.
; Worksheet Configurations
: A list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration, it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
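The column lists above use letters rather than numeric indexes. As an illustration only, assuming "ALPHA-26" means Excel-style column letters (A to Z, then AA, AB, and so on), a list such as <code>A,B</code> can be converted to 1-based column numbers like this. The function names are hypothetical and do not come from Okapi.

```python
def column_to_index(name):
    """Convert a column name such as 'A' or 'AB' to a 1-based index."""
    if not name or not name.isascii() or not name.isalpha():
        raise ValueError(f"not a column name: {name!r}")
    index = 0
    for ch in name.upper():
        # Each letter is a digit in the range 1..26 (there is no zero digit).
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_columns(value):
    """Split a comma-separated column list and convert every entry."""
    return [column_to_index(part.strip()) for part in value.split(",") if part.strip()]


if __name__ == "__main__":
    print(parse_columns("A,B"))
    print(parse_columns("C,D"))
```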
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
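To inspect such a fragment outside Okapi, the lines can be grouped by configuration index. This is a rough sketch that only handles the <code>worksheetConfigurations.N.property=value</code> lines shown above; it is not Okapi's own parser, and the function name is hypothetical.

```python
SAMPLE = """\
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
"""


def parse_worksheet_configurations(text):
    """Group 'worksheetConfigurations.N.property=value' lines by index N."""
    configs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".")
        # Expected shape: worksheetConfigurations.<index>.<property>
        if len(parts) != 3 or parts[0] != "worksheetConfigurations":
            continue
        if not parts[1].isdigit():
            continue  # skips the 'worksheetConfigurations.number.i' counter line
        configs.setdefault(int(parts[1]), {})[parts[2]] = value
    return configs


if __name__ == "__main__":
    for index, config in parse_worksheet_configurations(SAMPLE).items():
        print(index, config)
```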
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
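In this output the <code>resname</code> of each trans-unit names the target cell (for example <code>Sheet1!B3</code>), while the source text comes from column A. A minimal sketch for listing the translation per target cell, using a shortened copy of the fragment, could look as follows. It assumes a namespace-free fragment as printed here; elements in a complete XLIFF 1.2 file normally carry the XLIFF namespace, so the lookups would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the fragment shown above.
FRAGMENT = """
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
</group>
"""


def targets_by_cell(fragment):
    """Map each trans-unit resname (the target cell) to its target text."""
    root = ET.fromstring(fragment.strip())
    result = {}
    for unit in root.iter("trans-unit"):
        target = unit.find("target")
        result[unit.get("resname")] = target.text if target is not None else None
    return result


if __name__ == "__main__":
    print(targets_by_cell(FRAGMENT))
```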
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in the columns, and we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, slide masters and notes masters are exposed for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Translate Cached Chart Strings
: When checked, the cached chart strings are exposed for translation. Default: off.
; Excluded/Included Highlight Colors
: Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the docs in [[#Word Options]].

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

IDML Filter

2025-08-25T20:38:57Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread, and for each spread by story in the order they appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Special character pattern</cite> — (default is "<code> | | | | | | | | | ||‌||‑|</code>"). A matched content is treated as inline code.

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Merge Adjacent Codes</cite> — Set this option to merge inline adjacent codes. (default is false)

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline; otherwise, they are represented as referencing groups of text units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Extract Math Zones</cite> — (default is true). When it is set to true, the math zones are extracted for translation.

<cite>Excluded Styles</cite> — Content with the specified styles is excluded from extraction.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code> </code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2025-08-25T20:37:30Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread, and for each spread by story in the order they appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Special character pattern</cite> — (default is "<code> | | | | | | | | | ||‌||‑|</code>"). A matched content is treated as inline code.

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Merge Adjacent Codes</cite> — Set this option to merge inline adjacent codes. (default is false)

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline; otherwise, they are represented as referencing groups of text units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Extract Math Zones</cite> — (default is true). When it is set to true, the math zones are extracted for translation.

<cite>Excluded Styles</cite> — The specified styles are excluded from extraction.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code> </code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2025-08-25T20:36:00Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread, and for each spread by story in the order they appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Special character pattern</cite> — (default is "<code> | | | | | | | | | ||‌||‑|</code>"). A matched content is treated as inline code.

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Merge Adjacent Codes</cite> — Set this option to merge inline adjacent codes. (default is false)

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline; otherwise, they are represented as referencing groups of text units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Extract Math Zones</cite> — (default is true). When it is set to true, the math zones are extracted for translation.

<cite>Excluded Styles</cite> — You can add, remove or edit the styles that are excluded from extraction. The specified styles are excluded from extraction.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code> </code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

OpenXML Filter

2025-08-25T20:20:02Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents produced by Microsoft Office 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked under "Clean Tags Aggressively", the character styles (formatting) of whitespace are ignored and treated as equal to those of the consecutive runs. Default: off.
; Preserve ASCII and HighAnsi Font Categories On Detection
: When checked, the ASCII and HighAnsi run font categories are preserved when consecutive runs are merged. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string, e.g. black, Black or 000000, which all set the threshold to black. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string, e.g. white, White or FFFFFF, which all set the threshold to white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation.
:* If the switch is set to "Include", only text in the specified colors will be extracted for translation.
:* If the switch is set to "Exclude", all content except for text in the specified colors will be extracted for translation.
: Note: Text that is excluded using this mechanism will be treated as hidden; that means the "Translate Everything Hidden" options will extract it.
: Note: Starting in 1.48.0, this option also applies to content in PowerPoint files.
: Default: the switch is set to "Exclude" and no colors are selected, meaning that all visible content will be extracted for translation.

; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground or background color matching any of the selected colors in this option will be excluded from translation. Default: none.
:* The named colors available in the UI correspond to the standard color palette of Excel 2010.
:* The configuration itself also supports colors specified as RGB in the format <code>RRGGBB</code>, so specific colors not explicitly listed in the UI may be excluded by modifying the .fprm file by hand. For example, to exclude #69b3e7 (Pantone 292), you could modify the <code>tsExcelExcludedColors</code> section of the configuration file like this:
<pre>
tsExcelExcludedColors.i=1
ccc0=69b3e7
</pre>
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Extract Worksheets Explicitly Specified
: When checked, only worksheets whose names match a Name Pattern in the Worksheet Configurations are exposed for extraction. Default: off.
; Worksheet Configurations
: A list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration, it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
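: The column lists above use letters rather than numbers. Assuming that "ALPHA-26" refers to the usual spreadsheet column labels (A to Z, then AA, AB and so on), the conversion between a label and a 1-based column index can be sketched as follows. The class is hypothetical and only illustrates the numbering scheme; it is not the filter's own code.

```java
import java.util.Locale;

// Hypothetical illustration of spreadsheet-style column labels (assumed meaning of "ALPHA-26").
class Alpha26 {
    /** Converts a label such as "A", "Z" or "AA" to its 1-based column index. */
    static int toIndex(String label) {
        if (label == null || label.isEmpty()) {
            throw new IllegalArgumentException("Empty column label");
        }
        int index = 0;
        for (char c : label.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Not a column label: " + label);
            }
            // Each letter is a digit in the range 1..26; there is no zero digit.
            index = index * 26 + (c - 'A' + 1);
        }
        return index;
    }

    /** Converts a 1-based column index back to its label. */
    static String toLabel(int index) {
        if (index < 1) {
            throw new IllegalArgumentException("Index must be >= 1: " + index);
        }
        StringBuilder sb = new StringBuilder();
        while (index > 0) {
            sb.append((char) ('A' + (index - 1) % 26));
            index = (index - 1) / 26;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toIndex("A"));  // 1
        System.out.println(toIndex("AA")); // 27
        System.out.println(toLabel(27));   // AA
    }
}
```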
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
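: To inspect such settings outside of Okapi, the flat <code>key=value</code> lines can be read with <code>java.util.Properties</code>. This is only a sketch for checking values by hand: it is not how the filter parses its parameters, and the class and method names are hypothetical. Note that <code>Properties</code> treats a backslash as an escape character, so name patterns containing backslashes may be read differently than Okapi reads them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical reader for the flat key=value lines shown above (not Okapi's own parser).
class WorksheetConfigReader {
    /** Loads key=value lines into a Properties object. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns one comma-separated setting of configuration i as a list; empty if absent. */
    static List<String> list(Properties props, int i, String name) {
        String value = props.getProperty("worksheetConfigurations." + i + "." + name, "");
        return value.isEmpty() ? List.of() : Arrays.asList(value.split(","));
    }
}
```

: With the settings above, <code>list(props, 0, "excludedRows")</code> returns the two entries <code>1</code> and <code>2</code>, and <code>list(props, 0, "metadataRows")</code> returns an empty list.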
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to consider the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
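: In the output above, the <code>context-type</code> value appears to be the prefix <code>x-</code> followed by the metadata-row cells that sit above the metadata column, joined with semicolons (the merged header from row 1 and the header from row 2 for column D). This reading is inferred from the sample only, not from the filter's source. A minimal sketch of that composition, with a hypothetical class name:

```java
import java.util.List;

// Hypothetical illustration of how the context-type in the sample appears to be composed.
class RowMetadataContext {
    /** Joins the header cells of a metadata column into a context-type value. */
    static String contextType(List<String> headerCells) {
        return "x-" + String.join(";", headerCells);
    }
}
```

: Called with the two headers of column D, <code>contextType(List.of("Metadata Header C1", "Metadata Header D2"))</code> reproduces the attribute value shown in the sample.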

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Name
: When checked, @name attribute values associated with drawings and word art are exposed for translation. Default: on.
; Translate Graphic Description
: When checked, @descr attribute values associated with drawings and word art are exposed for translation. Default: off.
; Translate Cached Chart Strings
: When checked, the cached chart strings are exposed for translation. Default: off.
; Excluded/Included Highlight Colors
: Starting in 1.48.0, the "Excluded/Included Highlight Colors" option from the Word configuration also affects PowerPoint content. See the docs in [[#Word Options]].

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-03-07T16:44:57Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked under "Clean Tags Aggressively", the whitespace character styles (formatting) are ignored and considered equal to the consequential ones. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored ending by the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Extract Source And Target Columns Joined
: When checked, the source and target columns (cells in a row) are joined on extraction. Default: off.
; Worksheet Configurations
: A list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
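: Because the Name Pattern is a <code>java.util.regex.Pattern</code> expression, one configuration can cover several worksheets. The sketch below shows how a pattern selects worksheet names. It assumes the whole name must match (<code>Matcher.matches()</code>); this page does not state whether the filter uses a full or a partial match, so verify against your own files. The class name is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration of worksheet selection by name pattern (full match assumed).
class NamePatternDemo {
    /** Returns the worksheet names that the pattern matches in full. */
    static List<String> matching(String namePattern, List<String> sheetNames) {
        Pattern pattern = Pattern.compile(namePattern);
        return sheetNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sheets = List.of("Sheet1", "Sheet10", "Glossary");
        System.out.println(matching("Sheet1", sheets));     // only Sheet1 under a full match
        System.out.println(matching("Sheet\\d+", sheets));  // Sheet1 and Sheet10
    }
}
```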
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
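: The merged table can be reproduced with a small simulation: for every row that is not excluded, the translation of the source cell is written into the target cell of the same row. This is only a model of the example above, with hypothetical names and zero-based column indexes; it does not call Okapi and leaves out styles, metadata and multiple column pairs.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical model of the source-to-target column merge shown in the tables above.
class MergeSketch {
    /**
     * Copies the grid and writes translate(source cell) into the target column
     * for every row whose 1-based number is not excluded.
     */
    static String[][] merge(String[][] grid, int sourceCol, int targetCol,
                            Set<Integer> excludedRows, UnaryOperator<String> translate) {
        String[][] merged = new String[grid.length][];
        for (int r = 0; r < grid.length; r++) {
            merged[r] = grid[r].clone();
            if (!excludedRows.contains(r + 1)) {
                merged[r][targetCol] = translate.apply(grid[r][sourceCol]);
            }
        }
        return merged;
    }
}
```

: With source column A (index 0), target column B (index 1), excluded rows 1 and 2, and a stand-in translator that appends <code>-tr</code>, rows 3 to 5 end up with <code>A3-tr</code>, <code>A4-tr</code> and <code>A5-tr</code> in column B, while the two header rows stay unchanged.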

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to consider the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-02-03T16:00:11Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked under "Clean Tags Aggressively", the whitespace character styles (formatting) are ignored and considered equal to the consequential ones. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored ending by the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations that, per worksheet name pattern, exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Target Columns Max Characters - a list of decimal unsigned integers [0, 2^32]. When specified, the maxwidth and size-unit properties are attached to text units specified in the target columns. E.g.: <code>25,30</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
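: The examples for Target Columns (<code>C,D</code>) and Target Columns Max Characters (<code>25,30</code>) suggest that the two lists are paired by position, so that column C is limited to 25 characters and column D to 30. That pairing is an assumption drawn from the examples, not something this page states. Under that assumption, a check of the two lists might look like the sketch below (hypothetical class; <code>long</code> is used because the upper bound 2^32 does not fit into an <code>int</code>).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical check that pairs target columns with their character limits by position (assumed).
class TargetColumnLimits {
    private static final long MAX_LIMIT = 1L << 32; // upper bound from the range [0, 2^32]

    /** Pairs e.g. "C,D" with "25,30" into {C=25, D=30}; throws on a mismatch or bad value. */
    static Map<String, Long> pair(String targetColumns, String maxCharacters) {
        String[] columns = targetColumns.split(",");
        String[] limits = maxCharacters.split(",");
        if (columns.length != limits.length) {
            throw new IllegalArgumentException("Lists differ in length");
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            long limit = Long.parseLong(limits[i].trim());
            if (limit < 0 || limit > MAX_LIMIT) {
                throw new IllegalArgumentException("Limit out of range: " + limit);
            }
            result.put(columns[i].trim(), limit);
        }
        return result;
    }
}
```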
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to consider the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-01-15T14:33:12Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Available under <cite>Clean Tags Aggressively</cite>. When checked, the character styles (formatting) applied to whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each applied to the worksheets whose names match a pattern. A configuration can exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: Each configuration can specify:
:* Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration are applied to the matching worksheets. For the syntax, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Max Characters - decimal unsigned integer [0, 2^32] or an empty string. When specified, the maxwidth and size-unit properties are attached to text units. Default: empty string. E.g.: <code>25</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
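: The column lists above use ALPHA-26 numbers, which are taken here to mean spreadsheet-style column letters (A, B, ..., Z, AA, ...). As a minimal sketch under that assumption, the conversion between such a name and a 1-based column index could look as follows in Python. The function names are hypothetical and are not part of Okapi.

```python
def column_to_index(name: str) -> int:
    """Convert a column name such as 'A' or 'AA' to a 1-based index."""
    if not name or not name.isascii() or not name.isalpha():
        raise ValueError(f"not a column name: {name!r}")
    index = 0
    for letter in name.upper():
        # Bijective base 26: 'A' counts as 1 and 'Z' as 26, there is no zero digit.
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index


def index_to_column(index: int) -> str:
    """Convert a 1-based index back to a column name."""
    if index < 1:
        raise ValueError(f"column index must be positive: {index}")
    letters = []
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


def parse_column_list(value: str) -> list[int]:
    """Parse a comma-separated list such as 'C,D' into column indexes."""
    return [column_to_index(part.strip()) for part in value.split(",") if part.strip()]
```

: With these helpers, a value such as <code>C,D</code> maps to the third and fourth columns of the worksheet.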
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
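: To make the structure of these keys explicit, here is a small Python sketch that groups the lines by configuration index. It only mimics the key layout shown above: it is not Okapi's <code>ParametersString</code> parser, and the function name and the consistency check against <code>number.i</code> are assumptions made for illustration.

```python
import re

CONFIG_TEXT = """\
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
"""

# Matches keys of the form worksheetConfigurations.<index>.<property>
KEY = re.compile(r"worksheetConfigurations\.(\d+)\.(\w+)")


def parse_worksheet_configurations(text: str) -> list[dict[str, str]]:
    """Group 'worksheetConfigurations.N.property=value' lines by index N."""
    configs: dict[int, dict[str, str]] = {}
    declared = None
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == "worksheetConfigurations.number.i":
            declared = int(value)  # declared number of configurations
            continue
        match = KEY.fullmatch(key)
        if match:
            configs.setdefault(int(match.group(1)), {})[match.group(2)] = value
    ordered = [configs[index] for index in sorted(configs)]
    if declared is not None and declared != len(ordered):
        raise ValueError(f"expected {declared} configurations, found {len(ordered)}")
    return ordered


configurations = parse_worksheet_configurations(CONFIG_TEXT)
```

: Here <code>configurations</code> holds a single dictionary whose <code>namePattern</code> is <code>Sheet1</code>.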
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to consider the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
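A consumer of such an extraction may want to pair each translatable cell with the metadata of its row. The following Python sketch reads a shortened copy of the fragment above with the standard library. It assumes the fragment has no XML namespace declaration, as shown here; a complete XLIFF file normally declares one, and the tag lookups would have to be adapted. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_FRAGMENT = """\
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
</group>
"""


def rows_with_metadata(fragment: str):
    """Return (row name, metadata dict, [(resname, source text)]) per row group."""
    sheet = ET.fromstring(fragment)
    rows = []
    for row in sheet.findall("group"):
        metadata = {}
        for context_group in row.findall("context-group"):
            if context_group.get("name") != "row-metadata":
                continue
            for context in context_group.findall("context"):
                metadata[context.get("context-type")] = context.text
        units = [
            (unit.get("resname"), unit.findtext("source"))
            for unit in row.findall("trans-unit")
        ]
        rows.append((row.get("resname"), metadata, units))
    return rows


rows = rows_with_metadata(XLIFF_FRAGMENT)
```

Each entry of <code>rows</code> then carries the row number, the row metadata keyed by its <code>context-type</code>, and the cells exposed for translation.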

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-01-15T14:32:16Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Available under <cite>Clean Tags Aggressively</cite>. When checked, the character styles (formatting) applied to whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each applied to the worksheets whose names match a pattern. A configuration can exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: Each configuration can specify:
:* Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration are applied to the matching worksheets. For the syntax, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Max Characters - decimal unsigned integer [0, 2^32] or an empty string. When specified, the maxwidth and size-unit properties are attached to text units. E.g.: <code>25</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
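The relationship between the configuration and the merged table can be sketched in Python as follows. The worksheet is modelled as a plain dictionary keyed by cell reference, and the translation is faked by appending <code>-tr</code>, as in the example. The function and variable names are hypothetical; this is only a model of the behaviour described here, not the filter's implementation.

```python
def merge_translations(cells, source_columns, target_columns, excluded_rows, translate):
    """Return a copy of `cells` with translated source cells written to target columns.

    `cells` maps references such as "A3" to text. Single-letter column
    names are assumed to keep the sketch short.
    """
    merged = dict(cells)
    rows = sorted({int(ref[1:]) for ref in cells})
    for source, target in zip(source_columns, target_columns):
        for row in rows:
            if row in excluded_rows:
                continue  # excluded rows are left untouched
            text = cells.get(f"{source}{row}")
            if text is not None:
                merged[f"{target}{row}"] = translate(text)
    return merged


sheet1 = {
    "A2": "Metadata Header A2", "B2": "Metadata Header B2",
    "A3": "A3", "B3": "B3",
    "A4": "A4", "B4": "B4",
    "A5": "A5", "B5": "B5",
}
merged = merge_translations(sheet1, ["A"], ["B"], {1, 2}, lambda text: text + "-tr")
```

In this model, column B of rows 3 to 5 receives the translated content of column A, while the excluded header rows keep their original values.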

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to consider the 1st and 2nd rows as metadata about the metadata in columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under '''the General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-01-15T14:31:46Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite, version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: Available under <cite>Clean Tags Aggressively</cite>. When checked, the character styles (formatting) applied to whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each applied to the worksheets whose names match a pattern. A configuration can exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: Each configuration can specify:
:* Name Pattern - a regular expression matched against worksheet names; the other settings of the configuration are applied to the matching worksheets. For the syntax, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Max Characters - decimal unsigned integer [0, 2^32] or an empty string. When specified, the maxwidth and size-unit properties are attached to text units. E.g.: <code>25</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
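: The key layout shown above can also be produced programmatically. The following Python sketch only mirrors the keys visible in the listing; <code>render_worksheet_configurations</code> is a hypothetical helper, not an Okapi API, and a complete parameters file may need content that is not shown here.

```python
def render_worksheet_configurations(configurations):
    """Render a list of dicts as worksheetConfigurations.* key=value lines."""
    lines = [f"worksheetConfigurations.number.i={len(configurations)}"]
    for position, configuration in enumerate(configurations):
        for key, value in configuration.items():
            # Lists of rows or columns are written as comma-separated values.
            if isinstance(value, (list, tuple)):
                value = ",".join(str(item) for item in value)
            lines.append(f"worksheetConfigurations.{position}.{key}={value}")
    return "\n".join(lines)


example = render_worksheet_configurations([
    {
        "namePattern": "Sheet1",
        "sourceColumns": ["A"],
        "targetColumns": ["B"],
        "excludedRows": [1, 2],
        "excludedColumns": ["C", "D"],
    }
])
print(example)
```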
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in columns. Finally, we would like not to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, exposes slide masters and notes masters for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-01-15T14:31:09Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of Microsoft Office documents from version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked (this is a sub-option of "Clean Tags Aggressively"), the character styles (formatting) of whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations that exclude rows and/or columns from translation, and/or mark rows and/or columns as metadata, for the worksheets whose names match a given pattern.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression matched against worksheet names; all other settings of the configuration are applied to the worksheets that match. For the pattern syntax, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Max Characters - non-negative decimal integer [0, 2^32] or an empty string. When specified, the maxwidth and size-unit properties are attached to text units. E.g.: <code>25</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
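: To illustrate how a Name Pattern selects worksheets, the Python sketch below checks a worksheet name against a list of patterns. It is a rough model only: it uses Python's <code>re</code> module rather than <code>java.util.regex.Pattern</code> (the two syntaxes differ in places), it assumes that a pattern must match the whole worksheet name, and <code>matching_configurations</code> is a hypothetical name.

```python
import re


def matching_configurations(worksheet_name, name_patterns):
    """Return the positions of all patterns that match the whole worksheet name."""
    return [
        position
        for position, pattern in enumerate(name_patterns)
        if re.fullmatch(pattern, worksheet_name)
    ]


patterns = ["Sheet1", r"Sheet\d+"]
print(matching_configurations("Sheet1", patterns))
print(matching_configurations("Sheet10", patterns))
print(matching_configurations("Summary", patterns))
```

: Whether the filter requires a full match or accepts a partial one is not stated here, so it is worth verifying the behaviour with a test document before relying on a broad pattern.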
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
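: The effect of the source and target columns on the merged table can be modelled with a few lines of Python. This is a conceptual sketch, not the filter's implementation: rows are plain dictionaries, the merged header cells of row 1 are simplified to empty strings, and <code>merge_translations</code> and the <code>-tr</code> "translation" are placeholders invented for the example.

```python
def merge_translations(rows, source_column, target_column, excluded_rows, translate):
    """Write the translation of each source cell into the target column.

    Rows are numbered from 1; rows listed in excluded_rows are left untouched.
    """
    merged = []
    for row_number, row in enumerate(rows, start=1):
        new_row = dict(row)  # copy, so the input table is not modified
        if row_number not in excluded_rows:
            new_row[target_column] = translate(row[source_column])
        merged.append(new_row)
    return merged


rows = [
    {"A": "Metadata Header A1", "B": "", "C": "Metadata Header C1", "D": ""},
    {"A": "Metadata Header A2", "B": "Metadata Header B2",
     "C": "Metadata Header C2", "D": "Metadata Header D2"},
    {"A": "A3", "B": "B3", "C": "C3", "D": "Metadata D3"},
    {"A": "A4", "B": "B4", "C": "C4", "D": "Metadata D4"},
    {"A": "A5", "B": "B5", "C": "C5", "D": "Metadata D5"},
]

merged = merge_translations(rows, "A", "B", {1, 2}, lambda text: text + "-tr")
for row in merged:
    print(row["A"], "|", row["B"], "|", row["C"], "|", row["D"])
```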

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in columns. Finally, we would like not to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
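: A fragment like the one above can be inspected with Python's standard <code>xml.etree.ElementTree</code> module. The sketch below embeds a shortened copy of the listing (row 3 only) and collects the row metadata and the translation units per row. <code>list_rows</code> is a hypothetical helper; a complete XLIFF file normally declares a namespace, in which case the tag names would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the extraction shown above (row 3 only).
FRAGMENT = """
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
</group>
"""


def list_rows(fragment):
    """Return one dict per row group with its metadata and translation units."""
    sheet = ET.fromstring(fragment.strip())
    rows = []
    for row in sheet.findall("group"):
        metadata = [
            (context.get("context-type"), context.text)
            for context in row.findall("context-group/context")
        ]
        units = [
            (unit.get("resname"), unit.findtext("source"))
            for unit in row.findall("trans-unit")
        ]
        rows.append({"row": row.get("resname"), "metadata": metadata, "units": units})
    return rows


parsed_rows = list_rows(FRAGMENT)
print(parsed_rows)
```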

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, exposes slide masters and notes masters for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2025-01-14T13:45:44Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of Microsoft Office documents from version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked (this is a sub-option of "Clean Tags Aggressively"), the character styles (formatting) of whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translate Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translate Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain a preset colour value or an RGB hex string: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations that exclude rows and/or columns from translation, and/or mark rows and/or columns as metadata, for the worksheets whose names match a given pattern.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression matched against worksheet names; all other settings of the configuration are applied to the worksheets that match. For the pattern syntax, please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Max Characters - non-negative decimal integer [0, 2^31-1] or an empty string, which means that no limit is applied. Note: it works for plain text only; inline styles are not supported.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
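: The value range given for Max Characters can be checked before a configuration is written. The Python sketch below is an assumption-laden illustration: <code>parse_max_characters</code> and <code>fits</code> are hypothetical names, surrounding whitespace is ignored by choice, and the length is counted in Python characters, which may differ from how the filter counts.

```python
MAX_CHARACTERS_LIMIT = 2**31 - 1  # upper bound quoted in the option description


def parse_max_characters(value):
    """Return None for an empty value (no limit), else the validated integer."""
    text = value.strip()
    if text == "":
        return None
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a non-negative decimal integer: {value!r}")
    number = int(text)
    if number > MAX_CHARACTERS_LIMIT:
        raise ValueError(f"value out of range: {value!r}")
    return number


def fits(plain_text, max_characters):
    """Check plain text against the limit; None means the limit is not applied."""
    return max_characters is None or len(plain_text) <= max_characters


limit = parse_max_characters("25")
print(fits("A short cell value", limit))
print(fits("x" * 26, limit))
```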
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
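: Going the other way, the listing above can be read back into one dictionary per worksheet configuration. This Python sketch handles only the simple <code>key=value</code> lines shown here; <code>parse_worksheet_configurations</code> is a hypothetical helper and not the parser used by Okapi.

```python
def parse_worksheet_configurations(text):
    """Group worksheetConfigurations.<n>.<key>=<value> lines by configuration."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-like lines
        key, _, value = line.partition("=")
        values[key] = value

    count = int(values.get("worksheetConfigurations.number.i", "0"))
    configurations = []
    for position in range(count):
        prefix = f"worksheetConfigurations.{position}."
        configurations.append(
            {key[len(prefix):]: value
             for key, value in values.items() if key.startswith(prefix)}
        )
    return configurations


SAMPLE = """\
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
"""

parsed = parse_worksheet_configurations(SAMPLE)
print(parsed)
```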
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each translatable cell in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata in columns. Finally, we would like not to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
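: The <code>context-type</code> values in the listing above appear to be built from the metadata rows of the metadata column, joined with a semicolon and prefixed with <code>x-</code>. The Python sketch below reproduces that reading of the example. It is an inference from the listing rather than a description of the filter's code: <code>row_metadata</code> is a hypothetical function, and the merged header cell C1 is assumed to supply the row-1 value for column D.

```python
def row_metadata(rows, metadata_rows, metadata_columns):
    """Return {row number: [(context type, value), ...]} for non-metadata rows."""
    result = {}
    for row_number, row in enumerate(rows, start=1):
        if row_number in metadata_rows:
            continue  # metadata rows describe the columns, not a data row
        entries = []
        for column in metadata_columns:
            headers = [rows[m - 1][column] for m in sorted(metadata_rows)]
            entries.append(("x-" + ";".join(headers), row[column]))
        result[row_number] = entries
    return result


# Column D of the sample table; row 1 repeats the merged header cell C1.
table = [
    {"D": "Metadata Header C1"},
    {"D": "Metadata Header D2"},
    {"D": "Metadata D3"},
    {"D": "Metadata D4"},
    {"D": "Metadata D5"},
]

contexts = row_metadata(table, metadata_rows={1, 2}, metadata_columns=["D"])
for number, entries in contexts.items():
    print(number, entries)
```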

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, exposes slide masters and notes masters for translation. This also exposes content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2024-11-29T15:04:03Z

Dkonovalyenko: /* PowerPoint Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of Microsoft Office documents from version 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked (this is a sub-option of "Clean Tags Aggressively"), the character styles (formatting) of whitespace are ignored and considered equal to the subsequent ones. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each tied to a worksheet name pattern, that exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
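: The effect of the source and target columns on the merged sheet can be sketched in a few lines of Python. This is only an illustration of the example above, not the filter's implementation: the function name <code>merge_translations</code>, the list-of-lists sheet model and the <code>translate</code> callable are all assumptions made for this sketch.

```python
def merge_translations(rows, source_col, target_col, excluded_rows, translate):
    """Return a copy of rows where target_col holds the translated source_col.

    rows          -- list of rows, each a list of cell texts (rows[0] is row 1)
    source_col    -- 0-based index of the source column (0 stands for A)
    target_col    -- 0-based index of the target column (1 stands for B)
    excluded_rows -- 1-based row numbers that are left untouched
    translate     -- callable standing in for the translation step
    """
    merged = [list(row) for row in rows]
    for number, row in enumerate(merged, start=1):
        if number in excluded_rows:
            continue
        row[target_col] = translate(row[source_col])
    return merged


# The example sheet; merged header cells keep their text in the first column only.
sheet1 = [
    ["Metadata Header A1", "", "Metadata Header C1", ""],
    ["Metadata Header A2", "Metadata Header B2", "Metadata Header C2", "Metadata Header D2"],
    ["A3", "B3", "C3", "Metadata D3"],
    ["A4", "B4", "C4", "Metadata D4"],
    ["A5", "B5", "C5", "Metadata D5"],
]

# Column A is the source, column B the target, rows 1 and 2 are excluded.
merged_sheet = merge_translations(sheet1, 0, 1, {1, 2}, lambda text: text + "-tr")
for row in merged_sheet[2:]:
    print(" | ".join(row))
```

: Columns C and D are never touched here, which mirrors the <code>excludedColumns=C,D</code> setting of the example.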

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata describing the metadata columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
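: One way to read the metadata example is sketched below in Python. It is a simplified model and not the filter's code: the function <code>extract_with_metadata</code>, the list-of-lists sheet and the choice to repeat merged header cells in every column they span are assumptions made to reproduce the <code>context-type</code> values shown above. Only single-letter columns are handled.

```python
from string import ascii_uppercase

# Example sheet; the merged header cells A1:B1 and C1:D1 are repeated in each
# column they span (an assumption of this sketch).
SHEET1 = [
    ["Metadata Header A1", "Metadata Header A1", "Metadata Header C1", "Metadata Header C1"],
    ["Metadata Header A2", "Metadata Header B2", "Metadata Header C2", "Metadata Header D2"],
    ["A3", "B3", "C3", "Metadata D3"],
    ["A4", "B4", "C4", "Metadata D4"],
    ["A5", "B5", "C5", "Metadata D5"],
]


def extract_with_metadata(sheet_name, rows, metadata_rows, metadata_columns,
                          excluded_rows, excluded_columns):
    """Return (resname, source text, row metadata) for each translatable cell."""
    header_rows = sorted(metadata_rows)
    units = []
    for number, row in enumerate(rows, start=1):
        if number in metadata_rows or number in excluded_rows:
            continue
        # Build the row metadata: one entry per metadata column, labelled with
        # the texts found in the metadata rows of that column.
        context = {}
        for letter in sorted(metadata_columns):
            index = ascii_uppercase.index(letter)
            label = "x-" + ";".join(rows[r - 1][index] for r in header_rows)
            context[label] = row[index]
        for index, text in enumerate(row):
            letter = ascii_uppercase[index]
            if letter in metadata_columns or letter in excluded_columns:
                continue
            units.append((f"{sheet_name}!{letter}{number}", text, context))
    return units


units = extract_with_metadata("Sheet1", SHEET1, {1, 2}, {"D"}, {5}, {"C"})
for resname, text, context in units:
    print(resname, text, context)
```

: In this model the excluded 5th row yields no translatable cells, which matches the empty group for row 5 in the extraction above.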

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.
; Translate Graphic Metadata
: When checked, the graphic metadata (@name and @descr attribute values) are exposed for translation. Default: off.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2024-11-29T14:56:13Z

Dkonovalyenko: /* General Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.
; Ignore Whitespace Styles
: When checked together with <cite>Clean Tags Aggressively</cite>, the whitespace character styles (formatting) are ignored and treated as equal to the styles that follow them. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each tied to a worksheet name pattern, that exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
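: The column lists above use letters rather than numbers. Assuming that "ALPHA-26" means the usual spreadsheet lettering (A=1, ..., Z=26, AA=27, ...), the conversion can be sketched in Python as follows. The function names are hypothetical and are not part of the filter.

```python
def column_to_index(label):
    """Convert a column label such as 'A' or 'AB' to its 1-based index."""
    if not label:
        raise ValueError("empty column label")
    index = 0
    for char in label.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"invalid column label: {label!r}")
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def index_to_column(index):
    """Convert a 1-based column index back to its label."""
    if index < 1:
        raise ValueError("column index must be 1 or greater")
    letters = []
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


def parse_columns(value):
    """Parse a comma-separated list such as 'A,B' into 1-based indexes."""
    return [column_to_index(part.strip()) for part in value.split(",") if part.strip()]


print(parse_columns("A,B"))   # [1, 2]
print(index_to_column(27))    # AA
```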
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata describing the metadata columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
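: The <code>worksheetConfigurations</code> entries shown in the examples above follow a regular <code>prefix.index.field=value</code> shape, so they can be grouped with a small script when checking a configuration by hand. The Python sketch below is an assumption-laden reader for exactly the keys that appear in those examples; it is not Okapi's <code>ParametersString</code> parser, and the function name is hypothetical.

```python
PREFIX = "worksheetConfigurations."
ROW_FIELDS = {"excludedRows", "metadataRows"}
COLUMN_FIELDS = {"sourceColumns", "targetColumns", "excludedColumns", "metadataColumns"}


def parse_worksheet_configurations(text):
    """Group 'worksheetConfigurations.<n>.<field>=<value>' lines by index <n>."""
    grouped = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        index, _, field = key[len(PREFIX):].partition(".")
        if not index.isdigit():
            continue  # skips the counter line 'worksheetConfigurations.number.i'
        if field in ROW_FIELDS:
            value = [int(item) for item in value.split(",") if item.strip()]
        elif field in COLUMN_FIELDS:
            value = [item.strip() for item in value.split(",") if item.strip()]
        grouped.setdefault(int(index), {})[field] = value
    return [grouped[index] for index in sorted(grouped)]


EXAMPLE = """
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
"""

configurations = parse_worksheet_configurations(EXAMPLE)
print(configurations)
```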

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2024-10-02T14:51:20Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored up to the specified value. It can be empty (considered as a white colour by default), or contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Preserve Styles In Target Columns
: When checked, the cell styles in target columns are preserved. Default: off.
; Worksheet Configurations
: A list of configurations, each tied to a worksheet name pattern, that exclude rows and/or columns from translation and/or mark rows and/or columns as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
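: Because the Name Pattern is a regular expression, one configuration can apply to several worksheets. The Python sketch below illustrates the idea only. It uses Python's <code>re</code> module, whose syntax differs in places from <code>java.util.regex.Pattern</code>, and it assumes that a pattern has to match the whole worksheet name, which this page does not state. The function name and the second pattern are hypothetical.

```python
import re


def matching_configurations(sheet_name, configurations):
    """Return the configurations whose namePattern matches the whole sheet name."""
    return [
        configuration
        for configuration in configurations
        if re.fullmatch(configuration["namePattern"], sheet_name)
    ]


configurations = [
    {"namePattern": "Sheet1", "excludedRows": [1, 2]},
    {"namePattern": r"Sheet\d+", "excludedColumns": ["C"]},
]

for name in ("Sheet1", "Sheet12", "Summary"):
    matched = matching_configurations(name, configurations)
    print(name, [configuration["namePattern"] for configuration in matched])
```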
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata describing the metadata columns. Finally, we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: Then, the extraction to XLIFF should look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2024-07-03T15:57:39Z

Dkonovalyenko: /* Word Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored ending by the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.
; Allow Style Optimisation
: When checked, the optimisation of styles is allowed - common formatting of all runs in a paragraph is moved to the styles part. Default: on.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Worksheet Configurations
: A list of configurations, each applied to the worksheets whose names match a pattern, that exclude rows and/or columns from translation or mark them as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}
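
The effect of this first configuration on the sample table can be previewed with a small sketch. This is not the filter's implementation: the <code>merge_preview</code> function, the dictionary layout of the sheet, the use of a full match for the name pattern, and the <code>-tr</code> stand-in for a real translation are all assumptions made for illustration.

```python
import re

# Sample worksheet from the example above, keyed by row number, then column letter.
SHEET = {
    1: {"A": "Metadata Header A1", "C": "Metadata Header C1"},
    2: {"A": "Metadata Header A2", "B": "Metadata Header B2",
        "C": "Metadata Header C2", "D": "Metadata Header D2"},
    3: {"A": "A3", "B": "B3", "C": "C3", "D": "Metadata D3"},
    4: {"A": "A4", "B": "B4", "C": "C4", "D": "Metadata D4"},
    5: {"A": "A5", "B": "B5", "C": "C5", "D": "Metadata D5"},
}

# The same values as in the worksheetConfigurations.0.* entries above.
CONFIG = {
    "namePattern": "Sheet1",
    "sourceColumns": ["A"],
    "targetColumns": ["B"],
    "excludedRows": [1, 2],
    "excludedColumns": ["C", "D"],
}


def merge_preview(sheet_name, sheet, config, translate):
    """Return a copy of the sheet with translated source cells written to the target columns."""
    merged = {row: dict(cells) for row, cells in sheet.items()}
    # Assumption: the name pattern has to match the whole worksheet name.
    if not re.fullmatch(config["namePattern"], sheet_name):
        return merged
    pairs = zip(config["sourceColumns"], config["targetColumns"])
    for source, target in pairs:
        if source in config["excludedColumns"]:
            continue  # excluded columns are never extracted
        for row, cells in sheet.items():
            if row in config["excludedRows"] or source not in cells:
                continue
            merged[row][target] = translate(cells[source])
    return merged


# "-tr" is only a placeholder for a real translation.
merged = merge_preview("Sheet1", SHEET, CONFIG, lambda text: text + "-tr")
for row in sorted(merged):
    print(row, merged[row])
```

With the sample data, rows 3 to 5 end up with the translated content of column A in column B, while rows 1 and 2 and columns C and D are left untouched, which matches the merged table above.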

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata columns, and we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: The extraction to XLIFF should then look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>
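
A downstream tool could pair each translation unit with the row metadata extracted alongside it. The following is a minimal sketch using only the Python standard library; the <code>units_with_row_metadata</code> function is hypothetical, and the fragment is a shortened copy of the extraction above without the XLIFF namespace that a complete XLIFF document would declare.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the extraction shown above (row 3 only, no XLIFF namespace).
FRAGMENT = """
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
</group>
"""


def units_with_row_metadata(xml_text):
    """Return (resname, source text, row metadata) for every trans-unit."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for row_group in root.iter("group"):
        # Metadata cells of the row, keyed by their context-type value.
        metadata = {
            context.get("context-type"): context.text
            for context in row_group.findall(
                "context-group[@name='row-metadata']/context"
            )
        }
        # Only direct children, so each unit is reported once, with its own row.
        for unit in row_group.findall("trans-unit"):
            results.append((unit.get("resname"), unit.findtext("source"), metadata))
    return results


for resname, source, metadata in units_with_row_metadata(FRAGMENT):
    print(resname, source, metadata)
```

For a real XLIFF file, the element names in the lookups would need to be qualified with the namespace declared in the document.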

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2023-12-27T09:37:26Z

Dkonovalyenko: /* Excel Options */

{{Filters Header}}
==Overview==

This filter allows you to process the different types of documents of the Microsoft Office suite from 2007 and later, such as DOCX (text documents), XLSX (spreadsheets) and PPTX (presentations). These documents are based on the OpenXML format, as opposed to the binary formats used by pre-2007 versions of Office.

==Parameters==

The filter parameters are divided into '''General Options''', which apply to all formats, and format-specific options.

===General Options===
; Translate Document Properties
: When checked, exposes the following document properties for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Translate Comments
: When checked, exposes document comments for translation. Default: on.
; Clean Tags Aggressively
: When checked, strips additional formatting tags related to text spacing. This is meant to improve filtering in cases where Office documents were converted from other formats (in particular, PDF), and imperfect conversion added a lot of extra formatting noise. Default: off.

=== Word Options ===
; Translated Headers and Footers
: When checked, exposes header and footer content for translation. Default: on.
; Translate Numbering Level Text
: When checked, exposes numbering-level text for translation. Default: off.
; Translated Hidden Text
: When checked, exposes hidden text for translation. Default: on.
; Exclude Graphical Metadata
: When not checked, labels associated with drawings and word art are exposed for translation. When checked, these labels (which are frequently not displayed in the document) are suppressed. Default: off.
; Ignored Styles > Ignore Font Colours
: When checked, font colours will be ignored. Default: off.
: If <cite>Clean Tags Aggressively</cite> and this option are checked and the ignorance thresholds are empty, the font colour run properties are removed from the document structure on filtering. This means that the font colour information is absent on merge as well.
; Ignored Styles > Font Colours Minimum Ignorance Threshold
: When defined, font colours will be ignored starting from the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: black, Black, 000000 - thresholds in white. Default: none.
; Ignored Styles > Font Colours Maximum Ignorance Threshold
: When defined, font colours will be ignored ending by the specified value. It can be empty (considered as a white colour by default), and contain preset colour values or RGB hex strings: white, White, FFFFFF - thresholds in white. Default: none.
; Excluded/Included Styles
: Depending on the radio switch (exclude or include), text using any selected styles will be excluded or included for translation. Default: none.
; Excluded/Included Highlight Colors
: Depending on the radio switch (exclude or include), text using any selected colours will be excluded or included for translation. Default: none.
; Excluded Font Colours
: Text using any selected colours will not be exposed for translation. Default: none.

=== Excel Options ===
; Translate Hidden Rows and Columns
: When checked, hidden rows and columns are exposed for translation. Default: off.
; Colors to Exclude
: Text with a foreground color matching any of the selected colors in this option will be excluded from translation. These colors correspond to the standard color palette of Excel 2010. The configuration itself stores these values as RGB, so specific colors not explicitly listed here may be excluded by modifying the .fprm file by hand. Default: none.
; Translate Cells Copied
: When checked, cell data are copied on extraction to allow contextualised and independent translations. Default: on.
; Worksheet Configurations
: A list of configurations, each applied to the worksheets whose names match a pattern, that exclude rows and/or columns from translation or mark them as metadata.
: For one configuration it is possible to specify:
:* Name Pattern - a regular expression, by which all other operations are matched and applied. For formatting options please refer to <code>java.util.regex.Pattern</code>. E.g.: <code>Sheet1</code>.
:* Source Columns - a list of ALPHA-26 numbers, specifying columns that are copied over the target ones for translation/extraction. E.g.: <code>A,B</code>.
:* Target Columns - a list of ALPHA-26 numbers, specifying columns that are overwritten by the source ones for translation/extraction. E.g.: <code>C,D</code>.
:* Excluded Rows - a list of integers, pointing out row numbers that are excluded from translation/extraction. E.g.: <code>1,2</code>.
:* Excluded Columns - a list of ALPHA-26 numbers, specifying columns that are excluded from translation/extraction. E.g.: <code>A,B</code>.
:* Metadata Rows - a list of integers, pointing out row numbers that are treated and extracted as metadata. E.g.: <code>3,4</code>.
:* Metadata Columns - a list of ALPHA-26 numbers, specifying columns that are treated and extracted as metadata. E.g.: <code>C,D</code>.
: Let's consider a simple table as an example and find out what can be done with all those configurations.
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || B3 || C3 || Metadata D3
|-
| A4 || B4 || C4 || Metadata D4
|-
| A5 || B5 || C5 || Metadata D5
|}
: Firstly, let's suppose we would like to translate column A only and place the translation in column B. At the same time we do not want to translate the 1st and the 2nd rows.
: This requirement can be configured in the following way (using the <code>net.sf.okapi.common.ParametersString</code> format as an example):
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.sourceColumns=A
worksheetConfigurations.0.targetColumns=B
worksheetConfigurations.0.excludedRows=1,2
worksheetConfigurations.0.excludedColumns=C,D
</pre>
: Then the XLIFF would look like this after extraction and translation:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<trans-unit id="P147242AB-tu1" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es">A3-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<trans-unit id="P147242AB-tu2" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es">A4-tr</target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<trans-unit id="P147242AB-tu3" resname="Sheet1!B5" xml:space="preserve">
<source xml:lang="en">A5</source>
<target xml:lang="es">A5-tr</target>
</trans-unit>
</group>
</group>
</pre>
: And the merged representation would be the following:
{| class="wikitable" style="margin:auto"
|-
! colspan="2"|Metadata Header A1 !! colspan="2"|Metadata Header C1
|-
! Metadata Header A2 !! Metadata Header B2 || Metadata Header C2 !! Metadata Header D2
|-
| A3 || A3-tr || C3 || Metadata D3
|-
| A4 || A4-tr || C4 || Metadata D4
|-
| A5 || A5-tr || C5 || Metadata D5
|}

: Furthermore, let's suppose we would like to translate columns A and B, and treat column D as metadata for each of the translatable cells in a row. At the same time, we would like to treat the 1st and 2nd rows as metadata about the metadata columns, and we do not want to extract the 5th row.
: All these requirements can be written as the following configurations:
<pre>
worksheetConfigurations.number.i=1
worksheetConfigurations.0.namePattern=Sheet1
worksheetConfigurations.0.excludedRows=5
worksheetConfigurations.0.excludedColumns=C
worksheetConfigurations.0.metadataRows=1,2
worksheetConfigurations.0.metadataColumns=D
</pre>
: The extraction to XLIFF should then look like this:
<pre>
<group id="P76C545-sg1" resname="Sheet1">
<group id="P132303AB-sg1" resname="1">
</group>
<group id="P132303AB-sg2" resname="2">
</group>
<group id="P132303AB-sg3" resname="3">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D3</context>
</context-group>
<trans-unit id="P147242AB-tu1" resname="Sheet1!A3" xml:space="preserve">
<source xml:lang="en">A3</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu2" resname="Sheet1!B3" xml:space="preserve">
<source xml:lang="en">B3</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg4" resname="4">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D4</context>
</context-group>
<trans-unit id="P147242AB-tu3" resname="Sheet1!A4" xml:space="preserve">
<source xml:lang="en">A4</source>
<target xml:lang="es"></target>
</trans-unit>
<trans-unit id="P147242AB-tu4" resname="Sheet1!B4" xml:space="preserve">
<source xml:lang="en">B4</source>
<target xml:lang="es"></target>
</trans-unit>
</group>
<group id="P132303AB-sg5" resname="5">
<context-group name="row-metadata">
<context context-type="x-Metadata Header C1;Metadata Header D2">Metadata D5</context>
</context-group>
</group>
</group>
</pre>

=== PowerPoint Options ===
; Translate Document Properties
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the following document properties are exposed for translation: title, subject, creator, description, category, keywords, content status. Default: on.
; Reorder Document Properties
: When checked, the document properties are reordered and placed after the root relationship part (_rels/.rels). Default: off.
; Reorder Relationships
: When checked, the relationship parts are reordered and placed after the related slide or layout or master part. Default: off.
; Translate Diagram Data
: When checked, the diagram data are exposed for translation. Default: on.
; Reorder Diagram Data
: When checked, the diagram data parts are reordered and placed after the related slide or layout or master part and after their relationship parts. Default: off.
; Translate Charts
: When checked, the charts are exposed for translation. Default: on.
; Reorder Charts
: When checked, the chart parts are reordered and placed after the related slide or layout or master part and after their diagram data parts. Default: off.
; Translate Notes
: When checked, the slide notes are exposed for translation. Default: off.
; Reorder Notes
: When checked, the note parts are reordered and placed after the related slide part and after its chart parts. Default: off.
; Translate Comments
: When checked and the same option is checked under the '''General Options''' (''they will be separated after the next release''), the document comments are exposed for translation. Default: on.
; Reorder Comments
: When checked, the comment parts are reordered and placed after the related slide part and after its note parts. Default: off.
; Translate Masters
: When checked, expose slide masters and notes masters for translation. This will also expose for translation content from layouts that are currently in use by at least one slide. Default: on.

==Limitations==

* Various, see [https://bitbucket.org/okapiframework/okapi/issues?status=new&title=~OpenXML the issues list].

[[Category:Filters]]

OpenXML Filter

2023-12-14T14:18:20Z

Dkonovalyenko: /* Excel Options */

XML Filter

2023-08-16T16:34:01Z

Dkonovalyenko: /* Filter Options */

{{Filters Header}}
==Overview==

This filter allows you to process XML documents. It uses a DOM-based parser, which allows it to implement [[ITS]]. If you need to process very large XML documents and have no need for ITS, you may want to look at using the [[XML Stream Filter]].

The following is an example of a simple XML document. The translatable text is highlighted. Because each XML-based format is different, you need to know which parts are translatable, which elements are inline, and so on. The XML Filter [[#ITS Support|implements the ITS W3C Recommendation]] to address this issue.

<?xml version="1.0" encoding="utf-8"?>
<myDoc>
<prolog>
<author>Zebulon Fairfield</author>
<version>version 12, revision 2 - 2006-08-14</version>
<keywords><kw>horse</kw><kw>appaloosa</kw></keywords>
<storageKey>articles-6D272BA9-3B89CAD8</storageKey>
</prolog>
<body>
<title>Appaloosa</title>
The Appaloosas are rugged horses originally bred by
the <kw>Nez-Perce</kw> tribe in the US Northwest.
They are often characterized by their spotted coats.
</body>
</myDoc>

This filter is implemented in the class <code>net.sf.okapi.filters.xml.XMLFilter</code> of the library.

==Processing Details==

===Input Encoding===

The filter decides which encoding to use for the input document using the following logic:

* If the document has an encoding declaration, it is used.
* Otherwise, UTF-8 is used as the default encoding (regardless of the actual default encoding that was specified when opening the document).
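
The logic above can be sketched as follows. This is only an illustration, not the filter's code: the <code>detect_input_encoding</code> function and its regular expression are assumptions, and the sketch only recognizes declarations in ASCII-compatible encodings (a UTF-16 document, for instance, would need its byte-order mark to be examined first).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
ENCODING_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

UTF8_BOM = b"\xef\xbb\xbf"


def detect_input_encoding(head: bytes) -> str:
    """Return the declared encoding, or UTF-8 when there is no declaration."""
    # A UTF-8 byte-order mark may precede the declaration, so skip it.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    match = ENCODING_DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    # No declaration: fall back to UTF-8, whatever default the caller specified.
    return "UTF-8"


print(detect_input_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><doc/>'))
print(detect_input_encoding(b"<doc/>"))
```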

===Output Encoding===

If the output encoding is UTF-8:

* If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
* If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the original document had an XML encoding declaration, it is updated; otherwise, one is added automatically.
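
The Byte-Order-Mark rule for UTF-8 output can be written as a single condition. The sketch below is an illustration under assumptions: the <code>utf8_output_uses_bom</code> function is hypothetical, and it covers only the UTF-8 output case described above, not other output encodings.

```python
import codecs


def normalize(encoding_name: str) -> str:
    """Map aliases such as 'UTF8' or 'utf_8' to Python's canonical codec name."""
    return codecs.lookup(encoding_name).name


def utf8_output_uses_bom(input_encoding: str, input_had_bom: bool) -> bool:
    """Decide whether a UTF-8 output document gets a Byte-Order-Mark.

    A BOM is written only if the input was also UTF-8 and had one itself.
    """
    return normalize(input_encoding) == "utf-8" and input_had_bom


print(utf8_output_uses_bom("UTF-8", True))        # UTF-8 input with a BOM
print(utf8_output_uses_bom("UTF-8", False))       # UTF-8 input without a BOM
print(utf8_output_uses_bom("iso-8859-1", False))  # non-UTF-8 input
```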

===Line-Breaks===

The output uses the same type of line-break as the original input.

==Parameters==

This filter stores its parameters in an XML file and does not provide an editor to modify it. You can edit the file in a simple text editor, or with an XML editor. For an example, see the article "[[How to Create a Custom Configuration for the XML Filter]]".

===ITS Support===

By default, the filter processes XML documents based on the '''ITS defaults'''. That is:

* the content of all elements is translatable,
* and none of the attribute values are translatable.

Different behavior can occur if the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the XML Filter is [[ITS|an ITS document]].

The '''Internationalization Tag Set (ITS)''' is a W3C recommendation that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML document. For instance, ITS defines which attribute values are translatable, which element content should be protected, which elements should be treated as nested sub-flows of text, and much more.

The filter supports ITS 1.0 and ITS 2.0 (2.0 is backward compatible with 1.0):

* The ITS 1.0 specification is available at http://www.w3.org/TR/its/.
* The ITS 2.0 specification is available at http://www.w3.org/TR/its20/.

See the "[[ITS]]" page for more details on the format.

The filter supports global and local rules and most data categories. See the '''[[ITS Components]]''' page for a detailed list of how the data categories are supported and other information on the implementation.

===ITS Extensions===

The filter supports extensions to the ITS specification. These extensions use the namespace URI http://www.w3.org/2008/12/its-extensions.

* [[#idValue and xml:id|idValue and xml:id]]
* [[#whiteSpaces|whiteSpaces]]

====idValue and xml:id====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#idvalue Id Value] data category, which should be used instead of this extension.}}

When the attribute <code>xml:id</code> is found on a translatable element, it is used as the name of the text unit generated for that element.

In the example below, the resource name associated with the text unit for the <code>&lt;p></code> element is "<code>id1</code>".

 &lt;p xml:id="id1">Text&lt;/p>

The attribute <code>idValue</code> used in the ITS <code>translateRule</code> element allows you to define an XPath expression that corresponds to the identifier value for the given selection. The value of <code>idValue</code> must be an expression that can return a string. A node location is a valid expression: it will return the value of the first node at the given location.

In the example below, the resource name associated with the text unit for the <code>&lt;p></code> element is "<code>id1</code>":

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
&lt;p name="id1">text 1&lt;/p>
</doc></pre>

Note that <code>xml:id</code> has precedence over the <code>idValue</code> declaration. In the example below, the resource name associated with the text unit for the <code>&lt;p></code> element is "<code>xid1</code>", not "<code>id1</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//p" translate="yes" itsx:idValue="@name"/>
</its:rules>
&lt;p xml:id="xid1" name="id1">text 1&lt;/p>
</doc></pre>

You can build complex IDs based on different attributes, elements or even hard-coded text. Any of the string functions offered by XPath can be used.

For example, in the file below, the two elements <code>&lt;text></code> and <code><desc></code> are translatable, but they have only one corresponding ID, the <code>name</code> attribute in their parent element. To make sure you have a unique identifier for both the content of <code><text></code> and the content of <code><desc></code>, you can use the rules set in the example. The XPath expression "<code>concat(../@name, '_t')</code>" will give the ID "<code>id1_t</code>" and the expression "<code>concat(../@name, '_d')</code>" will give the ID "<code>id1_d</code>".

<pre><doc>
<its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions">
<its:translateRule selector="//text" translate="yes" itsx:idValue="concat(../@name, '_t')"/>
<its:translateRule selector="//desc" translate="yes" itsx:idValue="concat(../@name, '_d')"/>
</its:rules>
<msg name="id1">
<text>Value of text</text>
<desc>Value of desc</desc>
</msg>
</doc></pre>
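The effect of the two <code>concat()</code> expressions can be reproduced with a small Python sketch. Python's built-in ElementTree does not evaluate XPath string functions, so the sketch mimics them with plain string concatenation. The <code>SUFFIXES</code> table is an assumption for the illustration, not part of the filter.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<doc><msg name="id1">'
    '<text>Value of text</text>'
    '<desc>Value of desc</desc>'
    '</msg></doc>'
)

# Suffix per child element, mirroring the two translateRule entries.
SUFFIXES = {"text": "_t", "desc": "_d"}

ids = {}
for msg in doc.iter("msg"):
    for child in msg:
        if child.tag in SUFFIXES:
            # Same result as concat(../@name, '_t') or concat(../@name, '_d').
            ids[msg.get("name") + SUFFIXES[child.tag]] = child.text

print(ids)
```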

====whiteSpaces====

{{NoteBox|This extension was defined for ITS 1.0. ITS 2.0 offers the new [http://www.w3.org/TR/its20/#preservespace Preserve Space] data category, which should be used instead of this extension.}}

The extension attribute whiteSpaces allows you to apply globally the equivalent of a local <code>xml:space</code> attribute.

For example, if you have a format where all <code><pre></code> elements must have their spaces, tabs and line breaks preserved, you can specify the attribute <code>whiteSpaces="preserve"</code> in a <code><its:translateRule></code> element for the <code><pre></code> elements. In the example below, the spaces in the <code><pre></code> element will be preserved on extraction.

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre>Some txt with many spaces. </pre>
</doc>

Note that the <code>xml:space</code> attribute has precedence over <code>whiteSpaces</code>. For example, in the following example, the white spaces in the content of <code><pre></code> may '''not''' be preserved because the attribute <code>xml:space</code> has the value <code>default</code>:

<doc>
<nowiki><its:rules version="1.0" xmlns:its="http://www.w3.org/2005/11/its"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"></nowiki>
<its:translateRule selector="//pre" translate="yes" itsx:whiteSpaces="preserve"/>
</its:rules>
<pre xml:space="default">Some txt with many spaces. </pre>
</doc>

===Filter Options===

The filter also supports options in addition to ITS and the ITS extensions. These options use the namespace URI <code>okapi-framework:xmlfilter-options</code>.

{{NoteBox|The filter options must be placed in the parameters file (.fprm) used with the filter, not in embedded or linked ITS rules. Options placed in embedded or linked ITS rules have no effect.}}

When you use several options, they must be set in a single <code><okp:options></code> element, as shown below:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"
escapeQuotes="no"
escapeGT="yes"
/>
</its:rules></pre>
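Before using a parameters file, it can help to check that it is well-formed and that all options sit in one <code><okp:options></code> element. The following Python sketch does that for the example above. It is only a sanity check written for this page, not something the filter provides.

```python
import xml.etree.ElementTree as ET

OKP_NS = "okapi-framework:xmlfilter-options"

fprm = """<its:rules version="1.0"
 xmlns:its="http://www.w3.org/2005/11/its"
 xmlns:okp="okapi-framework:xmlfilter-options">
 <okp:options lineBreakAsCode="yes"
  escapeQuotes="no"
  escapeGT="yes"
 />
</its:rules>"""

root = ET.fromstring(fprm)
# All options are expected in a single okp:options child of its:rules.
option_elements = root.findall(f"{{{OKP_NS}}}options")
options = dict(option_elements[0].attrib)
print(len(option_elements), options)
```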

The following options are available:

* [[#lineBreakAsCode|lineBreakAsCode]]
* [[#codeFinder|codeFinder]]
* [[#omitXMLDeclaration|omitXMLDeclaration]]
* [[#escapeQuotes|escapeQuotes]]
* [[#escapeGT|escapeGT]]
* [[#escapeNbsp|escapeNbsp]]
* [[#extractIfOnlyCodes|extractIfOnlyCodes]]
* [[#inlineCdata|inlineCdata]]
* [[#extractUntranslatable|extractUntranslatable]]

====lineBreakAsCode====

In some cases the content of an element includes line-breaks that must be kept as part of the content, but without using an actual line-break in the extracted text. For example, in some XML documents generated by Excel, the formatting of the cells is marked up with <code>&#10;</code> entity references. They need to be passed as inline codes.

By default this option is set to false.

To turn this behavior on, use the <code>lineBreakAsCode</code> extension attribute. It affects all the extracted content.

For example, the following parameters file sets the option to treat line-breaks as codes. It can be used with the sample XML document that follows it.

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options lineBreakAsCode="yes"/>
</its:rules></pre>

<pre><doc>
<data>line 1&#10;line 2.</data>
</doc></pre>
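The reason the option exists can be seen with any XML parser: the character reference is resolved to a real line feed as soon as the document is read. The Python sketch below shows only this parser behavior, not the filter itself.

```python
import xml.etree.ElementTree as ET

# The parser turns &#10; into an actual line-feed character in the text.
data = ET.fromstring("<doc><data>line 1&#10;line 2.</data></doc>").find("data")
print(repr(data.text))
```

With <code>lineBreakAsCode</code> set, the filter is described as passing that line feed as an inline code instead of leaving a raw line-break in the extracted text.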

====codeFinder====

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables or HTML tags that need to be protected from modification and treated as codes. Use the <code>codeFinder</code> element for this.

In the following parameters file, the <code>codeFinder</code> element defines two rules:

* The first one (rule0) is "<code><(/?)\w[^>]*?></code>" and matches any XML-type tags (e.g. "<code></code>", "<code></code>", "<code> </code>")
* The second one (rule1) is "<code>(#\w+?\#)|(%\d+?%)</code>" and matches any word enclosed in <code>#</code> (e.g. "<code>#VAR#</code>") or number enclosed in <code>%</code> (e.g. "<code>%1%</code>").

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:codeFinder useCodeFinder="yes">#v1
count.i=2
rule0=&lt;(/?)\w+[^&gt;]*?&gt;
rule1=(#\w+?\#)|(%\d+?%)
</okp:codeFinder>
</its:rules></pre>
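To see what the two rules capture, they can be tried on a sample string. The sketch below uses Python's <code>re</code> module, while Okapi itself runs on Java regular expressions. For these two patterns and this sample both engines are expected to agree. Joining the rules into one alternation and the sample sentence are assumptions made for the demonstration.

```python
import re

# The two rules from the codeFinder example, unescaped from their XML form.
RULES = [
    r"<(/?)\w+[^>]*?>",
    r"(#\w+?\#)|(%\d+?%)",
]
# Combine the rules into one alternation for this demonstration only.
code_finder = re.compile("|".join(f"(?:{rule})" for rule in RULES))

sample = "Press <b>#KEY#</b> to copy %1% files."
codes = [match.group(0) for match in code_finder.finditer(sample)]
print(codes)
```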

Some important details:

* Set <code>useCodeFinder</code> to "yes" to have the rules used. If the attribute is missing, its value is assumed to be "no".
* Make sure the first line of the <code><codeFinder></code> element content is <code>#v1</code>.
* Each entry in the content must be on a separate line.
* <code>count.i=N</code> must be before any rules and <code>N</code> must be the number of rules.
* <code>ruleN</code> must be incremented starting at 0.
* The pattern for a rule must be escaped for XML, for example: "<code><(/?)\w[^>]*?></code>" must be entered "<code>&lt;(/?)\w[^&gt;]*?&gt;</code>" in the parameters file.
* Do not put spaces before <code>count.i</code> or <code>ruleN</code>, or after your expressions.
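The details above can be applied mechanically. The following Python sketch builds the text content of a <code>codeFinder</code> element from a list of patterns: <code>#v1</code> first, then <code>count.i</code>, then one escaped rule per line. The function name <code>code_finder_content</code> is hypothetical and the sketch is not part of Okapi.

```python
from xml.sax.saxutils import escape, unescape

def code_finder_content(patterns):
    """Hypothetical helper: build the codeFinder text content from raw patterns."""
    lines = ["#v1", f"count.i={len(patterns)}"]
    # Rules are numbered from 0, XML-escaped, with no surrounding spaces.
    lines += [f"rule{i}={escape(p)}" for i, p in enumerate(patterns)]
    return "\n".join(lines)

rule0 = r"<(/?)\w+[^>]*?>"
rule1 = r"(#\w+?\#)|(%\d+?%)"

escaped = escape(rule0)
content = code_finder_content([rule0, rule1])
print(content)
```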

To make it easier to create code finder rules, [[Rainbow - Code Finder Editor|Rainbow provides the Code Finder Editor]].

====omitXMLDeclaration====

By default an XML declaration is always set at the top of the output document (regardless of whether the original document has one or not). It is an important part of the XML document, and it is especially needed when the encoding of the output document is not UTF-8, UTF-16 or UTF-32, because the encoding name must then be specified in the XML declaration. However, there are a few special cases when the declaration is better left off. To handle those rare cases, you can use <code>omitXMLDeclaration</code> to tell the filter not to output the XML declaration.

For example:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options omitXMLDeclaration="yes"/>
</its:rules></pre>

Remember that XML documents without an XML declaration may be read incorrectly if the encoding of the document is not UTF-8, UTF-16 or UTF-32.
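The risk is easy to reproduce with a generic XML parser. In the Python sketch below, the same ISO-8859-1 bytes are read correctly with a declaration and rejected without one, because the parser then assumes UTF-8. The sample content is an assumption, and other parsers may misread the text rather than fail.

```python
import xml.etree.ElementTree as ET

# "café" encoded in ISO-8859-1: the é is a single byte that is invalid UTF-8 here.
body = "<doc>caf\u00e9</doc>".encode("iso-8859-1")
declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>'

with_declaration = ET.fromstring(declaration + body).text

try:
    ET.fromstring(body)
    failed_without_declaration = False
except ET.ParseError:
    failed_without_declaration = True

print(with_declaration, failed_without_declaration)
```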

====escapeQuotes====

By default, when processing the document, the filter uses double-quotes to enclose all attributes (translatable or not) and uses the following rules for escaping or not escaping the literal quotes:

* Inside the attribute values:
** Single-quotes (=apostrophes) are never escaped
** Double-quotes are always escaped
* In element content:
** Single-quotes (=apostrophes) are not escaped
** Double-quotes are escaped by default

You cannot change the escaping rules for attributes.

For element content: If the document is processed without triggering any rule that allows the translation of an attribute, then (and only then) the filter takes into account the <code>escapeQuotes</code> option to escape or not escape double-quotes in the translatable content.

For example, the following parameters file tells the filter not to escape double-quotes in element content (for documents where no rule for translatable attributes is triggered):

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeQuotes="no"/>
</its:rules></pre>

====escapeGT====

By default the character '<code>></code>' is escaped. You can indicate to the filter to not escape it using the <code>escapeGT</code> option.

For example, the following parameters file indicates to not escape greater-than characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeGT="no"/>
</its:rules></pre>
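Both forms are well-formed in element content, so the option changes only how the output is written, not what a parser reads back. A minimal Python check, using an assumed sample element:

```python
import xml.etree.ElementTree as ET

# The escaped and the literal greater-than sign parse to the same text.
escaped = ET.fromstring("<p>a &gt; b</p>").text
unescaped = ET.fromstring("<p>a > b</p>").text
print(escaped == unescaped)
```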

====escapeNbsp====

By default the non-breaking space character is escaped (in the form <code>&#x00a0;</code>). You can indicate to the filter to not escape it using the <code>escapeNbsp</code> option.

For example, the following parameters file indicates to not escape the non-breaking space characters:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options escapeNbsp="no"/>
</its:rules></pre>

====extractIfOnlyCodes====

By default all extractable entries are extracted even when they contain only white-spaces and/or inline codes. You can tell the filter not to extract such entries using the <code>extractIfOnlyCodes</code> option.

For example, the following parameters file tells the filter not to extract entries with only white-spaces and/or inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractIfOnlyCodes="no"/>
</its:rules></pre>

====inlineCdata====

By default, CDATA sections will be exposed as regular content, and the CDATA markers themselves will be discarded. When the <code>inlineCdata</code> option is set,
the CDATA markers will be exposed as inline codes.

For example, the following parameters file will expose CDATA markers as inline codes:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options inlineCdata="yes"/>
</its:rules></pre>
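The default behavior resembles what a generic XML parser does: the CDATA content is merged into the surrounding text and the markers disappear. The Python sketch below shows this parser behavior on an assumed sample. It does not show the filter's inline codes.

```python
import xml.etree.ElementTree as ET

# The CDATA markers are dropped; only their content remains in the text.
element = ET.fromstring("<data>before <![CDATA[<b>raw</b>]]> after</data>")
print(element.text)
```

With <code>inlineCdata</code> set, the opening and closing markers would instead be kept as inline codes around the content.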

====extractUntranslatable====

By default, untranslatable entries (<code>its:translate="no"</code>) are not extracted. To extract such entries, for example to provide context, use the <code>extractUntranslatable</code> option.

Below is an example of this option declaration:

<pre><its:rules version="1.0"
xmlns:its="http://www.w3.org/2005/11/its"
xmlns:okp="okapi-framework:xmlfilter-options">
<okp:options extractUntranslatable="yes"/>
</its:rules></pre>

==Limitations==

* Currently, in some cases, the ITS rule <code>withinTextRule</code> with the value <code>nested</code> may act like it has a value <code>yes</code> instead.
* In output, the values of the <code>xml:lang</code> attributes are not updated to reflect the target language.
* When doing the extraction, the whole input file is loaded into memory. You may run into memory limitations if the document is very large.

[[Category:Filters]] [[Category:ITS]]

OpenXML Filter

2023-08-09T03:43:02Z

Dkonovalyenko: /* Word Options */

OpenXML Filter

2023-07-07T18:37:17Z

Dkonovalyenko: /* Word Options */

IDML Filter

2023-02-12T15:06:45Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread and, within each spread, by story, in the order the stories appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.
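The gathering step can be sketched in Python. The sketch assumes that text frames and text paths point to their story through a <code>ParentStory</code> attribute and that a spread file uses the <code>idPkg</code> packaging namespace. The spread content is a simplified, made-up sample. Taking stories in document order and dropping repeats are also assumptions, and the filter's actual logic may differ.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical spread content; real spreads carry many more attributes.
spread_xml = """<idPkg:Spread xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging">
  <Spread Self="s1">
    <TextFrame Self="f1" ParentStory="u10"/>
    <GraphicLine Self="g1">
      <TextPath Self="t1" ParentStory="u20"/>
    </GraphicLine>
    <TextFrame Self="f2" ParentStory="u10"/>
  </Spread>
</idPkg:Spread>"""

def stories_in_spread(xml_text):
    """Return story IDs used by TextFrame/TextPath, in document order, without repeats."""
    stories = []
    for element in ET.fromstring(xml_text).iter():
        if element.tag in ("TextFrame", "TextPath"):
            story = element.get("ParentStory")
            if story and story not in stories:
                stories.append(story)
    return stories

story_ids = stories_in_spread(spread_xml)
print(story_ids)
```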

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline, otherwise, they are represented as referencing groups of textual units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

<cite>Special character pattern</cite> — (default is " | | | | | | | | | ||‌||‑|"). A matched content is treated as inline code.

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code><Br/></code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2023-02-12T15:02:28Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread and, within each spread, by story, in the order the stories appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline, otherwise, they are represented as referencing groups of textual units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

<cite>Special character pattern</cite> — (default is " | | | | | | | | | ||‌||‑|")

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code><Br/></code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2023-02-10T02:30:20Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread and, within each spread, by story, in the order the stories appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set to true, the hyperlink text sources are extracted inline, otherwise, they are represented as referencing groups of textual units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Extract external hyperlinks</cite> — (default is false). When it is set to true, the external hyperlinks are extracted for translation.

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code><Br/></code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2023-01-23T02:48:03Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread and, within each spread, by story, in the order the stories appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is false). When it is set, the hyperlink text sources are extracted inline, otherwise, they are represented as referencing groups of textual units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code><Br/></code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

IDML Filter

2023-01-09T03:27:02Z

Dkonovalyenko: /* Parameters */

{{Filters Header}}
==Overview==

This filter allows you to process IDML documents. IDML (InDesign Markup Language) is an XML-based format, introduced in Adobe InDesign CS4, for representing InDesign content. IDML is used in several InDesign and InCopy file types. The specification can be found [http://www.adobe.com/content/dam/Adobe/en/devnet/indesign/cs5_docs/idml/idml-specification.pdf on the Adobe Web site].

==Processing Details==

When processing an IDML document, the filter looks at all the spreads in the document and, for each of them, gathers the list of the stories used in <code><TextFrame></code> and <code><TextPath></code>. The text is extracted by spread and, within each spread, by story, in the order the stories appear in the spread.

Stories embedded inside other stories and not declared at a spread level are extracted in a special group.

==Parameters==

<cite>Maximum attribute size</cite> — Set the size in MB for the attribute buffer. The default is 4MB (4 * 1024 * 1024)

<cite>Untag XML Structures</cite> — Set this option to skip embedded XML structural information when extracting translatable content.

<cite>Extract notes</cite> — Set this option to extract the content of notes (<code><Note></code> elements).

<cite>Extract master spreads</cite> — Set this option to extract the content of the master spreads if they exist. If this option is not set only the normal spreads are extracted.

<cite>Extract hidden layers</cite> — Set this option to also extract the hidden layers.

<cite>Extract hidden pasteboard items</cite> — (default is false)

<cite>Skip discretionary hyphens</cite> — (default is false)

<cite>Extract breaks inline</cite> — (default is false)

<cite>Extract hyperlink text sources inline</cite> — (default is true). When it is set, the hyperlink text sources are extracted inline, otherwise, they are represented as referencing groups of textual units.

<cite>Extract custom text variables</cite> — (default is false)

<cite>Extract index topics</cite> — (default is false)

<cite>Ignore character kerning</cite> — (default is false)

<cite>Ignore character tracking</cite> — (default is false)

<cite>Ignore character leading</cite> — (default is false)

<cite>Ignore character baseline shift</cite> — (default is false)

==Deprecated Parameters==

Prior to release M34, the filter supported several additional parameters. The behavior of these has been subsumed by the more intelligent content processing performed by the updated version of the filter in versions M34 and later.

<cite>Simplify inline codes when possible</cite> — Set this option to reduce the number of inline codes by re-grouping adjacent codes when it is possible.

<cite>Create new text units on hard returns</cite> — Set this option to create separate text units when a hard return element (<code><Br/></code>) is found. '''IMPORTANT: This option is not complete yet. Setting it may create extracted documents that you will not be able to merge back. Always test the merge before using this option in production.'''

<cite>Maximum spread size</cite> — Set the maximum size for the spread files (in KBytes). Any spread file above the given value will either generate an error or be skipped from extraction, depending on the specified option. This allows you to skip over large spread files that may contain only graphics and require too much memory to be opened. Note that the skipped files are not checked for translatable text.

<cite>Generate an error when a spread is larger than the specified value</cite> — Set this option to generate an error if a spread size is above the specified <cite>Maximum spread size</cite>. If this option is not set, the spread is skipped with a warning message.

[[Category:Filters]]

OpenXML Filter

2022-12-14T04:12:41Z

Dkonovalyenko: /* PowerPoint Options */

OpenXML Filter

2022-11-14T16:24:02Z

Dkonovalyenko: /* Excel Options */

Filters

2022-06-22T18:45:37Z

Dkonovalyenko: /* Font Mapping */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly.
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| TeX files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

All filters support code simplification rules. By default the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>
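As a small illustration of the separator rule above, the following two forms should be read identically by the parser. The two rules are only illustrative and reuse syntax shown under Rule Examples:

<pre>
if TYPE = "lb"; if DATA ~ "a.*";
</pre>

<pre>
if TYPE = "lb";
if DATA ~ "a.*";
</pre>

Because multiple rules are OR'ed together, a code is left unsimplified here if its type is <code>lb</code> or if its data matches the regex.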

===Rule Examples===

If the code has any of these flags, do not simplify it:

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

The <code>=</code> operator is a string match. <code>TAGTYPE</code> matches the basic tag type: opening, closing or standalone.

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

The <code>~</code> operator is a regex match.

<pre>if DATA ~ "a.*";</pre>

Any of the match operators can be negated. Do not simplify if <code>DATA</code> does not match the regex:

<pre>if DATA !~ "a.*";</pre>

Match on type. In this case, do not simplify if the code is a linebreak:

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be recursive (supports embedded parens)

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules>
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.

The following font mapping configuration options are available:
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale.
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale.
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order they are stated, and the final target font value is determined by sequential substitution of the source font values. For example, if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping produces the <code>Times New Roman</code> replacement and the second one is applied to this new value, ending up with <code>Sans Serif</code>.

The serialized parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>

When source locale, target locale and source font are omitted:

<pre>
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>

This is equivalent to:

<pre>
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>
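
As a sketch of the sequential substitution described earlier in this section, the two chained mappings (<code>Arial</code> -> <code>Times New Roman</code> -> <code>Sans Serif</code>) could be serialized as follows. It assumes the locale patterns are omitted so that both mappings apply to any language pair; the font names are the illustrative ones used above, not a recommended setup:

<pre>
fontMappings.0.sourceFontPattern=Arial
fontMappings.0.targetFont=Times New Roman
fontMappings.1.sourceFontPattern=Times New Roman
fontMappings.1.targetFont=Sans Serif
fontMappings.number.i=2
</pre>

With this order, text set in <code>Arial</code> would be expected to end up as <code>Sans Serif</code>. If the two entries were swapped, the same text should stay at <code>Times New Roman</code>, because the mapping that matches <code>Times New Roman</code> would be evaluated before that value is produced.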

[[Category:Filters]]

OpenXML Filter

2022-06-21T14:42:42Z

Dkonovalyenko: /* Excel Options */

Filters

2021-10-06T07:19:14Z

Dkonovalyenko: /* Font Mapping */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly.
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| TeX files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

All filters support code simplification rules. By default the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>

===Rule Examples===

If the code has any of these flags, do not simplify it:

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

The <code>=</code> operator is a string match. <code>TAGTYPE</code> matches the basic tag type: opening, closing or standalone.

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

The <code>~</code> operator is a regex match.

<pre>if DATA ~ "a.*";</pre>

Any of the match operators can be negated. Do not simplify if <code>DATA</code> does not match the regex:

<pre>if DATA !~ "a.*";</pre>

Match on type. In this case, do not simplify if the code is a linebreak:

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be recursive (supports embedded parens)

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules>
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX and PPTX documents) filters.

The following font mapping configuration options are available:
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale.
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale.
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order they are stated, and the final target font value is determined by sequential substitution of the source font values. For example, if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping produces the <code>Times New Roman</code> replacement and the second one is applied to this new value, ending up with <code>Sans Serif</code>.

The serialized parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>

When source locale, target locale and source font are omitted:

<pre>
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>

This is equivalent to:

<pre>
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>

[[Category:Filters]]

FAQ

2021-10-06T06:56:22Z

Dkonovalyenko: /* Is there a users group or a support mailing list? */

==Capabilities==

====What formats are supported?====

The framework offers filters for many file formats, including XML, XLIFF, TMX, HTML, DOCX, ODT, Properties, PO, and many more. 
For a more complete list of the supported formats, see the "[[Filters]]" page.

Note that you can also create your own filter configurations to support some formats. You can also create your own filters and use them seamlessly with the Okapi tools.

====How do I extract text for translation?====

See the article "[[How to Extract Text for Translation]]" in the [[Knowledge Base]].

====Does Okapi provide a translation editor?====

Not at this time. The Okapi tools allow you to create translation packages in various formats that can be opened in different translation editors such as OmegaT, MemoQ, Trados Workbench, Swordfish, Wordfast, etc.

For translating XLIFF files see: "[[How to Translate XLIFF Documents]]".

====Does Okapi provide a TM (Translation Memory)?====

Yes. There are currently two TM engines implemented in the framework:

* [[Pensieve TM]] is the main TM engine.
* [[SimpleTM TM]] is a limited and older engine that '''is being progressively phased out'''.

You can also use third-party TM engines through the different [[Connectors|connectors]] that the framework provides. For example: the [[Translate Toolkit TM Connector|Translate Toolkit TM]], [[GlobalSight TM Connector|GlobalSight TM]], the [[OpenTran Translation Repository Connector|OpenTran Translation Repository]], [[MyMemory TM Connector|MyMemory]], etc. For a complete list and more details see the "[[Connectors]]" page.

====Does Okapi provide a MT (Machine Translation) system?====

Not at this time. But you can use different third-party MT systems through the connectors distributed with the framework. For example, you can work with [[Google MT v2 Connector|Google MT]], [[Apertium MT Connector|Apertium MT]], [[Microsoft Translator Connector|Microsoft Translator]], etc. For a complete list, see the [[Connectors|Connectors page]].

====Why are there several distributions? Isn't Java cross-platform?====

Yes, Java is cross-platform, and most of the Okapi code runs anywhere Java runs.
However, for better internationalization support and a more seamless integration with each platform, we have chosen to use Eclipse SWT (http://www.eclipse.org/swt) as the foundation for the UI of our applications. That library requires a different distribution for each platform and architecture.

Okapi's source code has been carefully designed to separate UI-dependent code and non-UI code, so most of the components (such as the [[Filters]], the [[Steps]] and the [[Connectors]]) can be used on any platform.

====Can I change the Java VM settings when running the tools?====

Yes. See [[How to Change the Java Parameters for Rainbow]]. You can follow the same steps for all Okapi tools.

==Simple Troubleshooting==

====Is there a Getting Started guide?====

Yes. See the "[[Getting Started]]" page.

====When I try to start Rainbow/Ratel/CheckMate nothing happens. What is wrong?====

* Check that you have the proper version of Java (1.7 or above).
* Make sure you have installed the correct distribution for your platform.
* If your machine is 32-bit make sure to have installed the 32-bit distribution.
* If your machine is 64-bit make sure to have installed the 64-bit distribution.

==Licenses==

====Under what license is the Okapi Framework developed?====

* The source code is under the [https://www.apache.org/licenses/LICENSE-2.0 Apache License, Version 2.0].
* The documentation is under [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution-ShareAlike License (CC-BY-SA)].

====Can I use Okapi's components in my applications?====

Yes. The project uses the Apache license, which allows open-source or commercial products to use our applications and components. See more information about the license at [https://www.apache.org/licenses/LICENSE-2.0].

==Support==

====Is there a users group or a support mailing list?====

Yes. There are two main mailing lists. Both have public archives, and both require registration to post a message:

* [https://groups.google.com/g/okapi-users Okapi users] is the group and mailing list '''for the end users'''.
* [http://groups.google.com/g/okapi-devel Okapi developers] is the group and mailing list '''for the developers''' working on the source code.

====How do I report bugs or request enhancement?====

* You can post a bug report or an enhancement request on [https://bitbucket.org/okapiframework/okapi/issues the issues tracking page] if you have a Bitbucket account (preferred).

* You can post a message to [https://groups.google.com/g/okapi-users the Okapi users group] if you are part of the group.

* You can just [mailto:okapitools@opentag.com?subject=Feedback send feedback by email].

==Miscellaneous==

====What does 'Okapi' mean?====

An okapi is an African animal looking somewhat like [http://en.wikipedia.org/wiki/Okapi a cross between a zebra and a giraffe]. Okapi is pronounced [http://en.wikipedia.org/wiki/Wikipedia:IPA_for_English /oʊˈkɑːpɪ/] ([http://www.m-w.com/cgi-bin/audio.pl?okapi001.wav=okapi hear it]).

The use of this name for the framework has its roots in much older projects. At some point it was an acronym for "Open Kit API".

====What happened to the .NET Okapi?====

The older version of the Okapi Framework for .NET is no longer developed. Its distribution and source code are still available here: http://sourceforge.net/projects/okapi/. All new development is now done in the Java branch.

====Where is Olifant?====

Olifant, the TMX editor, is currently only part of the .NET Okapi. It is still available [http://sourceforge.net/projects/okapi/files/ from the SourceForge project]. Note that Olifant is for Windows only.

==For developers==

====Getting set up====

* Check out the source code from Bitbucket using git clone: https://bitbucket.org/okapiframework/okapi
* Or, if you want to submit pull requests, first create a fork of the Okapi project.
* Import into your IDE. For example, in Eclipse go to File > Import > Maven > Existing Maven project.
If you want to keep several distinct Okapi repositories in the same Eclipse workspace (for instance, your fork and the main Okapi project), you need to assign a name template under the "Advanced" section in the first step of the import wizard.
* The "master" branch contains the latest release version. The "dev" branch contains the current work (the "snapshot" in Maven terms).
* See also: https://bitbucket.org/okapiframework/okapi/wiki/How%20to%20Contribute
Happy coding!

====How to build okapi-lib locally====

The Okapi Framework consists of Maven projects. However, in order to build the apps and lib projects locally, you need to use the Ant build configurations.

For instance, to create a local version of okapi-lib.jar, go to <code><OKAPI_HOME>/deployment/maven/</code> and run <code>ant -f build_okapi-lib.xml init okapiLib</code>. The jar will be generated in <code><OKAPI_HOME>/deployment/maven/dist_common/lib/</code>.

If you use the default build.xml by running the above command without the <code>-f</code> option, platform-specific distributions of the apps are created in addition to the platform-independent okapi-lib.jar.

FAQ

2021-10-06T06:51:13Z

Dkonovalyenko: /* How do I report bugs or request enhancement? */

==Capabilities==

====What formats are supported?====

The framework offers filters for many file formats, including XML, XLIFF, TMX, HTML, DOCX, ODT, Properties, PO, and many more. 
For a more complete list of the supported formats, see the "[[Filters]]" page.

Note that you can also create your own filter configurations to support some formats. You can also create your own filters and use them seamlessly with the Okapi tools.

====How do I extract text for translation?====

See the article "[[How to Extract Text for Translation]]" in the [[Knowledge Base]].

====Does Okapi provide a translation editor?====

Not at this time. The Okapi tools allow you to create translation packages in various formats that can be opened in different translation editors such as OmegaT, MemoQ, Trados Workbench, Swordfish, Wordfast, etc.

For translating XLIFF files see: "[[How to Translate XLIFF Documents]]".

====Does Okapi provide a TM (Translation Memory)?====

Yes. There are currently two TM engines implemented in the framework:

* [[Pensieve TM]] is the main TM engine.
* [[SimpleTM TM]] is a limited and older engine that '''is being progressively phased out'''.

You can also use third-party TM engines through the different [[Connectors|connectors]] that the framework provides. For example: the [[Translate Toolkit TM Connector|Translate Toolkit TM]], [[GlobalSight TM Connector|GlobalSight TM]], the [[OpenTran Translation Repository Connector|OpenTran Translation Repository]], [[MyMemory TM Connector|MyMemory]], etc. For a complete list and more details, see the "[[Connectors]]" page.

====Does Okapi provide a MT (Machine Translation) system?====

Not at this time. However, you can use various third-party MT systems through the connectors distributed with the framework. For example, you can work with [[Google MT v2 Connector|Google MT]], [[Apertium MT Connector|Apertium MT]], [[Microsoft Translator Connector|Microsoft Translator]], etc. For a complete list, see the [[Connectors|Connectors page]].

====Why are there several distributions? Isn't Java cross-platform?====

Yes, Java is cross-platform, and most of the Okapi code runs anywhere Java runs.
However, for better internationalization support and more seamless integration with each platform, we chose Eclipse SWT (http://www.eclipse.org/swt) as the foundation for the UI of our applications. That library requires a different distribution for each platform and architecture.

Okapi's source code has been carefully designed to separate UI-dependent code from non-UI code, so most of the components (such as the [[Filters]], the [[Steps]] and the [[Connectors]]) can be used on any platform.

====Can I change the Java VM settings when running the tools?====

Yes. See [[How to Change the Java Parameters for Rainbow]]. You can follow the same steps for all Okapi tools.

==Simple Troubleshooting==

====Is there a Getting Started guide?====

Yes. See the "[[Getting Started]]" page.

====When I try to start Rainbow/Ratel/CheckMate nothing happens. What is wrong?====

* Check that you have the proper version of Java (1.7 or above).
* Make sure you have installed the correct distribution for your platform.
* If your machine is 32-bit make sure to have installed the 32-bit distribution.
* If your machine is 64-bit make sure to have installed the 64-bit distribution.

==Licenses==

====Under what license is the Okapi Framework developed?====

* The source code is under [https://www.apache.org/licenses/LICENSE-2.0 Apache License, Version 2.0].
* The documentation is under [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution-ShareAlike License (CC-BY-SA)].

====Can I use Okapi's components in my applications?====

Yes. The project uses the Apache license, which allows open-source and commercial products to use our applications and components. For more information about the license, see [https://www.apache.org/licenses/LICENSE-2.0].

==Support==

====Is there a users group or a support mailing list?====

Yes. There are two main mailing lists. Both have public archives, and both require registration to post a message:

* [http://groups.yahoo.com/group/okapitools/ https://groups.yahoo.com/group/okapitools/] is the group and mailing list '''for the end users'''.
* [http://groups.google.com/group/okapi-devel https://groups.google.com/group/okapi-devel] is the group and mailing list '''for the developers''' working on the source code.

====How do I report bugs or request enhancement?====

* You can post a bug report or an enhancement request on [https://bitbucket.org/okapiframework/okapi/issues the issues tracking page] if you have a Bitbucket account (preferred).

* You can post a message to [https://groups.google.com/g/okapi-users the Okapi users group] if you are part of the group.

* You can just [mailto:okapitools@opentag.com?subject=Feedback send feedback by email].

==Miscellaneous==

====What does 'Okapi' mean?====

An okapi is an African animal looking somewhat like [http://en.wikipedia.org/wiki/Okapi a cross between a zebra and a giraffe]. Okapi is pronounced [http://en.wikipedia.org/wiki/Wikipedia:IPA_for_English /oʊˈkɑːpɪ/] ([http://www.m-w.com/cgi-bin/audio.pl?okapi001.wav=okapi hear it]).

The use of this name for the framework has its roots in much older projects. At some point it was an acronym for "Open Kit API".

====What happened to the .NET Okapi?====

The older version of the Okapi Framework for .NET is no longer developed. Its distribution and source code are still available here: http://sourceforge.net/projects/okapi/. All new development is now done in the Java branch.

====Where is Olifant?====

Olifant, the TMX editor, is currently only part of the .NET Okapi. It is still available [http://sourceforge.net/projects/okapi/files/ from the SourceForge project]. Note that Olifant is for Windows only.

==For developers==

====Getting set up====

* Check out the source code from Bitbucket using git clone: https://bitbucket.org/okapiframework/okapi
* Or, if you want to submit pull requests, first create a fork of the Okapi project.
* Import into your IDE. For example, in Eclipse go to File > Import > Maven > Existing Maven project.
If you want to keep several distinct Okapi repositories in the same Eclipse workspace (for instance, your fork and the main Okapi project), you need to assign a name template under the "Advanced" section in the first step of the import wizard.
* The "master" branch contains the latest release version. The "dev" branch contains the current work (the "snapshot" in Maven terms).
* See also: https://bitbucket.org/okapiframework/okapi/wiki/How%20to%20Contribute
Happy coding!

====How to build okapi-lib locally====

The Okapi Framework consists of Maven projects. However, in order to build the apps and lib projects locally, you need to use the Ant build configurations.

For instance, to create a local version of okapi-lib.jar, go to <code><OKAPI_HOME>/deployment/maven/</code> and run <code>ant -f build_okapi-lib.xml init okapiLib</code>. The jar will be generated in <code><OKAPI_HOME>/deployment/maven/dist_common/lib/</code>.

If you use the default build.xml by running the above command without the <code>-f</code> option, platform-specific distributions of the apps are created in addition to the platform-independent okapi-lib.jar.

Filters

2020-07-02T16:40:17Z

Dkonovalyenko: /* Font Mapping */

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| Tex files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.
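
As the table shows, several configurations share the same extension (for example <code>.xml</code> or <code>.po</code>), so an extension alone does not always identify a single configuration. The following Python sketch illustrates this with a few rows copied from the table. The <code>build_index</code> and <code>candidate_configs</code> helpers are hypothetical names used only for this illustration; they are not part of Okapi and do not describe how the Okapi tools select a configuration.

```python
import os
from collections import defaultdict

# A few (extension, configuration ID) pairs copied from the table above.
ROWS = [
    (".xml", "okf_xml-AndroidStrings"),
    (".json", "okf_json"),
    (".po", "okf_po"),
    (".po", "okf_po-monolingual"),
    (".xml", "okf_xml"),
    (".xml", "okf_xmlstream"),
]

def build_index(rows):
    """Group configuration IDs by lower-cased extension (hypothetical helper)."""
    index = defaultdict(list)
    for extension, config_id in rows:
        index[extension.lower()].append(config_id)
    return dict(index)

def candidate_configs(filename, index):
    """Return every configuration ID listed for the file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    return index.get(extension, [])

index = build_index(ROWS)
print(candidate_configs("strings.xml", index))
# ['okf_xml-AndroidStrings', 'okf_xml', 'okf_xmlstream']
print(candidate_configs("messages.po", index))
# ['okf_po', 'okf_po-monolingual']
```

When more than one candidate is returned, the choice has to be made from knowledge of the file's content rather than its name.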

==Code Simplification Rules==

All filters support code simplification rules. By default the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>

===Rule Examples===

If the code has any of these flags, don't simplify it:

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

"=" is a string match. TAGTYPE matches the basic tag type (opening, closing or standalone):

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

"~" is a regex match:

<pre>if DATA ~ "a.*";</pre>

You can negate any of the match operators. For example, don't simplify if the DATA does not match the regex:

<pre>if DATA !~ "a.*";</pre>

Match on the type; in this case, don't simplify if the code is a linebreak:

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types:

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be nested (embedded parentheses are supported):

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>
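
The semantics described in this section (each rule is a condition on an inline code, a rule that applies means "do not simplify", and multiple rules are OR'ed together) can be modeled in a few lines of Python. This is only an illustrative sketch: the <code>Code</code> class, its field names and the <code>keep_code</code> helper are hypothetical and are not part of the Okapi API or of its rule parser, and treating "~" as a match on the whole DATA string is an assumption.

```python
import re
from dataclasses import dataclass

@dataclass
class Code:
    """Hypothetical stand-in for an inline code (not an Okapi class)."""
    data: str
    code_type: str
    tagtype: str = "OPENING"  # only OPENING appears in the rule examples
    deletable: bool = False
    addable: bool = False
    cloneable: bool = False

# Each predicate mirrors one of the rule examples.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: c.deletable or c.addable or c.cloneable,
    # if DATA = "a" and TAGTYPE = OPENING;
    lambda c: c.data == "a" and c.tagtype == "OPENING",
    # if DATA ~ "a.*";  (whole-string matching is an assumption of this sketch)
    lambda c: re.fullmatch(r"a.*", c.data) is not None,
    # if TYPE = "lb";
    lambda c: c.code_type == "lb",
]

def keep_code(code, rules=RULES):
    """Return True if any rule applies, meaning the code must not be simplified."""
    return any(rule(code) for rule in rules)

print(keep_code(Code(data="<br/>", code_type="lb")))    # True
print(keep_code(Code(data="<b>", code_type="bold")))    # False
```

Because the rules are OR'ed, adding a rule can only protect more codes from simplification; it never re-enables simplification for a code that another rule already protects.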

===Filter Config Examples===

Examples of using simplifier rules within the filter config formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules>
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX documents) filters.

The following font mapping configuration options are available:
* The source language regular expression pattern: <code>en.*</code>, <code>en-UK</code>, etc. It can be left empty to apply the mapping to any source language.
* The target language regular expression pattern: <code>ru.*</code>, <code>ru-RU</code>, etc. It can be left empty to apply the mapping to any target language.
* The source font name regular expression pattern: <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be left empty to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order in which they are listed, and the final target font is determined by sequential substitution of the source font values. For example, if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping replaces the font with <code>Times New Roman</code>, and the second one is applied to this new value, ending up with <code>Sans Serif</code>.
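
The sequential substitution can be sketched as follows. This is a minimal Python model of the behavior described above, not the filters' actual implementation: the <code>FontMapping</code> tuple and the <code>map_font</code> function are hypothetical names, and requiring each pattern to match the whole value is an assumption.

```python
import re
from typing import NamedTuple

class FontMapping(NamedTuple):
    """Hypothetical container mirroring the four configuration options."""
    source_locale_pattern: str
    target_locale_pattern: str
    source_font_pattern: str
    target_font: str

def _matches(pattern, value):
    # An empty pattern applies to any value. Matching the whole value
    # (rather than a substring) is an assumption of this sketch.
    return pattern == "" or re.fullmatch(pattern, value) is not None

def map_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous one."""
    for m in mappings:
        if m.target_font == "":
            continue  # a mapping with an empty target font is ignored
        if (_matches(m.source_locale_pattern, source_locale)
                and _matches(m.target_locale_pattern, target_locale)
                and _matches(m.source_font_pattern, font)):
            font = m.target_font
    return font

mappings = [
    FontMapping("", "", "Arial", "Times New Roman"),
    FontMapping("", "", "Times New Roman", "Sans Serif"),
]
print(map_font("Arial", "en-US", "ru-RU", mappings))  # Sans Serif
```

Because each mapping is applied to the output of the previous one, the order of the entries matters: swapping the two mappings in the example would leave <code>Arial</code> mapped to <code>Times New Roman</code>.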

The serialized parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>
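
To show how the indexed keys group into individual mappings, here is a small Python sketch that reads lines in this format into a list ordered by index. The <code>parse_font_mappings</code> function is a hypothetical helper written for this illustration; it is not Okapi's own parameter reader and it only handles the key layout shown in the example.

```python
from collections import defaultdict

def parse_font_mappings(text):
    """Collect fontMappings.<index>.<field>=<value> lines, ordered by index."""
    entries = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("fontMappings.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".")
        # Skip the counter line (fontMappings.number.i) and anything unexpected.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        entries[int(parts[1])][parts[2]] = value
    return [entries[i] for i in sorted(entries)]

sample = """\
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
"""

parsed = parse_font_mappings(sample)
print(len(parsed))                      # 2
print(parsed[0]["sourceFontPattern"])   # Times.*
print(parsed[1]["targetFont"])          # Arial Unicode MS
```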

[[Category:Filters]]

Filters

2020-07-02T16:21:08Z

Dkonovalyenko:

Filters are the components that convert input documents from their native file format into a common internal set of [[Glossary#Resource|resources]] that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the [[Raw Document to Filter Events Step]] and the re-writing by the [[Filter Events to Raw Document Step]].

Note: The [[Okapi Filters Plugin for OmegaT]] allows you to use some of the filters directly from [http://www.omegat.org OmegaT].

==List of the Filters==

The framework distribution comes with the following filters:

{| cellpadding="8" width=100%
|- valign="top"
|
* [[Archive Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[ICML Filter]]
* [[IDML Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
|
* [[PDF Filter]]
* [[Pensieve TM Filter]]
* [[PHP Content Filter]]
* [[Plain Text Filter]]
* [[PO Filter]]
* [[Properties Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[TMX Filter]]
* [[Trados-Tagged RTF Filter]]
|
* [[Transifex Filter]]
* [[TS Filter]]
* [[TTX Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF-2 Filter]]
* [[XML Filter]]
* [[XML Stream Filter]]
* [[YAML Filter]]
|}

==Supported File Formats==

The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:

{| border="1" cellpadding="6" cellspacing="0"
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
| Doxygen-commented files || .c, .h, .cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[Plain Text Filter]] ||
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
| SRT (SubRip Text, subtitle files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
| Tab-Delimited files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
| TeX files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|}

Note that most filters allow you to [[Understanding Filter Configurations|create your own configurations]] to support more file formats.

==Code Simplification Rules==

All filters support code simplification rules. By default, the [[Inline Codes Simplifier Step]], the [[Simplification Filter]] and the [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (also known as simplification) of inline codes, which is not always desirable. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

===General Syntax===

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing at all, which makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>

===Rule Examples===

If a code has any of these flags, do not simplify it:

<pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>

"=" is a string match. TAGTYPE matches the basic tag type (opening, closing or standalone):

<pre>if DATA = "a" and TAGTYPE = OPENING;</pre>

"~" is a regular expression match:

<pre>if DATA ~ "a.*";</pre>

You can negate any of the match operators. For example, do not simplify if the DATA does not match the regular expression:

<pre>if DATA !~ "a.*";</pre>

Match on the type. In this case, do not simplify if the code is a linebreak:

<pre>if TYPE = "lb";</pre>

Don't simplify any rich text types

<pre>if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";</pre>

Expressions can be nested recursively using parentheses:

<pre>if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));</pre>
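
The way these expressions combine can be illustrated with a small Python model. This is only a sketch of the boolean logic described above, not the Okapi rule parser: the <code>Code</code> class and the <code>keep_unsimplified</code> function are hypothetical names introduced for illustration, and only the flag checks and the "=" operator are modelled.

```python
from dataclasses import dataclass


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not an Okapi class."""
    data: str = ""
    type: str = ""
    deletable: bool = False
    addable: bool = False
    cloneable: bool = False


def rule_flags(c: Code) -> bool:
    # Mirrors: if DELETABLE or ADDABLE or CLONEABLE;
    return c.deletable or c.addable or c.cloneable


def rule_nested(c: Code) -> bool:
    # Mirrors: if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
    return c.type == "bold" or (
        c.data == "bar" or (c.data == "foo" and c.type == "underline")
    )


RULES = [rule_flags, rule_nested]


def keep_unsimplified(c: Code) -> bool:
    """Rules are OR'ed together: one matching rule is enough to protect the code."""
    return any(rule(c) for rule in RULES)
```

In this model, a code with <code>data="foo"</code> and <code>type="underline"</code> is protected by the second rule, while the same data with <code>type="italic"</code> matches no rule and remains eligible for simplification.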

===Filter Config Examples===

The following examples show simplifier rules in the filter configuration formats used by Okapi.

'''YAML:'''

<pre>
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = " " or DATA = "" or DATA = "" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
</pre>

'''ITS:'''

<pre>
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">

<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules>
if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</okp:simplifierRules>
</its:rules>
</pre>

'''FPRM (Parameters):'''

<pre>
#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>

==Font Mapping==

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP.

The following font mapping configuration options are available:
* The source language regular expression pattern: <code>en.*</code>, <code>en-UK</code>, etc. It can be left empty to apply the mapping to any source language.
* The target language regular expression pattern: <code>ru.*</code>, <code>ru-RU</code>, etc. It can be left empty to apply the mapping to any target language.
* The source font name regular expression pattern: <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be left empty to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping is ignored.

The configured font mappings are applied in the order in which they are listed, and the final target font is determined by sequential substitution: each mapping is applied to the result of the previous one. For example, given two mappings:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
the first mapping replaces <code>Arial</code> with <code>Times New Roman</code>, and the second is then applied to this new value, giving <code>Sans Serif</code> as the final font.
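
The sequential substitution described above can be sketched in Python. This is a minimal illustration, not the Okapi implementation: <code>FontMapping</code> and <code>resolve_font</code> are hypothetical names, Python's <code>re</code> module stands in for whatever regular expression engine the filters use, and the sketch assumes that a pattern must match the whole value.

```python
import re
from typing import NamedTuple


class FontMapping(NamedTuple):
    """Hypothetical container for one mapping; field names are illustrative."""
    source_locale: str  # regex pattern, empty = any source language
    target_locale: str  # regex pattern, empty = any target language
    source_font: str    # regex pattern, empty = any source font
    target_font: str    # literal font name, must not be empty


def _matches(pattern: str, value: str) -> bool:
    # Assumption: an empty pattern matches anything, and a non-empty
    # pattern has to match the whole value.
    return pattern == "" or re.fullmatch(pattern, value) is not None


def resolve_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order, feeding each result into the next one."""
    for m in mappings:
        if not m.target_font:
            continue  # a mapping with an empty target font is ignored
        if (_matches(m.source_locale, source_locale)
                and _matches(m.target_locale, target_locale)
                and _matches(m.source_font, font)):
            font = m.target_font
    return font


mappings = [
    FontMapping("", "", "Arial", "Times New Roman"),
    FontMapping("", "", "Times New Roman", "Sans Serif"),
]
print(resolve_font("Arial", "en-US", "ru-RU", mappings))  # prints "Sans Serif"
```

Because the result of each mapping feeds the next, the order of the mappings matters: reversing the two entries above would leave <code>Arial</code> mapped to <code>Times New Roman</code>.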

The serialized parameters can look like this:

<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>
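
To show how the indexed keys group into individual mappings, the following Python sketch reads the serialized form shown above into a list of dictionaries. It is not the Okapi parameters reader: <code>parse_font_mappings</code> is a hypothetical helper, and it assumes that <code>fontMappings.number.i</code> holds the number of mappings and that a missing key is equivalent to an empty value.

```python
FIELDS = ("sourceLocalePattern", "targetLocalePattern",
          "sourceFontPattern", "targetFont")


def parse_font_mappings(text: str) -> list:
    """Group 'fontMappings.<index>.<field>=value' lines by index."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines such as "#v1"
        key, _, value = line.partition("=")
        props[key] = value

    # Assumption: this key gives the number of mappings to read.
    count = int(props.get("fontMappings.number.i", "0"))
    return [
        {field: props.get(f"fontMappings.{i}.{field}", "") for field in FIELDS}
        for i in range(count)
    ]


SAMPLE = """\
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
"""

for mapping in parse_font_mappings(SAMPLE):
    print(mapping["sourceFontPattern"], "->", mapping["targetFont"])
```

The index in each key preserves the order of the mappings, which matters because they are applied sequentially.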

[[Category:Filters]]