Filters: Difference between revisions

From Okapi Framework
 
(39 intermediate revisions by 4 users not shown)
Line 13: Line 13:
* [[DTD Filter]]
* [[DTD Filter]]
* [[Doxygen Filter]]
* [[Doxygen Filter]]
* [[EPUB Filter]]
* [[HTML Filter]]
* [[HTML Filter]]
* [[HTML5-ITS Filter]]
* [[HTML5-ITS Filter]]
Line 19: Line 20:
* [[JSON Filter]]
* [[JSON Filter]]
* [[Markdown Filter]]
* [[Markdown Filter]]
* [[Message Format Filter]]
* [[MIF Filter]]
* [[MIF Filter]]
* [[Moses Text Filter]]
* [[Moses Text Filter]]
* [[Multi-Parsers Filter]]
* [[OpenOffice Filter]]
* [[OpenOffice Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
* [[OpenXML Filter|OpenXML (MS Office) Filter]]
Line 32: Line 35:
* [[Rainbow Translation Kit Filter]]
* [[Rainbow Translation Kit Filter]]
* [[Regex Filter]]
* [[Regex Filter]]
* [[SDL Trados Package Filter]]
* [[Simplification Filter]]
* [[Simplification Filter]]
* [[Table Filter]]
* [[Table Filter]]
Line 42: Line 46:
* [[TXML Filter]]
* [[TXML Filter]]
* [[Wiki Filter]]
* [[Wiki Filter]]
* [[Versified Text Filter]]
* [[WSXZ Package Filter]]
* [[Vignette Filter]]
* [[Vignette Filter]]
* [[XLIFF Filter]]
* [[XLIFF Filter]]
Line 55: Line 59:
The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:
The following is a list of some of the file formats supported by the distribution through [[Understanding Filter Configurations|pre-defined configurations]]:


{| border="1" cellpadding="5" cellspacing="0"
{| border="1" cellpadding="6" cellspacing="0"
|+
|+
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter'''
| '''Format''' || '''Extensions''' || '''Pre-Defined Configuration''' || '''Filter''' || '''Notes'''
|- valign="top"
|- valign="top"
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]]
| Android Strings || .xml || <code>okf_xml-AndroidStrings</code> || [[XML Filter]] ||
|- valign="top"
|- valign="top"
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]]
| Apple Stringsdict || .stringsdict || <code>okf_xml-AppleStringsdict</code> || [[XML Filter]] ||
|- valign="top"
|- valign="top"
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]]
| Archive || .zip || <code>okf_archive</code> || [[Archive Filter]] || Meta filter that processes zip files with various formats as one file.
|- valign="top"
|- valign="top"
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]]
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
|- valign="top"
|- valign="top"
| Doxygen-commented files || .c, .h, cpp || <code>okf_doxygen</code> || [[Doxygen Filter]]
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
|- valign="top"
|- valign="top"
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]]
| CSV (Multiple complex sub-formats) || .csv || <code>okf_multiparsers</code> || [[Multi-Parsers Filter]] ||
|- valign="top"
|- valign="top"
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]]
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
|- valign="top"
|- valign="top"
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]]
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. &lt;footnote> is not handled properly.
|- valign="top"
|- valign="top"
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]]
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
|- valign="top"
|- valign="top"
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]]
| Doxygen-commented files || .c, .h, cpp || <code>okf_doxygen</code> || [[Doxygen Filter]] ||
|- valign="top"
|- valign="top"
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]]
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
|- valign="top"
|- valign="top"
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]]
| EPUB || .epub || <code>okf_epub</code> || [[EPUB Filter]] ||
|- valign="top"
|- valign="top"
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]]
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
|- valign="top"
|- valign="top"
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]]
| Idiom WorldServer XLIFF || .xlf || <code>okf_xliff-iws</code> || [[XLIFF Filter]] ||
|- valign="top"
|- valign="top"
| JSON || .json || <code>okf_json</code> || [[JSON Filter]]
| InCopy ICML || .wcml || <code>okf_icml</code> || [[ICML Filter]] ||
|- valign="top"
|- valign="top"
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]]
| InDesign IDML || .idml || <code>okf_idml</code> || [[IDML Filter]] ||
|- valign="top"
|- valign="top"
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]]
| iOS/Mac Strings|| .strings || <code>okf_regex-macStrings</code> || [[Regex Filter]] ||
|- valign="top"
|- valign="top"
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]]
| Java Properties || .properties || <code>okf_properties</code> || [[Properties Filter]] ||
|- valign="top"
|- valign="top"
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]]
| Java Properties (Output not escaped) || .properties || <code>okf_properties-outputNotEscaped</code> || [[Properties Filter]] ||
|- valign="top"
|- valign="top"
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]]
| Java XML Properties || .xml || <code>okf_xml-JavaProperties</code> || [[XML Filter]] ||
|- valign="top"
|- valign="top"
| Microsoft Excel 2007/2010 || .xslx, .xltx, || <code>okf_openxml</code> || [[OpenXML Filter]]
| Java XML Properties (HTML strings) || .xml || <code>okf_xmlstream-JavaPropertiesHTML</code> || [[XML Stream Filter]] ||
|- valign="top"
|- valign="top"
| Microsoft PowerPoint 2007/2010 || .pptx, .potx || <code>okf_openxml</code> || [[OpenXML Filter]]
| JSON || .json || <code>okf_json</code> || [[JSON Filter]] ||
|- valign="top"
|- valign="top"
| Microsoft Word 2007/2010 || .docx, dotx || <code>okf_openxml</code> || [[OpenXML Filter]]
| Haiku CatKeys || .catkeys || <code>okf_table_catkeys</code> || [[Table Filter]] ||
|- valign="top"
|- valign="top"
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]]
| HTML (any) || .html, .htm || <code>okf_html</code> || [[HTML Filter]] ||
|- valign="top"
|- valign="top"
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]]
| HTML (Well-formed, and XHTML) || .html, .htm|| <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
|- valign="top"
| OpenOffice.org Calc || .ods, .ots || <code>okf_openoffice</code> || [[OpenOffice Filter]]
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
|- valign="top"
|- valign="top"
| OpenOffice.org Draw || .odg, .otg || <code>okf_openoffice</code> || [[OpenOffice Filter]]
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
|- valign="top"
|- valign="top"
| OpenOffice.org Impress || .odp, .otp || <code>okf_openoffice</code> || [[OpenOffice Filter]]
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
|- valign="top"
| OpenOffice.org Writer || .odt, .ott || <code>okf_openoffice</code> || [[OpenOffice Filter]]
| Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
|- valign="top"
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]]
| Microsoft Visio || .vsdx, .vsdm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
|- valign="top"
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]]
| Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
|- valign="top"
|- valign="top"
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[ Plain Text Filter]]
| MIF || .mif || <code>okf_mif</code> || [[MIF Filter]] ||
|- valign="top"
|- valign="top"
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]]
| Moses Text || .txt || <code>okf_mosestext</code> || [[Moses Text Filter]] ||
|- valign="top"
|- valign="top"
| PO || .po || <code>okf_po</code> || [[PO Filter]]
| OpenOffice.org Calc || .ods, .ots || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
|- valign="top"
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]]
| OpenOffice.org Draw || .odg, .otg || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
|- valign="top"
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]]
| OpenOffice.org Impress || .odp, .otp || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
|- valign="top"
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]]
| OpenOffice.org Writer || .odt, .ott || <code>okf_odf</code> || [[OpenOffice Filter]] ||
|- valign="top"
|- valign="top"
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]]
| PDF || .pdf || <code>okf_pdf</code> || [[PDF Filter]] ||
|- valign="top"
|- valign="top"
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff</code> || [[XLIFF Filter]]
| [[Pensieve TM]] || .pentm || <code>okf_pensieve</code> || [[Pensieve TM Filter]] ||
|- valign="top"
|- valign="top"
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]]
| PHP Content || .php || <code>okf_phpcontent</code> || [[PHP Content Filter]] || Can be used as a subfilter only
|- valign="top"
|- valign="top"
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]]
| Plain Text (Line = text unit) || .txt || <code>okf_plaintext</code> || [[ Plain Text Filter]] ||
|- valign="top"
|- valign="top"
| Tab-Delimiter files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]]
| Plain Text (Paragraph = text unit) || .txt || <code>okf_plaintext_paragraphs</code> || [[Plain Text Filter]] ||
|- valign="top"
|- valign="top"
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]]
| PO || .po || <code>okf_po</code> || [[PO Filter]] ||
|- valign="top"
|- valign="top"
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]]
| PO (Monolingual style) || .po || <code>okf_po-monolingual</code> || [[PO Filter]] ||
|- valign="top"
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]]
| Rainbow Translation Kit manifests || .rkm || <code>okf_rainbowkit</code> || [[Rainbow Translation Kit Filter]] || Used as a tkit reader only
|- valign="top"
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]]
| Regex (Any text-based format) || .txt || <code>okf_regex</code> || [[Regex Filter]] ||
|- valign="top"
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]]
| RDF (Mozilla RDF) || .rdf || <code>okf_xml-MozillaRDF</code> || [[XML Filter]] ||
|- valign="top"
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]]
| RESX || .resx || <code>okf_xml-resx</code> || [[XML Filter]] ||
|- valign="top"
|- valign="top"
| Versified Text || .vrsz || <code>okf_versifiedtxt</code> || [[Versified Text Filter]]
| SDLPPX || .sdlppx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]]
| SDLRPX || .sdlrpx || <code>okf_sdlpackage</code> || [[SDL Trados Package Filter]] ||
|- valign="top"
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]]
| SDL[[XLIFF]] || .sdlxlf || <code>okf_xliff-sdl</code> || [[XLIFF Filter]] ||
|- valign="top"
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]]
| Skype Language Files || .lang || <code>okf_properties-skypeLang</code> || [[Properties Filter]] ||
|- valign="top"
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]]
| SRT (Sub-Rip Text, sub-titles files) || .srt || <code>okf_regex-srt</code> || [[Regex Filter]] ||
|- valign="top"
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]]
| Tab-Delimiter files || .tsv, .txt || <code>okf_table_tsv</code> || [[Table Filter]] ||
|- valign="top"
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]]
| Tex files || .tex || <code>okf_tex</code> || [[TEX Filter]] ||
|- valign="top"
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]]
| [[TMX]] || .tmx || <code>okf_tmx</code> || [[TMX Filter]] ||
|- valign="top"
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]]
| Transifex project || .txp || <code>okf_transifex</code> || [[Transifex Filter]] ||
|- valign="top"
| Trados-Tagged RTF || .rtf || <code>okf_tradosrtf</code> || [[Trados-Tagged RTF Filter]] ||
|- valign="top"
| TS - Qt TS files || .ts || <code>okf_ts</code> || [[TS Filter]] ||
|- valign="top"
| TTX - Trados TagEditor TTX files || .ttx || <code>okf_ttx</code> || [[TTX Filter]] ||
|- valign="top"
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] ||
|- valign="top"
| Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
|- valign="top"
| WSXZ Package Filter || .wsxz || <code>okf_wsxzpackage</code> || [[WSXZ Package Filter]] ||
|- valign="top"
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
|- valign="top"
| WIX (Windows Installer XML) localization files || .wix || <code>okf_xml-WixLocalization</code> || [[XML Filter]] ||
|- valign="top"
| [[XLIFF]] v1.2 || .xlf, .xliff || <code>okf_xliff</code> || [[XLIFF Filter]] ||
|- valign="top"
| [[XLIFF]] v2 || .xlf || <code>okf_xliff2</code> || [[XLIFF-2 Filter]] ||
|- valign="top"
| XML (Generic, using [[ITS]] defaults) || .xml || <code>okf_xml</code> || [[XML Filter]] ||
|- valign="top"
| XML (Generic, using stream reader) || .xml || <code>okf_xmlstream</code> || [[XML Stream Filter]] ||
|- valign="top"
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
|- valign="top"
| Message Format (ICU Message Format Filter) || Any container format that supports subfilters || <code>okf_messageformat</code> || [[Message Format Filter]] ||
|}
|}


Line 184: Line 216:
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies it means "do not simplify the match code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together.
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies it means "do not simplify the match code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together.


For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>  
For more details see the JavaCC grammar: <code>../okapi/core/src/main/javacc/SimplifierRules.jj</code>


===Rule Examples===
===Rule Examples===
Line 208: Line 240:
Match on type, linebreak in this case, don't simplify  
Match on type, linebreak in this case, don't simplify  


<pre>if the Code is a linebreak if TYPE = "lb";</pre>
<pre>if TYPE = "lb";</pre>


Don't simplify any rich text types
Don't simplify any rich text types
Line 252: Line 284:
#v1
#v1
extractNotes.b=true
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
</pre>
==Font Mapping==
The font mapping can be considered as a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration - this helps to reduce the amount of reformatting and post-translation DTP. It is supported by IDML and OpenXML (DOCX, PPTX and XLSX documents) filters at the moment.
The following font mapping configuration options are available:
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale.
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale.
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found.
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored.
Also, the configured font mappings are applied in the order they are stated. And the final target font value is determined by a sequential
substitution of the source font values. I.e. if there is more than one mapping:
# <code>Arial</code> -> <code>Times New Roman</code>
# <code>Times New Roman</code> -> <code>Sans Serif</code>
then the first mapping will produce <code>Times New Roman</code> replacement and the second one will be applied to this new value, thus, ending up with the <code>Sans Serif</code>.
The parameters serialisation format can look like that:
<pre>
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
</pre>
When source locale, target locale and source font are omitted:
<pre>
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>
And this is the same as the abovementioned:
<pre>
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
</pre>
</pre>


[[Category:Filters]]
[[Category:Filters]]

Latest revision as of 15:54, 16 November 2023

Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.

Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.

List of the Filters

The framework distribution comes with the following filters:

Supported File Formats

The following is a list of some of the file formats supported by the distribution through pre-defined configurations:

Format || Extensions || Pre-Defined Configuration || Filter || Notes
Android Strings || .xml || okf_xml-AndroidStrings || XML Filter ||
Apple Stringsdict || .stringsdict || okf_xml-AppleStringsdict || XML Filter ||
Archive || .zip || okf_archive || Archive Filter || Meta filter that processes zip files with various formats as one file.
Auto Xliff || .xlf, .xliff || okf_autoxliff || Auto Xliff Filter || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
CSV (Comma-separated values files) || .csv, .txt || okf_table_csv || Table Filter ||
CSV (Multiple complex sub-formats) || .csv || okf_multiparsers || Multi-Parsers Filter ||
DITA || .dita, .ditamap, .xml || okf_xmlstream-dita || XML Stream Filter ||
DocBook v5.0 || .xml || okf_xml-docbook || XML Filter || Since Okapi 1.42. <footnote> is not handled properly.
DokuWiki pages || .txt || okf_wiki || Wiki Filter ||
Doxygen-commented files || .c, .h, .cpp || okf_doxygen || Doxygen Filter ||
DTD || .dtd || okf_dtd || DTD Filter ||
EPUB || .epub || okf_epub || EPUB Filter ||
Fixed-Width Columns Table || .txt || okf_table_fwc || Table Filter ||
Idiom WorldServer XLIFF || .xlf || okf_xliff-iws || XLIFF Filter ||
InCopy ICML || .wcml || okf_icml || ICML Filter ||
InDesign IDML || .idml || okf_idml || IDML Filter ||
iOS/Mac Strings || .strings || okf_regex-macStrings || Regex Filter ||
Java Properties || .properties || okf_properties || Properties Filter ||
Java Properties (Output not escaped) || .properties || okf_properties-outputNotEscaped || Properties Filter ||
Java XML Properties || .xml || okf_xml-JavaProperties || XML Filter ||
Java XML Properties (HTML strings) || .xml || okf_xmlstream-JavaPropertiesHTML || XML Stream Filter ||
JSON || .json || okf_json || JSON Filter ||
Haiku CatKeys || .catkeys || okf_table_catkeys || Table Filter ||
HTML (any) || .html, .htm || okf_html || HTML Filter ||
HTML (Well-formed, and XHTML) || .html, .htm || okf_html-wellFormed || HTML Filter ||
HTML5 (and XHTML5) || .html, .htm || okf_itshtml5 || HTML5-ITS Filter ||
Markdown || .md || okf_markdown || Markdown Filter ||
Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || okf_openxml || OpenXML Filter ||
Microsoft PowerPoint 2007/2010 || .pptx, .pptm, .potx, .potm, .ppsx, .ppsm || okf_openxml || OpenXML Filter ||
Microsoft Visio || .vsdx, .vsdm || okf_openxml || OpenXML Filter ||
Microsoft Word 2007/2010 || .docx, .docm, .dotx, .dotm || okf_openxml || OpenXML Filter ||
MIF || .mif || okf_mif || MIF Filter ||
Moses Text || .txt || okf_mosestext || Moses Text Filter ||
OpenOffice.org Calc || .ods, .ots || okf_odf || OpenOffice Filter ||
OpenOffice.org Draw || .odg, .otg || okf_odf || OpenOffice Filter ||
OpenOffice.org Impress || .odp, .otp || okf_odf || OpenOffice Filter ||
OpenOffice.org Writer || .odt, .ott || okf_odf || OpenOffice Filter ||
PDF || .pdf || okf_pdf || PDF Filter ||
Pensieve TM || .pentm || okf_pensieve || Pensieve TM Filter ||
PHP Content || .php || okf_phpcontent || PHP Content Filter || Can be used as a subfilter only
Plain Text (Line = text unit) || .txt || okf_plaintext || Plain Text Filter ||
Plain Text (Paragraph = text unit) || .txt || okf_plaintext_paragraphs || Plain Text Filter ||
PO || .po || okf_po || PO Filter ||
PO (Monolingual style) || .po || okf_po-monolingual || PO Filter ||
Rainbow Translation Kit manifests || .rkm || okf_rainbowkit || Rainbow Translation Kit Filter || Used as a tkit reader only
Regex (Any text-based format) || .txt || okf_regex || Regex Filter ||
RDF (Mozilla RDF) || .rdf || okf_xml-MozillaRDF || XML Filter ||
RESX || .resx || okf_xml-resx || XML Filter ||
SDLPPX || .sdlppx || okf_sdlpackage || SDL Trados Package Filter ||
SDLRPX || .sdlrpx || okf_sdlpackage || SDL Trados Package Filter ||
SDLXLIFF || .sdlxlf || okf_xliff-sdl || XLIFF Filter ||
Skype Language Files || .lang || okf_properties-skypeLang || Properties Filter ||
SRT (Sub-Rip Text, sub-titles files) || .srt || okf_regex-srt || Regex Filter ||
Tab-Delimiter files || .tsv, .txt || okf_table_tsv || Table Filter ||
Tex files || .tex || okf_tex || TEX Filter ||
TMX || .tmx || okf_tmx || TMX Filter ||
Transifex project || .txp || okf_transifex || Transifex Filter ||
Trados-Tagged RTF || .rtf || okf_tradosrtf || Trados-Tagged RTF Filter ||
TS - Qt TS files || .ts || okf_ts || TS Filter ||
TTX - Trados TagEditor TTX files || .ttx || okf_ttx || TTX Filter ||
TXML - Wordfast Pro TXML files || .txml || okf_txml || TXML Filter ||
Vignette Export/Import Content || .xml || okf_vignette || Vignette Filter ||
WSXZ Package Filter || .wsxz || okf_wsxzpackage || WSXZ Package Filter ||
XHTML || .html, .htm || okf_html-wellFormed || HTML Filter ||
WIX (Windows Installer XML) localization files || .wix || okf_xml-WixLocalization || XML Filter ||
XLIFF v1.2 || .xlf, .xliff || okf_xliff || XLIFF Filter ||
XLIFF v2 || .xlf || okf_xliff2 || XLIFF-2 Filter ||
XML (Generic, using ITS defaults) || .xml || okf_xml || XML Filter ||
XML (Generic, using stream reader) || .xml || okf_xmlstream || XML Stream Filter ||
YAML (Generic YAML filter) || .yml, .yaml || okf_yaml || YAML Filter ||
Message Format (ICU Message Format Filter) || Any container format that supports subfilters || okf_messageformat || Message Format Filter ||

Note that most filters allow you to create your own configurations to support more file formats.
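
Several configurations in the table share an extension (for example .xml and .txt), so an extension alone does not identify a single configuration. The following is a minimal sketch of that point, not part of Okapi: the helper names `configs_by_extension` and `candidates` are hypothetical, and only a handful of rows from the table are included.

```python
from collections import defaultdict

# A few (extension, configuration id) pairs copied from the table above.
ROWS = [
    (".xml", "okf_xml-AndroidStrings"),
    (".xml", "okf_xml"),
    (".xml", "okf_xmlstream"),
    (".txt", "okf_plaintext"),
    (".txt", "okf_table_csv"),
    (".csv", "okf_table_csv"),
    (".json", "okf_json"),
]


def configs_by_extension(rows):
    """Group configuration ids by lower-cased extension (hypothetical helper)."""
    index = defaultdict(list)
    for ext, config_id in rows:
        index[ext.lower()].append(config_id)
    return dict(index)


def candidates(index, filename):
    """Return every configuration id listed for the file's extension."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    return index.get(ext, [])


INDEX = configs_by_extension(ROWS)
```

Calling `candidates(INDEX, "strings.xml")` returns three configuration ids, so the user (or the calling tool) still has to choose one of them.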

Code Simplification Rules

All filters support code simplification rules. By default the Inline Codes Simplifier Step, Simplification Filter and Post-segmentation Inline Codes Removal Step maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.

General Syntax

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details see the JavaCC grammar: ../okapi/core/src/main/javacc/SimplifierRules.jj

Rule Examples

If Code has any of these flags then don't simplify

if DELETABLE or ADDABLE or CLONEABLE;

"=" is string match.
Match basic TAGTYPE: opening, closing or standalone

if DATA = "a" and TAGTYPE = OPENING;

"~" is regex match

if DATA ~ "a.*";

You can negate any of the match operators.
Don't simplify if the DATA does not match the regex

if DATA !~ "a.*";

Match on type, linebreak in this case, don't simplify

if TYPE = "lb";

Don't simplify any rich text types

if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";

Expressions can be recursive (supports embedded parens)

if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
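
The semantics described above (each rule is a condition on a code, and a code is left alone when any rule matches) can be modelled in a few lines. This is a rough sketch only, not the Okapi rule parser: the `Code` class, its field names and `keep_as_is` are hypothetical, and treating "~" as a whole-string regex match is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not the Okapi class."""
    data: str
    type: str = ""
    tagtype: str = "OPENING"
    flags: set = field(default_factory=set)


# Each predicate mirrors one rule from the examples above.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c.flags & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if DATA = "a" and TAGTYPE = OPENING;
    lambda c: c.data == "a" and c.tagtype == "OPENING",
    # if DATA ~ "a.*";  (assumed to match the whole string)
    lambda c: re.fullmatch(r"a.*", c.data) is not None,
    # if TYPE = "lb";
    lambda c: c.type == "lb",
]


def keep_as_is(code, rules=RULES):
    """True when at least one rule matches, i.e. the code is not simplified."""
    return any(rule(code) for rule in rules)
```

Because the rules are OR'ed, adding a rule can only protect more codes from simplification; it never re-enables simplification for a code that another rule already matched.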


Filter Config Examples

Examples of using simplifier rules within the filter config formats used by Okapi.

YAML:

simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";

ITS:

<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">
<!-- See ITS specification at: http://www.w3.org/TR/its/ -->
 <its:translateRule selector="//*" translate="yes"/>
 <its:withinTextRule selector="//codeph" withinText="yes"/>
 <its:withinTextRule selector="//ph" withinText="yes"/>
 <okp:simplifierRules>
 if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
 </okp:simplifierRules>
</its:rules>

FPRM (Parameters):

#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";

Font Mapping

Font mapping is a filter's ability to substitute font information in the target document on the fly, according to a provided configuration. This helps reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.

The following font mapping configuration options are available:

  • The source locale regular expression pattern: .*, en.*, en-UK, etc. It can be omitted to apply the mapping to any source locale.
  • The target locale regular expression pattern: .*, ru.*, ru-RU, etc. It can be omitted to apply the mapping to any target locale.
  • The source font name regular expression pattern: .*, Arial.*, Times New Roman, etc. It can be omitted to apply the mapping to any source font name found.
  • The target font name: Arial, Times New Roman, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order they are stated, and the final target font is determined by sequentially substituting the source font values. For example, if there is more than one mapping:

  1. Arial -> Times New Roman
  2. Times New Roman -> Sans Serif

then the first mapping produces Times New Roman, and the second one is applied to this new value, ending up with Sans Serif.
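
The sequential substitution can be sketched as follows. This is an illustration under stated assumptions rather than the filters' actual code: the `FontMapping` class and `map_font` function are hypothetical names, and the patterns are assumed to match the whole locale or font name.

```python
import re
from dataclasses import dataclass


@dataclass
class FontMapping:
    """Hypothetical model of one mapping; omitted patterns default to '.*'."""
    target_font: str
    source_font_pattern: str = ".*"
    source_locale_pattern: str = ".*"
    target_locale_pattern: str = ".*"


def map_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous."""
    for m in mappings:
        if not m.target_font:
            # A mapping with an empty target font is ignored.
            continue
        if (re.fullmatch(m.source_locale_pattern, source_locale)
                and re.fullmatch(m.target_locale_pattern, target_locale)
                and re.fullmatch(m.source_font_pattern, font)):
            font = m.target_font
    return font


CHAIN = [
    FontMapping("Times New Roman", source_font_pattern="Arial"),
    FontMapping("Sans Serif", source_font_pattern="Times New Roman"),
]
```

With `CHAIN`, a source font of Arial ends up as Sans Serif, while a font that matches neither pattern is returned unchanged. Reordering the two mappings would leave Arial mapped to Times New Roman, which is why the order of the configuration matters.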

The parameters can be serialised like this:

fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2

When source locale, target locale and source font are omitted:

fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1

This is equivalent to the previous example:

fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
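
To make the equivalence of the last two examples concrete, here is a small sketch that reads the `fontMappings.*` keys and fills in `.*` for any omitted pattern. It is not the Okapi parameters reader: `parse_font_mappings` is a hypothetical helper, and it only handles simple `key=value` lines.

```python
def parse_font_mappings(text):
    """Read fontMappings.N.* keys; omitted patterns default to '.*'."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as '#v1'
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    count = int(props.get("fontMappings.number.i", "0"))
    patterns = ("sourceLocalePattern", "targetLocalePattern", "sourceFontPattern")
    mappings = []
    for i in range(count):
        entry = {name: props.get(f"fontMappings.{i}.{name}", ".*") for name in patterns}
        entry["targetFont"] = props.get(f"fontMappings.{i}.targetFont", "")
        mappings.append(entry)
    return mappings


SHORT = """\
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

EXPLICIT = """\
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""
```

Parsing `SHORT` and `EXPLICIT` yields the same single mapping, which is the sense in which the two configurations above are the same.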