Filters
Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.
Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.
List of the Filters
The framework distribution comes with the following filters:
Supported File Formats
The following is a list of some of the file formats supported by the distribution through pre-defined configurations:
Format | Extensions | Pre-Defined Configuration | Filter | Notes |
Android Strings | .xml | okf_xml-AndroidStrings | XML Filter | |
Apple Stringsdict | .stringsdict | okf_xml-AppleStringsdict | XML Filter | |
Archive | .zip | okf_archive | Archive Filter | Meta filter that processes zip files with various formats as one file. |
Auto Xliff | .xlf, .xliff | okf_autoxliff | Auto Xliff Filter | Detects the version of an XLIFF file and then hands parsing off to the appropriate filter |
CSV (Comma-separated values files) | .csv, .txt | okf_table_csv | Table Filter | |
CSV (Multiple complex sub-formats) | .csv | okf_multiparsers | Multi-Parsers Filter | |
DITA | .dita, .ditamap, .xml | okf_xmlstream-dita | XML Stream Filter | |
DocBook v5.0 | .xml | okf_xml-docbook | XML Filter | Since Okapi 1.42. <footnote> is not handled properly. |
DokuWiki pages | .txt | okf_wiki | Wiki Filter | |
Doxygen-commented files | .c, .h, .cpp | okf_doxygen | Doxygen Filter | |
DTD | .dtd | okf_dtd | DTD Filter | |
EPUB | .epub | okf_epub | EPUB Filter | |
Fixed-Width Columns Table | .txt | okf_table_fwc | Table Filter | |
Idiom WorldServer XLIFF | .xlf | okf_xliff-iws | XLIFF Filter | |
InCopy ICML | .wcml | okf_icml | ICML Filter | |
InDesign IDML | .idml | okf_idml | IDML Filter | |
iOS/Mac Strings | .strings | okf_regex-macStrings | Regex Filter | |
Java Properties | .properties | okf_properties | Properties Filter | |
Java Properties (Output not escaped) | .properties | okf_properties-outputNotEscaped | Properties Filter | |
Java XML Properties | .xml | okf_xml-JavaProperties | XML Filter | |
Java XML Properties (HTML strings) | .xml | okf_xmlstream-JavaPropertiesHTML | XML Stream Filter | |
JSON | .json | okf_json | JSON Filter | |
Haiku CatKeys | .catkeys | okf_table_catkeys | Table Filter | |
HTML (any) | .html, .htm | okf_html | HTML Filter | |
HTML (Well-formed, and XHTML) | .html, .htm | okf_html-wellFormed | HTML Filter | |
HTML5 (and XHTML5) | .html, .htm | okf_itshtml5 | HTML5-ITS Filter | |
Markdown | .md | okf_markdown | Markdown Filter | |
Microsoft Excel 2007/2010 | .xlsx, .xlsm, .xltx, .xltm | okf_openxml | OpenXML Filter | |
Microsoft PowerPoint 2007/2010 | .pptx, .pptm, .potx, .potm, .ppsx, .ppsm | okf_openxml | OpenXML Filter | |
Microsoft Visio | .vsdx, .vsdm | okf_openxml | OpenXML Filter | |
Microsoft Word 2007/2010 | .docx, .docm, .dotx, .dotm | okf_openxml | OpenXML Filter | |
MIF | .mif | okf_mif | MIF Filter | |
Moses Text | .txt | okf_mosestext | Moses Text Filter | |
OpenOffice.org Calc | .ods, .ots | okf_odf | OpenOffice Filter | |
OpenOffice.org Draw | .odg, .otg | okf_odf | OpenOffice Filter | |
OpenOffice.org Impress | .odp, .otp | okf_odf | OpenOffice Filter | |
OpenOffice.org Writer | .odt, .ott | okf_odf | OpenOffice Filter | |
PDF | .pdf | okf_pdf | PDF Filter | |
Pensieve TM | .pentm | okf_pensieve | Pensieve TM Filter | |
PHP Content | .php | okf_phpcontent | PHP Content Filter | Can be used as a subfilter only |
Plain Text (Line = text unit) | .txt | okf_plaintext | Plain Text Filter | |
Plain Text (Paragraph = text unit) | .txt | okf_plaintext_paragraphs | Plain Text Filter | |
PO | .po | okf_po | PO Filter | |
PO (Monolingual style) | .po | okf_po-monolingual | PO Filter | |
Rainbow Translation Kit manifests | .rkm | okf_rainbowkit | Rainbow Translation Kit Filter | Used as a tkit reader only |
Regex (Any text-based format) | .txt | okf_regex | Regex Filter | |
RDF (Mozilla RDF) | .rdf | okf_xml-MozillaRDF | XML Filter | |
RESX | .resx | okf_xml-resx | XML Filter | |
SDLPPX | .sdlppx | okf_sdlpackage | SDL Trados Package Filter | |
SDLRPX | .sdlrpx | okf_sdlpackage | SDL Trados Package Filter | |
SDLXLIFF | .sdlxlf | okf_xliff-sdl | XLIFF Filter | |
Skype Language Files | .lang | okf_properties-skypeLang | Properties Filter | |
SRT (Sub-Rip Text, sub-titles files) | .srt | okf_regex-srt | Regex Filter | |
Tab-Delimited files | .tsv, .txt | okf_table_tsv | Table Filter | |
Tex files | .tex | okf_tex | TEX Filter | |
TMX | .tmx | okf_tmx | TMX Filter | |
Transifex project | .txp | okf_transifex | Transifex Filter | |
Trados-Tagged RTF | .rtf | okf_tradosrtf | Trados-Tagged RTF Filter | |
TS - Qt TS files | .ts | okf_ts | TS Filter | |
TTX - Trados TagEditor TTX files | .ttx | okf_ttx | TTX Filter | |
TXML - Wordfast Pro TXML files | .txml | okf_txml | TXML Filter | |
Vignette Export/Import Content | .xml | okf_vignette | Vignette Filter | |
WSXZ Package Filter | .wsxz | okf_wsxzpackage | WSXZ Package Filter | |
XHTML | .html, .htm | okf_html-wellFormed | HTML Filter | |
WIX (Windows Installer XML) localization files | .wix | okf_xml-WixLocalization | XML Filter | |
XLIFF v1.2 | .xlf, .xliff | okf_xliff | XLIFF Filter | |
XLIFF v2 | .xlf | okf_xliff2 | XLIFF-2 Filter | |
XML (Generic, using ITS defaults) | .xml | okf_xml | XML Filter | |
XML (Generic, using stream reader) | .xml | okf_xmlstream | XML Stream Filter | |
YAML (Generic YAML filter) | .yml, .yaml | okf_yaml | YAML Filter | |
Note that most filters allow you to create your own configurations to support more file formats.
Code Simplification Rules
All filters support code simplification rules. By default the Inline Codes Simplifier Step, Simplification Filter and Post-segmentation Inline Codes Removal Step maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.
General Syntax
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.
For more details see the JavaCC grammar: ../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj
Rule Examples
If Code has any of these flags then don't simplify
if DELETABLE or ADDABLE or CLONEABLE;
"=" is string match
Match basic TAGTYPE opening, closing or standalone
if DATA = "a" and TAGTYPE = OPENING;
"~" is regex match
if DATA ~ "a.*";
You can negate any of the match operators
Don't simplify if the DATA does not match the regex
if DATA !~ "a.*";
Match on type, linebreak in this case, don't simplify
if TYPE = "lb";
Don't simplify any rich text types
if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";
Expressions can be recursive (supports embedded parens)
if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
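As a rough illustration of how the rules combine, the sketch below models three of the examples above as Python predicates and OR's them together. It is not the Okapi parser: the Code class, RULES list and is_protected function are hypothetical names, and treating "~" as a full-string regex match is an assumption rather than documented behavior.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not an Okapi class."""
    data: str
    type: str = ""
    tagtype: str = "STANDALONE"  # OPENING, CLOSING or STANDALONE
    flags: set = field(default_factory=set)  # e.g. {"DELETABLE"}


# Each rule is a predicate. A rule that applies means "do not simplify".
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c.flags & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if TYPE = "lb";
    lambda c: c.type == "lb",
    # if DATA ~ "a.*";  (assumption: the regex must match the whole DATA)
    lambda c: re.fullmatch(r"a.*", c.data) is not None,
]


def is_protected(code):
    """Return True if any rule applies, since rules are OR'ed together."""
    return any(rule(code) for rule in RULES)


print(is_protected(Code(data="<br/>", type="lb")))   # True
print(is_protected(Code(data="<b>", type="bold")))   # False
```

A code that matches no rule stays eligible for the default trimming and merging.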
Filter Config Examples
Examples of using simplifier rules within the filter config formats used by Okapi.
YAML:
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
ITS:
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
  xmlns:okp="okapi-framework:xmlfilter-options">
  <!-- See ITS specification at: http://www.w3.org/TR/its/ -->
  <its:translateRule selector="//*" translate="yes"/>
  <its:withinTextRule selector="//codeph" withinText="yes"/>
  <its:withinTextRule selector="//ph" withinText="yes"/>
  <okp:simplifierRules>
    if ADDABLE or DELETABLE or CLONEABLE;
    if DATA ~ ".+";
  </okp:simplifierRules>
</its:rules>
FPRM (Parameters):
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
Font Mapping
Font mapping is a filter's ability to substitute font information in the target document automatically, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.
The following font mapping configuration options are available:
- The source locale regular expression pattern: .*, en.*, en-UK, etc. It can be omitted to apply the mapping to any source locale.
- The target locale regular expression pattern: .*, ru.*, ru-RU, etc. It can be omitted to apply the mapping to any target locale.
- The source font name regular expression pattern: .*, Arial.*, Times New Roman, etc. It can be omitted to apply the mapping to any source font name found.
- The target font name: Arial, Times New Roman, etc. It must not be empty; if it is, the mapping configuration is ignored.
The configured font mappings are applied in the order in which they are listed, and the final target font is determined by sequential substitution of the source font values. For example, given two mappings:
Arial -> Times New Roman
Times New Roman -> Sans Serif
the first mapping replaces Arial with Times New Roman, and the second is then applied to this new value, so the final font is Sans Serif.
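The sequential substitution described above can be sketched in a few lines of Python. This is only an illustration of the ordering behavior, not the filters' actual code: resolve_font is a hypothetical helper, locale patterns are left out for brevity, and matching the source font pattern against the whole font name is an assumption.

```python
import re


def resolve_font(font, mappings):
    """Apply (source_font_pattern, target_font) pairs in order to a font name."""
    for source_pattern, target_font in mappings:
        if not target_font:
            continue  # a mapping with an empty target font is ignored
        # Assumption: the pattern must match the entire font name.
        if re.fullmatch(source_pattern, font):
            font = target_font  # later mappings see the substituted value
    return font


mappings = [
    ("Arial", "Times New Roman"),
    ("Times New Roman", "Sans Serif"),
]

print(resolve_font("Arial", mappings))    # Sans Serif
print(resolve_font("Courier", mappings))  # Courier
```

Because each mapping sees the result of the previous one, reordering the list can change the final font.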
The serialized parameters can look like this:
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
When source locale, target locale and source font are omitted:
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
This is equivalent to the following:
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
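To make the equivalence of the short and the explicit forms concrete, here is a minimal Python sketch that reads the key=value lines and fills in .* for omitted patterns. The parse_font_mappings function is hypothetical and is not how Okapi reads its parameters; it assumes one key=value pair per line and that fontMappings.number.i holds the number of mappings.

```python
def parse_font_mappings(text):
    """Parse 'fontMappings.N.key=value' lines into a list of dicts."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines such as #v1
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    count = int(props.get("fontMappings.number.i", "0"))
    mappings = []
    for i in range(count):
        prefix = f"fontMappings.{i}."
        target = props.get(prefix + "targetFont", "")
        if not target:
            continue  # a mapping without a target font is ignored
        mappings.append({
            # Omitted patterns default to ".*", i.e. "match anything".
            "sourceLocalePattern": props.get(prefix + "sourceLocalePattern", ".*"),
            "targetLocalePattern": props.get(prefix + "targetLocalePattern", ".*"),
            "sourceFontPattern": props.get(prefix + "sourceFontPattern", ".*"),
            "targetFont": target,
        })
    return mappings


short_form = "\n".join([
    "fontMappings.0.targetFont=Arial Unicode MS",
    "fontMappings.number.i=1",
])
explicit_form = "\n".join([
    "fontMappings.0.sourceLocalePattern=.*",
    "fontMappings.0.targetLocalePattern=.*",
    "fontMappings.0.sourceFontPattern=.*",
    "fontMappings.0.targetFont=Arial Unicode MS",
    "fontMappings.number.i=1",
])

print(parse_font_mappings(short_form) == parse_font_mappings(explicit_form))  # True
```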