Filters
Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.
Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.
List of the Filters
The framework distribution comes with the following filters:
Supported File Formats
The following is a list of some of the file formats supported by the distribution through pre-defined configurations:
Format | Extensions | Pre-Defined Configuration | Filter |
Android Strings | .xml | okf_xml-AndroidStrings | XML Filter |
CSV (Comma-separated values files) | .csv, .txt | okf_table_csv | Table Filter |
DITA | .dita, .ditamap, .xml | okf_xmlstream-dita | XML Stream Filter |
DokuWiki pages | .txt | okf_wiki | Wiki Filter |
Doxygen-commented files | .c, .h, .cpp | okf_doxygen | Doxygen Filter |
DTD | .dtd | okf_dtd | DTD Filter |
Fixed-Width Columns Table | .txt | okf_table_fwc | Table Filter |
InCopy ICML | .wcml | okf_icml | ICML Filter |
InDesign IDML | .idml | okf_idml | IDML Filter |
iOS/Mac Strings | .strings | okf_regex-macStrings | Regex Filter |
Java Properties | .properties | okf_properties | Properties Filter |
Java Properties (Output not escaped) | .properties | okf_properties-outputNotEscaped | Properties Filter |
Java XML Properties | .xml | okf_xml-JavaProperties | XML Filter |
Java XML Properties (HTML strings) | .xml | okf_xmlstream-JavaPropertiesHTML | XML Stream Filter |
JSON | .json | okf_json | JSON Filter |
Haiku CatKeys | .catkeys | okf_table_catkeys | Table Filter |
HTML (any) | .html, .htm | okf_html | HTML Filter |
HTML (Well-formed, and XHTML) | .html, .htm | okf_html-wellFormed | HTML Filter |
HTML5 (and XHTML5) | .html, .htm | okf_itshtml5 | HTML5-ITS Filter |
Markdown | .md | okf_markdown | Markdown Filter |
Microsoft Excel 2007/2010 | .xlsx, .xlsm, .xltx, .xltm | okf_openxml | OpenXML Filter |
Microsoft PowerPoint 2007/2010 | .pptx, .pptm, .potx, .potm, .ppsx, .ppsm | okf_openxml | OpenXML Filter |
Microsoft Visio | .vsdx, .vsdm | okf_openxml | OpenXML Filter |
Microsoft Word 2007/2010 | .docx, .docm, .dotx, .dotm | okf_openxml | OpenXML Filter |
MIF | .mif | okf_mif | MIF Filter |
Moses Text | .txt | okf_mosestext | Moses Text Filter |
OpenOffice.org Calc | .ods, .ots | okf_openoffice | OpenOffice Filter |
OpenOffice.org Draw | .odg, .otg | okf_openoffice | OpenOffice Filter |
OpenOffice.org Impress | .odp, .otp | okf_openoffice | OpenOffice Filter |
OpenOffice.org Writer | .odt, .ott | okf_openoffice | OpenOffice Filter |
Pensieve TM | .pentm | okf_pensieve | Pensieve TM Filter |
PHP Content | .php | okf_phpcontent | PHP Content Filter |
Plain Text (Line = text unit) | .txt | okf_plaintext | Plain Text Filter |
Plain Text (Paragraph = text unit) | .txt | okf_plaintext_paragraphs | Plain Text Filter |
PO | .po | okf_po | PO Filter |
PO (Monolingual style) | .po | okf_po-monolingual | PO Filter |
Rainbow Translation Kit manifests | .rkm | okf_rainbowkit | Rainbow Translation Kit Filter |
RDF (Mozilla RDF) | .rdf | okf_xml-MozillaRDF | XML Filter |
RESX | .resx | okf_xml-resx | XML Filter |
SDLPPX | .sdlppx | okf_sdlpackage | SDL Trados Package Filter |
SDLRPX | .sdlrpx | okf_sdlpackage | SDL Trados Package Filter |
SDLXLIFF | .sdlxlf | okf_xliff-sdl | XLIFF Filter |
Skype Language Files | .lang | okf_properties-skypeLang | Properties Filter |
SRT (Sub-Rip Text, sub-titles files) | .srt | okf_regex-srt | Regex Filter |
Tab-Delimited files | .tsv, .txt | okf_table_tsv | Table Filter |
TMX | .tmx | okf_tmx | TMX Filter |
Transifex project | .txp | okf_transifex | Transifex Filter |
Trados-Tagged RTF | .rtf | okf_tradosrtf | Trados-Tagged RTF Filter |
TS - Qt TS files | .ts | okf_ts | TS Filter |
TTX - Trados TagEditor TTX files | .ttx | okf_ttx | TTX Filter |
TXML - Wordfast Pro TXML files | .txml | okf_txml | TXML Filter |
Versified Text | .vrsz | okf_versifiedtxt | Versified Text Filter |
Vignette Export/Import Content | .xml | okf_vignette | Vignette Filter |
XHTML | .html, .htm | okf_html-wellFormed | HTML Filter |
WIX (Windows Installer XML) localization files | .wix | okf_xml-WixLocalization | XML Filter |
XLIFF v1.2 | .xlf, .xliff | okf_xliff | XLIFF Filter |
XLIFF v2 | .xlf | okf_xliff2 | XLIFF-2 Filter |
XML (Generic, using ITS defaults) | .xml | okf_xml | XML Filter |
XML (Generic, using stream reader) | .xml | okf_xmlstream | XML Stream Filter |
YAML (Generic YAML filter) | .yml, .yaml | okf_yaml | YAML Filter |
Note that most filters allow you to create your own configurations to support more file formats.
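Several formats in the table share an extension, so an extension alone does not identify a single configuration. The following is a minimal sketch of that point in plain Python. It is not part of Okapi: the `CONFIGS_BY_EXTENSION` mapping and the `candidate_configs` helper are hypothetical names, and the mapping copies only a few rows of the table.

```python
# Hypothetical lookup built from a few rows of the table above.
# Illustration only; this is not an Okapi API.
CONFIGS_BY_EXTENSION = {
    ".json": ["okf_json"],
    ".po": ["okf_po", "okf_po-monolingual"],
    ".properties": ["okf_properties", "okf_properties-outputNotEscaped"],
    ".txt": ["okf_table_csv", "okf_wiki", "okf_table_fwc", "okf_mosestext",
             "okf_plaintext", "okf_plaintext_paragraphs", "okf_table_tsv"],
}


def candidate_configs(filename):
    """Return the configuration IDs whose listed extensions match filename."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    return CONFIGS_BY_EXTENSION.get(extension, [])


print(candidate_configs("messages.json"))  # one candidate
print(candidate_configs("notes.TXT"))      # several candidates
```

When more than one configuration is returned, the choice has to come from knowledge of the file's actual content rather than from its name.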
Code Simplification Rules
All filters support code simplification rules. By default, the Inline Codes Simplifier Step, the Simplification Filter and the Post-segmentation Inline Codes Removal Step maximize the trimming and merging (that is, the simplification) of inline codes. In some cases this may not be desired. The simplification rules let you override the default behavior and prevent specific codes from being trimmed or merged.
General Syntax
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.
For more details see the JavaCC grammar: ../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj
Rule Examples
If the code has any of these flags, don't simplify:
if DELETABLE or ADDABLE or CLONEABLE;
"=" is a string match. Match a basic TAGTYPE (opening, closing or standalone):
if DATA = "a" and TAGTYPE = OPENING;
"~" is a regex match:
if DATA ~ "a.*";
You can negate any of the match operators. Don't simplify if the DATA does not match the regex:
if DATA !~ "a.*";
Match on type (linebreak in this case). Don't simplify if the code is a linebreak:
if TYPE = "lb";
Don't simplify any rich text types:
if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";
Expressions can be recursive (embedded parentheses are supported):
if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
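The semantics of these rules can be modeled in a few lines. The sketch below is plain Python, not Okapi's parser: the `Code` class, its field names and the predicate functions are hypothetical, and it assumes that `~` must match the whole DATA string, which the text does not specify. It only shows that each rule acts as a predicate on a code, and that a code is kept out of simplification as soon as any one rule matches.

```python
import re
from dataclasses import dataclass


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not an Okapi class."""
    data: str
    type: str = ""
    tagtype: str = ""        # e.g. "OPENING"
    deletable: bool = False
    addable: bool = False
    cloneable: bool = False


# Each function mirrors one rule from the examples above.
def rule_flags(c):        # if DELETABLE or ADDABLE or CLONEABLE;
    return c.deletable or c.addable or c.cloneable


def rule_open_a(c):       # if DATA = "a" and TAGTYPE = OPENING;
    return c.data == "a" and c.tagtype == "OPENING"


def rule_not_regex(c):    # if DATA !~ "a.*";
    # Assumption: the regex has to match the entire DATA string.
    return re.fullmatch(r"a.*", c.data) is None


def rule_linebreak(c):    # if TYPE = "lb";
    return c.type == "lb"


def keep_unsimplified(code, rules):
    """Rules are OR'ed: one matching rule is enough to skip simplification."""
    return any(rule(code) for rule in rules)


rules = [rule_flags, rule_open_a, rule_linebreak]
print(keep_unsimplified(Code(data="<br/>", type="lb"), rules))  # True
print(keep_unsimplified(Code(data="<b>", type="bold"), rules))  # False
```

Here the linebreak code is protected by the `TYPE = "lb"` rule, while the bold code matches no rule and stays eligible for trimming and merging.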
Filter Config Examples
Examples of using simplifier rules within the filter config formats used by Okapi.
YAML:
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
ITS:
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0"
  xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
  xmlns:okp="okapi-framework:xmlfilter-options">
  <!-- See ITS specification at: http://www.w3.org/TR/its/ -->
  <its:translateRule selector="//*" translate="yes"/>
  <its:withinTextRule selector="//codeph" withinText="yes"/>
  <its:withinTextRule selector="//ph" withinText="yes"/>
  <okp:simplifierRules>
    if ADDABLE or DELETABLE or CLONEABLE;
    if DATA ~ ".+";
  </okp:simplifierRules>
</its:rules>
FPRM (Parameters):
#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
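To make the layout of the parameters format concrete, here is a minimal sketch of reading it. It is plain Python rather than Okapi code: `parse_fprm` is a hypothetical helper, and it assumes a `#v1` header line, one `key=value` entry per line, and that a `.b` suffix marks a boolean, as the example suggests.

```python
FPRM_TEXT = """#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
"""


def parse_fprm(text):
    """Parse key=value lines into a dict (hypothetical helper, not Okapi's)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "#v1" header
        # Split on the first "=" only: rule text may itself contain "=".
        key, _, value = line.partition("=")
        if key.endswith(".b"):
            params[key[:-2]] = (value == "true")
        else:
            params[key] = value
    return params


params = parse_fprm(FPRM_TEXT)
print(params["simplifyCodes"])
print(params["simplifierRules"])
```

Splitting on the first `=` only matters here because simplifier rules such as `DATA = "a"` contain the same character inside the value.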