Filters: Difference between revisions
Revision as of 07:11, 16 October 2018
Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.
Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.
List of the Filters
The framework distribution comes with the following filters:
Supported File Formats
The following is a list of some of the file formats supported by the distribution through pre-defined configurations:
Format | Extensions | Pre-Defined Configuration | Filter |
Android Strings | .xml | okf_xml-AndroidStrings | XML Filter |
CSV (Comma-separated values files) | .csv, .txt | okf_table_csv | Table Filter |
DITA | .dita, .ditamap, .xml | okf_xmlstream-dita | XML Stream Filter |
DokuWiki pages | .txt | okf_wiki | Wiki Filter |
Doxygen-commented files | .c, .h, .cpp | okf_doxygen | Doxygen Filter |
DTD | .dtd | okf_dtd | DTD Filter |
Fixed-Width Columns Table | .txt | okf_table_fwc | Table Filter |
InCopy ICML | .wcml | okf_icml | ICML Filter |
InDesign IDML | .idml | okf_idml | IDML Filter |
iOS/Mac Strings | .strings | okf_regex-macStrings | Regex Filter |
Java Properties | .properties | okf_properties | Properties Filter |
Java Properties (Output not escaped) | .properties | okf_properties-outputNotEscaped | Properties Filter |
Java XML Properties | .xml | okf_xml-JavaProperties | XML Filter |
Java XML Properties (HTML strings) | .xml | okf_xmlstream-JavaPropertiesHTML | XML Stream Filter |
JSON | .json | okf_json | JSON Filter |
Haiku CatKeys | .catkeys | okf_table_catkeys | Table Filter |
HTML (any) | .html, .htm | okf_html | HTML Filter |
HTML (Well-formed, and XHTML) | .html, .htm | okf_html-wellFormed | HTML Filter |
HTML5 (and XHTML5) | .html, .htm | okf_itshtml5 | HTML5-ITS Filter |
Markdown | .md | okf_markdown | Markdown Filter |
Microsoft Excel 2007/2010 | .xlsx, .xlsm, .xltx, .xltm | okf_openxml | OpenXML Filter |
Microsoft PowerPoint 2007/2010 | .pptx, .pptm, .potx, .potm, .ppsx, .ppsm | okf_openxml | OpenXML Filter |
Microsoft Visio | .vsdx, .vsdm | okf_openxml | OpenXML Filter |
Microsoft Word 2007/2010 | .docx, .docm, .dotx, .dotm | okf_openxml | OpenXML Filter |
MIF | .mif | okf_mif | MIF Filter |
Moses Text | .txt | okf_mosestext | Moses Text Filter |
OpenOffice.org Calc | .ods, .ots | okf_openoffice | OpenOffice Filter |
OpenOffice.org Draw | .odg, .otg | okf_openoffice | OpenOffice Filter |
OpenOffice.org Impress | .odp, .otp | okf_openoffice | OpenOffice Filter |
OpenOffice.org Writer | .odt, .ott | okf_openoffice | OpenOffice Filter |
Pensieve TM | .pentm | okf_pensieve | Pensieve TM Filter |
PHP Content | .php | okf_phpcontent | PHP Content Filter |
Plain Text (Line = text unit) | .txt | okf_plaintext | Plain Text Filter |
Plain Text (Paragraph = text unit) | .txt | okf_plaintext_paragraphs | Plain Text Filter |
PO | .po | okf_po | PO Filter |
PO (Monolingual style) | .po | okf_po-monolingual | PO Filter |
Rainbow Translation Kit manifests | .rkm | okf_rainbowkit | Rainbow Translation Kit Filter |
RDF (Mozilla RDF) | .rdf | okf_xml-MozillaRDF | XML Filter |
RESX | .resx | okf_xml-resx | XML Filter |
SDLPPX | .sdlppx | okf_sdlpackage | SDL Trados Package Filter |
SDLRPX | .sdlrpx | okf_sdlpackage | SDL Trados Package Filter |
SDLXLIFF | .sdlxlf | okf_xliff-sdl | XLIFF Filter |
Skype Language Files | .lang | okf_properties-skypeLang | Properties Filter |
SRT (Sub-Rip Text, sub-titles files) | .srt | okf_regex-srt | Regex Filter |
Tab-Delimited files | .tsv, .txt | okf_table_tsv | Table Filter |
TMX | .tmx | okf_tmx | TMX Filter |
Transifex project | .txp | okf_transifex | Transifex Filter |
Trados-Tagged RTF | .rtf | okf_tradosrtf | Trados-Tagged RTF Filter |
TS - Qt TS files | .ts | okf_ts | TS Filter |
TTX - Trados TagEditor TTX files | .ttx | okf_ttx | TTX Filter |
TXML - Wordfast Pro TXML files | .txml | okf_txml | TXML Filter |
Versified Text | .vrsz | okf_versifiedtxt | Versified Text Filter |
Vignette Export/Import Content | .xml | okf_vignette | Vignette Filter |
XHTML | .html, .htm | okf_html-wellFormed | HTML Filter |
WIX (Windows Installer XML) localization files | .wix | okf_xml-WixLocalization | XML Filter |
XLIFF v1.2 | .xlf, .xliff | okf_xliff | XLIFF Filter |
XLIFF v2 | .xlf | okf_xliff2 | XLIFF-2 Filter |
XML (Generic, using ITS defaults) | .xml | okf_xml | XML Filter |
XML (Generic, using stream reader) | .xml | okf_xmlstream | XML Stream Filter |
YAML (Generic YAML filter) | .yml, .yaml | okf_yaml | YAML Filter |
Note that most filters allow you to create your own configurations to support more file formats.
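Several extensions in the table (for example .txt, .xml and .xlf) map to more than one pre-defined configuration, so an extension alone does not always identify the configuration to use. The following is a minimal Python sketch of such a lookup, built from a few rows copied from the table. The names ROWS, build_index and candidate_configs are hypothetical helpers for illustration only and are not part of the Okapi API.

```python
from collections import defaultdict
from pathlib import PurePath

# A few rows copied from the table: (extensions, pre-defined configuration ID).
ROWS = [
    ((".csv", ".txt"), "okf_table_csv"),
    ((".txt",), "okf_plaintext"),
    ((".json",), "okf_json"),
    ((".md",), "okf_markdown"),
    ((".docx", ".docm", ".dotx", ".dotm"), "okf_openxml"),
    ((".xlf", ".xliff"), "okf_xliff"),
    ((".xlf",), "okf_xliff2"),
]


def build_index(rows):
    """Map each extension to every configuration ID that lists it."""
    index = defaultdict(list)
    for extensions, config_id in rows:
        for ext in extensions:
            index[ext].append(config_id)
    return dict(index)


def candidate_configs(filename, index):
    """Return the candidate configuration IDs for a file name (possibly several)."""
    ext = PurePath(filename).suffix.lower()
    return index.get(ext, [])


INDEX = build_index(ROWS)
print(candidate_configs("notes.txt", INDEX))
```

When the lookup returns more than one candidate, the choice has to come from knowledge of the file content, not from the extension.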
Code Simplification Rules
All filters support code simplification rules. By default the Inline Codes Simplifier Step, Simplification Filter and Post-segmentation Inline Codes Removal Step maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.
General Syntax
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule matches a code, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.
For more details see the JavaCC grammar: ../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj
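As a rough illustration of why the separator between rules does not matter, the sketch below splits a rule string on the ";" terminators shown in the rule examples, skipping any ";" that sits inside a double-quoted string. It is a simplified Python model, not the JavaCC parser; the function name split_rules and the backslash-escape handling are assumptions made for this example.

```python
def split_rules(text):
    """Split rule text on ';' terminators that occur outside double quotes.

    Text after the last ';' is ignored in this sketch.
    """
    rules = []
    current = []
    in_quotes = False
    escaped = False
    for ch in text:
        current.append(ch)
        if in_quotes:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch == ";":
            rules.append("".join(current).strip())
            current = []
    return rules


# Rules separated by nothing, by a newline and by spaces give the same result.
print(split_rules('if ADDABLE;if DATA = "a;b";\n  if TYPE = "lb";'))
```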
Rule Examples
If a code has any of these flags, don't simplify it:
    if DELETABLE or ADDABLE or CLONEABLE;
"=" is a string match. Match a basic TAGTYPE (opening, closing or standalone):
    if DATA = "a" and TAGTYPE = OPENING;
"~" is a regex match:
    if DATA ~ "a.*";
You can negate any of the match operators. Don't simplify if the DATA does not match the regex:
    if DATA !~ "a.*";
Match on type (linebreak in this case). Don't simplify if the code is a linebreak:
    if TYPE = "lb";
Don't simplify any rich text types:
    if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";
Expressions can be recursive (embedded parentheses are supported):
    if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
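To make the "rules are OR'ed together" behavior concrete, here is a small Python model in which each rule is a predicate and a code is left unsimplified as soon as any predicate matches. This is only a sketch: the Code class, the string labels for flags and tag types, and the reading of "~" as a full-string regex match are all assumptions of this example, not the Okapi implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Code:
    """Hypothetical stand-in for an inline code; not the Okapi class."""
    data: str
    type: str = ""
    tag_type: str = "STANDALONE"   # label chosen for this sketch
    flags: frozenset = frozenset()


# Each predicate models one rule from the examples.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c.flags & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if DATA = "a" and TAGTYPE = OPENING;
    lambda c: c.data == "a" and c.tag_type == "OPENING",
    # if DATA ~ "a.*";   (assumed to be a full-string match)
    lambda c: re.fullmatch(r"a.*", c.data) is not None,
    # if TYPE = "lb";
    lambda c: c.type == "lb",
]


def is_protected(code, rules=RULES):
    """True if any rule matches, meaning the code should not be simplified."""
    return any(rule(code) for rule in rules)


print(is_protected(Code(data="<br/>", type="lb")))
print(is_protected(Code(data="<b>", type="bold")))
```

Because the rules are combined with OR, adding a rule can only protect more codes; it never re-enables simplification for a code that another rule already matched.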
Filter Config Examples
Examples of using simplifier rules within the filter config formats used by Okapi.
YAML:
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
ITS:
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0"
           xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
           xmlns:okp="okapi-framework:xmlfilter-options">
  <!-- See ITS specification at: http://www.w3.org/TR/its/ -->
  <its:translateRule selector="//*" translate="yes"/>
  <its:withinTextRule selector="//codeph" withinText="yes"/>
  <its:withinTextRule selector="//ph" withinText="yes"/>
  <okp:simplifierRules>
    if ADDABLE or DELETABLE or CLONEABLE;
    if DATA ~ ".+";
  </okp:simplifierRules>
</its:rules>
FPRM (Parameters):
#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
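For readers who want to inspect such a parameters file outside Okapi, the Python sketch below reads the key=value lines and pulls out the simplifier rules. It assumes one parameter per line, that lines starting with "#" are headers or comments, and that a ".b" suffix on a key marks a boolean value; parse_fprm is a hypothetical helper, not an Okapi function.

```python
SAMPLE = """#v1
extractNotes.b=true
simplifyCodes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
"""


def parse_fprm(text):
    """Parse key=value lines; keys ending in '.b' are read as booleans (assumption)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip the version header and blank lines
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        if key.endswith(".b"):
            params[key[:-2]] = (value == "true")
        else:
            params[key] = value
    return params


params = parse_fprm(SAMPLE)
print(params["simplifyCodes"])
print(params["simplifierRules"])
```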