Filters: Difference between revisions

Latest revision as of 19:23, 23 March 2026

Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.

Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.

List of the Filters

The framework distribution comes with the following filters:

Supported File Formats

The following is a list of some of the file formats supported by the distribution through pre-defined configurations:


Format	Extensions	Pre-Defined Configuration	Filter	Notes
Android Strings	.xml	`okf_xml-AndroidStrings`	XML Filter
Apple Stringsdict	.stringsdict	`okf_xml-AppleStringsdict`	XML Filter
Archive	.zip	`okf_archive`	Archive Filter	Meta filter that processes zip files with various formats as one file.
Auto Xliff	.xlf, .xliff	`okf_autoxliff`	Auto Xliff Filter	Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
AutoCAD DXF	.dxf	`okf_dxf`	DXF Filter	Only supports textual DXF, not binary DXF
CSV (Comma-separated values files)	.csv, .txt	`okf_table_csv`	Table Filter
CSV (Multiple complex sub-formats)	.csv	`okf_multiparsers`	Multi-Parsers Filter
DITA	.dita, .ditamap, .xml	`okf_xmlstream-dita`	XML Stream Filter
DocBook v5.0	.xml	`okf_xml-docbook`	XML Filter	Since Okapi 1.42. <footnote> is not handled properly.
DokuWiki pages	.txt	`okf_wiki`	Wiki Filter
Doxygen-commented files	.c, .h, .cpp	`okf_doxygen`	Doxygen Filter
DTD	.dtd	`okf_dtd`	DTD Filter
EPUB	.epub	`okf_epub`	EPUB Filter
Fixed-Width Columns Table	.txt	`okf_table_fwc`	Table Filter
Idiom WorldServer XLIFF	.xlf	`okf_xliff-iws`	XLIFF Filter
InCopy ICML	.wcml	`okf_icml`	ICML Filter
InDesign IDML	.idml	`okf_idml`	IDML Filter
iOS/Mac Strings	.strings	`okf_regex-macStrings`	Regex Filter
Java Properties	.properties	`okf_properties`	Properties Filter
Java Properties (Output not escaped)	.properties	`okf_properties-outputNotEscaped`	Properties Filter
Java XML Properties	.xml	`okf_xml-JavaProperties`	XML Filter
Java XML Properties (HTML strings)	.xml	`okf_xmlstream-JavaPropertiesHTML`	XML Stream Filter
JSON	.json	`okf_json`	JSON Filter
Haiku CatKeys	.catkeys	`okf_table_catkeys`	Table Filter
HTML (any)	.html, .htm	`okf_html`	HTML Filter
HTML (Well-formed, and XHTML)	.html, .htm	`okf_html-wellFormed`	HTML Filter
HTML5 (and XHTML5)	.html, .htm	`okf_itshtml5`	HTML5-ITS Filter
Markdown	.md	`okf_markdown`	Markdown Filter
Microsoft Excel 2007/2010	.xlsx, .xlsm, .xltx, .xltm	`okf_openxml`	OpenXML Filter
Microsoft PowerPoint 2007/2010	.pptx, .pptm, .potx, .potm, .ppsx, .ppsm	`okf_openxml`	OpenXML Filter
Microsoft Visio	.vsdx, .vsdm	`okf_openxml`	OpenXML Filter
Microsoft Word 2007/2010	.docx, .docm, .dotx, .dotm	`okf_openxml`	OpenXML Filter
MIF	.mif	`okf_mif`	MIF Filter
Moses Text	.txt	`okf_mosestext`	Moses Text Filter
OpenOffice.org Calc	.ods, .ots	`okf_odf`	OpenOffice Filter
OpenOffice.org Draw	.odg, .otg	`okf_odf`	OpenOffice Filter
OpenOffice.org Impress	.odp, .otp	`okf_odf`	OpenOffice Filter
OpenOffice.org Writer	.odt, .ott	`okf_odf`	OpenOffice Filter
PDF	.pdf	`okf_pdf`	PDF Filter
Pensieve TM	.pentm	`okf_pensieve`	Pensieve TM Filter
PHP Content	.php	`okf_phpcontent`	PHP Content Filter	Can be used as a subfilter only
Plain Text (Line = text unit)	.txt	`okf_plaintext`	Plain Text Filter
Plain Text (Paragraph = text unit)	.txt	`okf_plaintext_paragraphs`	Plain Text Filter
PO	.po	`okf_po`	PO Filter
PO (Monolingual style)	.po	`okf_po-monolingual`	PO Filter
Rainbow Translation Kit manifests	.rkm	`okf_rainbowkit`	Rainbow Translation Kit Filter	Used as a tkit reader only
Regex (Any text-based format)	.txt	`okf_regex`	Regex Filter
RDF (Mozilla RDF)	.rdf	`okf_xml-MozillaRDF`	XML Filter
RESX	.resx	`okf_xml-resx`	XML Filter
SDLPPX	.sdlppx	`okf_sdlpackage`	SDL Trados Package Filter
SDLRPX	.sdlrpx	`okf_sdlpackage`	SDL Trados Package Filter
SDLXLIFF	.sdlxlf	`okf_xliff-sdl`	XLIFF Filter
Skype Language Files	.lang	`okf_properties-skypeLang`	Properties Filter
SRT (Sub-Rip Text, sub-titles files)	.srt	`okf_regex-srt`	Regex Filter
Tab-delimited files	.tsv, .txt	`okf_table_tsv`	Table Filter
Tex files	.tex	`okf_tex`	TEX Filter
TMX	.tmx	`okf_tmx`	TMX Filter
Transifex project	.txp	`okf_transifex`	Transifex Filter
Trados-Tagged RTF	.rtf	`okf_tradosrtf`	Trados-Tagged RTF Filter
TS - Qt TS files	.ts	`okf_ts`	TS Filter
TTX - Trados TagEditor TTX files	.ttx	`okf_ttx`	TTX Filter
TXML - Wordfast Pro TXML files	.txml	`okf_txml`	TXML Filter
Vignette Export/Import Content	.xml	`okf_vignette`	Vignette Filter
WSXZ Package Filter	.wsxz	`okf_wsxzpackage`	WSXZ Package Filter
XHTML	.html, .htm	`okf_html-wellFormed`	HTML Filter
WIX (Windows Installer XML) localization files	.wix	`okf_xml-WixLocalization`	XML Filter
XLIFF v1.2	.xlf, .xliff	`okf_xliff`	XLIFF Filter
XLIFF v2	.xlf	`okf_xliff2`	XLIFF-2 Filter
XML (Generic, using ITS defaults)	.xml	`okf_xml`	XML Filter
XML (Generic, using stream reader)	.xml	`okf_xmlstream`	XML Stream Filter
YAML (Generic YAML filter)	.yml, .yaml	`okf_yaml`	YAML Filter
Message Format (ICU Message Format Filter)	Any container format that supports subfilters	`okf_messageformat`	Message Format Filter

Note that most filters allow you to create your own configurations to support more file formats.

Code Simplification Rules

Code simplification happens at two levels: in the filter itself and in a step (the Inline Codes Simplifier Step or the Post-segmentation Inline Codes Removal Step). It can be configured in three ways.

First, the extraction pipeline can contain just:

- Raw Document to Filter Events Step

At the moment, only the IDML Filter, the XML Filter and the Simplification Filter support this. Note that the Simplification Filter acts as a wrapper around another filter.

Second, the extraction pipeline can look like this:

- Raw Document to Filter Events Step

- Inline Codes Simplifier Step

This is the only option for filters that do not perform their own code simplification. Use it with care, because the final merge may not always handle the simplified codes correctly. The IDML Filter and XML Filter perform their own simplification, so adding the Inline Codes Simplifier Step should not affect the events they produce.

Third, the extraction pipeline can consist of:

- Raw Document to Filter Events Step

- Segmentation Step

- Post-segmentation Inline Codes Removal Step

Here, the Post-segmentation Inline Codes Removal Step performs code simplification after segmentation rules are applied, and it may be useful for skipping extra codes between segments.

By default, the Inline Codes Simplifier Step and Post-segmentation Inline Codes Removal Step maximise the trimming and merging (aka simplification) of inline codes. This can be tuned via the following string parameters:

- removeLeadingTrailingCodes - true by default

- mergeCodes - true by default

- rules - empty by default

Only the Inline Codes Simplifier Step configuration can be overridden from the filter configuration, via the following optional filter parameters:

- moveLeadingAndTrailingCodesToSkeleton - maps to removeLeadingTrailingCodes

- mergeAdjacentCodes - maps to mergeCodes

- simplifierRules - maps to rules
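The mapping above can be pictured with a small Python sketch. It is illustrative only and is not Okapi code: the function name `effective_step_config` is hypothetical, and the boolean parameters are modelled as Python booleans for readability.

```python
# Illustrative model of the documented parameter mapping; not Okapi's implementation.

# Step defaults as listed in this section.
STEP_DEFAULTS = {
    "removeLeadingTrailingCodes": True,
    "mergeCodes": True,
    "rules": "",
}

# Filter parameter name -> step parameter name, as listed in this section.
FILTER_TO_STEP = {
    "moveLeadingAndTrailingCodesToSkeleton": "removeLeadingTrailingCodes",
    "mergeAdjacentCodes": "mergeCodes",
    "simplifierRules": "rules",
}


def effective_step_config(filter_params):
    """Return the step configuration after applying any filter overrides."""
    config = dict(STEP_DEFAULTS)
    for filter_name, step_name in FILTER_TO_STEP.items():
        if filter_name in filter_params:  # the filter parameters are optional
            config[step_name] = filter_params[filter_name]
    return config


# Only mergeCodes changes; the other two values keep their defaults.
print(effective_step_config({"mergeAdjacentCodes": False}))
```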

The simplification rules let you prevent specific codes from being trimmed or merged.

General Syntax

The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants predefined by the rule parser. Multiple rules are always OR'ed together.

For more details, see the JavaCC grammar: ../okapi/core/src/main/javacc/SimplifierRules.jj

Rule Examples

If Code has any of these flags, then don't simplify

if DELETABLE or ADDABLE or CLONEABLE;

"=" is a string match. TAGTYPE matches the basic tag type: opening, closing or standalone.

if DATA = "a" and TAGTYPE = OPENING;

"~" is regex match

if DATA ~ "a.*";

You can negate any of the match operators. Don't simplify if the DATA does not match the regex

if DATA !~ "a.*";

Match on type, linebreak in this case, don't simplify

if TYPE = "lb";

Don't simplify any rich text types

if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";

Expressions can be recursive (supports embedded parens)

if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
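As a rough illustration of the semantics described above (a rule that matches means "do not simplify", and multiple rules are OR'ed), here is a minimal Python sketch. It is not the JavaCC-based parser: the rules are hand-translated into predicates, and the dictionary layout, the `make_code` helper and the `STANDALONE` and `CLOSING` tag type names are assumptions made for this example.

```python
# Hypothetical stand-in for an inline code; the keys mirror the rule constants.
def make_code(data, type_, tagtype, flags=()):
    return {"DATA": data, "TYPE": type_, "TAGTYPE": tagtype, "FLAGS": set(flags)}


# Hand-written equivalents of three rules from the examples above.
RULES = [
    # if DELETABLE or ADDABLE or CLONEABLE;
    lambda c: bool(c["FLAGS"] & {"DELETABLE", "ADDABLE", "CLONEABLE"}),
    # if DATA = "a" and TAGTYPE = OPENING;
    lambda c: c["DATA"] == "a" and c["TAGTYPE"] == "OPENING",
    # if TYPE = "lb";
    lambda c: c["TYPE"] == "lb",
]


def keep_unsimplified(code, rules=RULES):
    """True when at least one rule matches, i.e. the code must not be simplified."""
    return any(rule(code) for rule in rules)


# A line break code is protected by the third rule.
print(keep_unsimplified(make_code("<br/>", "lb", "STANDALONE")))
# A closing code with no flags matches none of the rules, so it may be simplified.
print(keep_unsimplified(make_code("a", "link", "CLOSING")))
```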

Filter Config Examples

Examples of using simplifier rules within the filter config formats used by Okapi.

YAML:

simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";

ITS:

<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="1.0" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options">
<!-- See ITS specification at: http://www.w3.org/TR/its/ -->
 <its:translateRule selector="//*" translate="yes"/>
 <its:withinTextRule selector="//codeph" withinText="yes"/>
 <its:withinTextRule selector="//ph" withinText="yes"/>
 <okp:simplifierRules moveLeadingAndTrailingCodesToSkeleton="yes" mergeAdjacentCodes="yes">
 if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
 </okp:simplifierRules>
</its:rules>

FPRM (Parameters):

#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";

Font Mapping

Font mapping is a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.

The following font mapping configuration options are available:

- The source locale regular expression pattern: .*, en.*, en-UK, etc. It can be omitted to apply the mapping to any source locale.

- The target locale regular expression pattern: .*, ru.*, ru-RU, etc. It can be omitted to apply the mapping to any target locale.

- The source font name regular expression pattern: .*, Arial.*, Times New Roman, etc. It can be omitted to apply the mapping to any source font name found.

- The target font name: Arial, Times New Roman, etc. It must not be empty; if it is, the mapping configuration is ignored.

The configured font mappings are applied in the order in which they are listed, and the final target font is determined by sequential substitution of the source font values. For example, if there is more than one mapping:

Arial -> Times New Roman
Times New Roman -> Sans Serif

then the first mapping replaces Arial with Times New Roman, and the second one is applied to this new value, so the final font is Sans Serif.
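The sequential substitution can be sketched in a few lines of Python. This is a simplified model, not the filters' implementation: it ignores the locale patterns, and it assumes that the source font pattern must match the whole font name (`re.fullmatch`). The function name `map_font` is hypothetical.

```python
import re


def map_font(font, mappings):
    """Apply each (source_font_pattern, target_font) mapping in the listed order."""
    for source_pattern, target_font in mappings:
        # Assumption: the pattern has to match the complete font name.
        if re.fullmatch(source_pattern, font):
            font = target_font  # later mappings see the substituted value
    return font


mappings = [
    ("Arial", "Times New Roman"),
    ("Times New Roman", "Sans Serif"),
]

print(map_font("Arial", mappings))  # prints "Sans Serif"
```

Because each mapping is applied to the result of the previous one, reordering the list can change the final font.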

The serialised parameters can look like this:

fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2

When the source locale, target locale and source font patterns are omitted:

fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1

This is equivalent to the previous example:

fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
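To make the equivalence concrete, the following Python sketch reads the serialised form and fills in `.*` for any omitted pattern. It is a hedged illustration rather than Okapi's parameter reader: `parse_font_mappings` is a hypothetical helper, and it only understands the `fontMappings.*` keys shown above.

```python
def parse_font_mappings(text):
    """Parse 'fontMappings.N.key=value' lines into a list of mapping dicts."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#v1' style header
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()

    count = int(props.get("fontMappings.number.i", "0"))
    mappings = []
    for i in range(count):
        prefix = f"fontMappings.{i}."
        target = props.get(prefix + "targetFont", "")
        if not target:
            continue  # an empty target font means the mapping is ignored
        mappings.append({
            # Omitted patterns default to '.*', i.e. "match anything".
            "sourceLocalePattern": props.get(prefix + "sourceLocalePattern", ".*"),
            "targetLocalePattern": props.get(prefix + "targetLocalePattern", ".*"),
            "sourceFontPattern": props.get(prefix + "sourceFontPattern", ".*"),
            "targetFont": target,
        })
    return mappings


SHORT = """\
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

LONG = """\
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
"""

print(parse_font_mappings(SHORT) == parse_font_mappings(LONG))  # prints True
```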

@@ Line 13: / Line 13: @@
 * [[DTD Filter]]
 * [[Doxygen Filter]]
+* [[DXF Filter]]
+* [[EPUB Filter]]
 * [[HTML Filter]]
 * [[HTML5-ITS Filter]]
@@ Line 19: / Line 21: @@
 * [[JSON Filter]]
 * [[Markdown Filter]]
+* [[Message Format Filter]]
 * [[MIF Filter]]
 * [[Moses Text Filter]]
@@ Line 44: / Line 47: @@
 * [[TXML Filter]]
 * [[Wiki Filter]]
+* [[WSXZ Package Filter]]
 * [[Vignette Filter]]
 * [[XLIFF Filter]]
@@ Line 67: / Line 71: @@
 |- valign="top"
 | Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter
+|- valign="top"
+| AutoCAD DXF || .dxf || <code>okf_dxf</code> || [[DXF Filter]] || Only supports textual DXF, not binary DXF
 |- valign="top"
 | CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] ||
@@ Line 73: / Line 79: @@
 |- valign="top"
 | DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] ||
+|- valign="top"
+| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. &lt;footnote> is not handled properly.
 |- valign="top"
 | DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] ||
@@ Line 79: / Line 87: @@
 |- valign="top"
 | DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] ||
+|- valign="top"
+| EPUB || .epub || <code>okf_epub</code> || [[EPUB Filter]] ||
 |- valign="top"
 | Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] ||
@@ Line 108: / Line 118: @@
 | HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] ||
 |- valign="top"
-| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]]
+| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] ||
 |- valign="top"
 | Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] ||
@@ Line 179: / Line 189: @@
 |- valign="top"
 | Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] ||
+|- valign="top"
+| WSXZ Package Filter || .wsxz || <code>okf_wsxzpackage</code> || [[WSXZ Package Filter]] ||
 |- valign="top"
 | XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] ||
@@ Line 193: / Line 205: @@
 |- valign="top"
 | YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] ||
+|- valign="top"
+| Message Format (ICU Message Format Filter) || Any container format that supports subfilters || <code>okf_messageformat</code> || [[Message Format Filter]] ||
 |}
@@ Line 199: / Line 213: @@
 ==Code Simplification Rules==
-All filters support code simplification rules. By default the [[Inline Codes Simplifier Step]], [[Simplification Filter]] and [[Post-segmentation Inline Codes Removal Step]] maximize the trimming and merging (aka simplification) of inline codes. In some cases this may not be desired. The simplification rules allow you to override the default behavior and prevent specific codes from being trimmed or merged.
+There are two levels of code simplification: filter and step (the [[Inline Codes Simplifier Step]] and [[Post-segmentation Inline Codes Removal Step]]). And there are different ways of configuring it:
+Firstly, the extraction pipeline can contain just:
+: - [[Raw Document to Filter Events Step]]
+At the moment, only [[IDML Filter]], [[XML Filter]] and [[Simplification Filter]] support this. It should be noted that the last one performs like a wrapper for another filter.
+Secondly, the extraction pipeline can look like that:
+: - [[Raw Document to Filter Events Step]]
+: - [[Inline Codes Simplifier Step]]
+This is the only way for filters that do not support their own code simplification, and it should be used with care because the final merge may not always handle this correctly. The aforementioned [[IDML Filter]] and [[XML Filter]] can perform their own simplification, and the added [[Inline Codes Simplifier Step]] should not affect the events produced.
+Thirdly, the extraction pipeline can consist of:
+: - [[Raw Document to Filter Events Step]]
+: - [[Segmentation Step]]
+: - [[Post-segmentation Inline Codes Removal Step]]
+Here, the [[Post-segmentation Inline Codes Removal Step]] performs code simplification after segmentation rules are applied, and it may be useful for skipping extra codes between segments.
+By default, the [[Inline Codes Simplifier Step]] and [[Post-segmentation Inline Codes Removal Step]] maximise the trimming and merging (aka simplification) of inline codes. This can be tuned via the following string parameters:
+: - <code>removeLeadingTrailingCodes</code> - <code>true</code> by default
+: - <code>mergeCodes</code> - <code>true</code> by default
+: - <code>rules</code> - empty by default
+Only the [[Inline Codes Simplifier Step]] configuration can be overridden by the optional filter ones via the following parameters:
+: - <code>moveLeadingAndTrailingCodesToSkeleton</code> - maps to the <code>removeLeadingTrailingCodes</code>
+: - <code>mergeAdjacentCodes</code> - maps to the <code>mergeCodes</code>
+: - <code>simplifierRules</code> - maps to the <code>rules</code>
+The simplification rules allow the prevention of specific codes trimming or merging.
 ===General Syntax===
-The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies it means "do not simplify the match code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together.
+The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the match code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together.
-For more details see the JavaCC grammar: <code>../okapi-core/src/main/java/net/sf/okapi/core/simplifierrules/SimplifierRules.jj</code>
+For more details, see the JavaCC grammar: <code>../okapi/core/src/main/javacc/SimplifierRules.jj</code>
 ===Rule Examples===
-If Code has any of these flags then don't simplify
+If Code has any of these flags, then don't simplify
 <pre>if DELETABLE or ADDABLE or CLONEABLE;</pre>
@@ Line 229: / Line 273: @@
 Match on type, linebreak in this case, don't simplify
-<pre>if the Code is a linebreak if TYPE = "lb";</pre>
+<pre>if TYPE = "lb";</pre>
 Don't simplify any rich text types
@@ Line 262: / Line 306: @@
   <its:withinTextRule selector="//codeph" withinText="yes"/>
   <its:withinTextRule selector="//ph" withinText="yes"/>
-  <okp:simplifierRules>
+  <okp:simplifierRules moveLeadingAndTrailingCodesToSkeleton="yes" mergeAdjacentCodes="yes">
   if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
   </okp:simplifierRules>
@@ Line 273: / Line 317: @@
 #v1
 extractNotes.b=true
-simplifyCodes.b=true
 simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if DATA ~ ".+";
 </pre>
@@ Line 279: / Line 322: @@
 ==Font Mapping==
-The font mapping can be considered as filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration - this helps to reduce the amount of reformatting and post-translation DTP. It is supported by IDML and OpenXML (DOCX documents) filters at the moment.
+The font mapping can be considered as a filter's ability to automatically substitute font information in the target document on the fly, according to a provided configuration - this helps to reduce the amount of reformatting and post-translation DTP. It is supported by IDML and OpenXML (DOCX, PPTX and XLSX documents) filters at the moment.
 The following font mapping configuration options are available:
-* The source language regular expression pattern: <code>en.*</code>, <code>en-UK</code>, etc. It can be left empty to apply the mapping to any source language.
+* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be ommited to apply the mapping to any source locale.
-* The target language regular expression pattern: <code>ru.*</code>, <code>ru-RU</code>, etc. It can be left empty to apply the mapping to any target language.
+* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be ommited to apply the mapping to any target locale.
-* The source font name regular expression pattern: <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be left empty to apply the mapping to any source font name found.
+* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be ommited to apply the mapping to any source font name found.
 * The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It should not be empty. And if it is made so, the mapping configuration is ignored.
@@ Line 305: / Line 348: @@
 fontMappings.1.targetFont=Arial Unicode MS
 fontMappings.number.i=2
+</pre>
+When source locale, target locale and source font are omitted:
+<pre>
+fontMappings.0.targetFont=Arial Unicode MS
+fontMappings.number.i=1
+</pre>
+And this is the same as the abovementioned:
+<pre>
+fontMappings.0.sourceLocalePattern=.*
+fontMappings.0.targetLocalePattern=.*
+fontMappings.0.sourceFontPattern=.*
+fontMappings.0.targetFont=Arial Unicode MS
+fontMappings.number.i=1
 </pre>
 [[Category:Filters]]
