Filters: Difference between revisions
Jhargraveiii (talk | contribs) No edit summary |
| (22 intermediate revisions by 4 users not shown) | |||
| Line 13: | Line 13: | ||
* [[DTD Filter]] | * [[DTD Filter]] | ||
* [[Doxygen Filter]] | * [[Doxygen Filter]] | ||
* [[DXF Filter]] | |||
* [[EPUB Filter]] | |||
* [[HTML Filter]] | * [[HTML Filter]] | ||
* [[HTML5-ITS Filter]] | * [[HTML5-ITS Filter]] | ||
| Line 19: | Line 21: | ||
* [[JSON Filter]] | * [[JSON Filter]] | ||
* [[Markdown Filter]] | * [[Markdown Filter]] | ||
* [[Message Format Filter]] | |||
* [[MIF Filter]] | * [[MIF Filter]] | ||
* [[Moses Text Filter]] | * [[Moses Text Filter]] | ||
| Line 44: | Line 47: | ||
* [[TXML Filter]] | * [[TXML Filter]] | ||
* [[Wiki Filter]] | * [[Wiki Filter]] | ||
* [[WSXZ Package Filter]] | |||
* [[Vignette Filter]] | * [[Vignette Filter]] | ||
* [[XLIFF Filter]] | * [[XLIFF Filter]] | ||
| Line 67: | Line 71: | ||
|- valign="top" | |- valign="top" | ||
| Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter | | Auto Xliff || .xlf, .xliff || <code>okf_autoxliff</code> || [[Auto Xliff Filter]] || Detects the version of an XLIFF file and then hands parsing off to the appropriate filter | ||
|- valign="top" | |||
| AutoCAD DXF || .dxf || <code>okf_dxf</code> || [[DXF Filter]] || Only supports textual DXF, not binary DXF | |||
|- valign="top" | |- valign="top" | ||
| CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] || | | CSV (Comma-separated values files) || .csv, .txt || <code>okf_table_csv</code> || [[Table Filter]] || | ||
| Line 73: | Line 79: | ||
|- valign="top" | |- valign="top" | ||
| DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] || | | DITA || .dita, .ditamap, .xml || <code>okf_xmlstream-dita</code> || [[XML Stream Filter]] || | ||
|- valign="top" | |||
| DocBook v5.0 || .xml || <code>okf_xml-docbook</code> || [[XML Filter]] || Since Okapi 1.42. <footnote> is not handled properly. | |||
|- valign="top" | |- valign="top" | ||
| DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] || | | DokuWiki pages || .txt || <code>okf_wiki</code> || [[Wiki Filter]] || | ||
| Line 79: | Line 87: | ||
|- valign="top" | |- valign="top" | ||
| DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] || | | DTD || .dtd || <code>okf_dtd</code> || [[DTD Filter]] || | ||
|- valign="top" | |||
| EPUB || .epub || <code>okf_epub</code> || [[EPUB Filter]] || | |||
|- valign="top" | |- valign="top" | ||
| Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] || | | Fixed-Width Columns Table || .txt || <code>okf_table_fwc</code> || [[Table Filter]] || | ||
| Line 108: | Line 118: | ||
| HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] || | | HTML5 (and XHTML5) || .html, .htm|| <code>okf_itshtml5</code> || [[HTML5-ITS Filter]] || | ||
|- valign="top" | |- valign="top" | ||
| Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] | | Markdown || .md || <code>okf_markdown</code> || [[Markdown Filter]] || | ||
|- valign="top" | |- valign="top" | ||
| Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] || | | Microsoft Excel 2007/2010 || .xlsx, .xlsm, .xltx, .xltm || <code>okf_openxml</code> || [[OpenXML Filter]] || | ||
| Line 178: | Line 188: | ||
| TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] || | | TXML - Wordfast Pro TXML files || .txml || <code>okf_txml</code> || [[TXML Filter]] || | ||
|- valign="top" | |- valign="top" | ||
| | | Vignette Export/Import Content || .xml || <code>okf_vignette</code> || [[Vignette Filter]] || | ||
|- valign="top" | |- valign="top" | ||
| | | WSXZ Package Filter || .wsxz || <code>okf_wsxzpackage</code> || [[WSXZ Package Filter]] || | ||
|- valign="top" | |- valign="top" | ||
| XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] || | | XHTML || .html, .htm || <code>okf_html-wellFormed</code> || [[HTML Filter]] || | ||
| Line 195: | Line 205: | ||
|- valign="top" | |- valign="top" | ||
| YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] || | | YAML (Generic YAML filter) || .yml, .yaml || <code>okf_yaml</code> || [[YAML Filter]] || | ||
|- valign="top" | |||
| Message Format (ICU Message Format Filter) || Any container format that supports subfilters || <code>okf_messageformat</code> || [[Message Format Filter]] || | |||
|} | |} | ||
| Line 201: | Line 213: | ||
==Code Simplification Rules== | ==Code Simplification Rules== | ||
There are two levels of code simplification: filter and step (the [[Inline Codes Simplifier Step]] and the [[Post-segmentation Inline Codes Removal Step]]). It can be configured in several ways: | |||
Firstly, the extraction pipeline can contain just: | |||
* [[Raw Document to Filter Events Step]] | |||
At the moment, only [[IDML Filter]], [[XML Filter]] and [[Simplification Filter]] support this. Note that the last one acts as a wrapper around another filter. | |||
Secondly, the extraction pipeline can look like this: | |||
* [[Raw Document to Filter Events Step]] | |||
* [[Inline Codes Simplifier Step]] | |||
This is the only way for filters that do not support their own code simplification, and it should be used with care because the final merge may not always handle this correctly. The aforementioned [[IDML Filter]] and [[XML Filter]] can perform their own simplification, and the added [[Inline Codes Simplifier Step]] should not affect the events produced. | |||
Thirdly, the extraction pipeline can consist of: | |||
* [[Raw Document to Filter Events Step]] | |||
* [[Segmentation Step]] | |||
* [[Post-segmentation Inline Codes Removal Step]] | |||
Here, the [[Post-segmentation Inline Codes Removal Step]] performs code simplification after segmentation rules are applied, and it may be useful for skipping extra codes between segments. | |||
By default, the [[Inline Codes Simplifier Step]] and [[Post-segmentation Inline Codes Removal Step]] maximise the trimming and merging (aka simplification) of inline codes. This can be tuned via the following string parameters: | |||
* <code>removeLeadingTrailingCodes</code> - <code>true</code> by default | |||
* <code>mergeCodes</code> - <code>true</code> by default | |||
* <code>rules</code> - empty by default | |||
Only the [[Inline Codes Simplifier Step]] configuration can be overridden by a filter, through the following optional filter parameters: | |||
* <code>moveLeadingAndTrailingCodesToSkeleton</code> - maps to the <code>removeLeadingTrailingCodes</code> parameter | |||
* <code>mergeAdjacentCodes</code> - maps to the <code>mergeCodes</code> parameter | |||
* <code>simplifierRules</code> - maps to the <code>rules</code> parameter | |||
Simplification rules prevent specific codes from being trimmed or merged. When a rule matches a code, it means: '''do not simplify this code'''. | |||
=== General Syntax === | |||
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together. | |||
For more details, see the JavaCC grammar: <code>../okapi/core/src/main/javacc/SimplifierRules.jj</code> | |||
Each rule starts with <code>if</code> and ends with <code>;</code>. | |||
<pre> | |||
if TYPE = "bold"; | |||
</pre> | |||
Multiple rules are always OR'ed together. In other words, if any rule matches a code, that code is not simplified. | |||
Within one rule, expressions can be combined with <code>and</code>, <code>or</code> and parentheses. | |||
<pre> | |||
if TYPE = "bold" or (DATA = "<br/>" and TAG_TYPE = STANDALONE); | |||
</pre> | |||
The parser supports comments: | |||
<pre> | |||
# This is a line comment | |||
if TYPE = "bold"; | |||
/* | |||
This is a block comment | |||
*/ | |||
if TYPE = "italic"; | |||
</pre> | |||
=== Available Fields and Literals === | |||
The following string fields can be used with quoted string values: | |||
{| class="wikitable" | |||
! Field | |||
! Meaning | |||
|- | |||
| <code>DATA</code> | |||
| The raw/native data of the inline code. For XML/HTML-like tags this is normally the complete raw code, for example <code><a></code>, <code></a></code> or <code><br/></code>. Depending on the filter and the configuration format, the literal value may also contain escaped markup, for example <code>&lt;/a&gt;</code>. <code>DATA</code> is not just the tag name. | |||
|- | |||
| <code>OUTER_DATA</code> | |||
| The complete outer data of the inline code, if available. This is mainly relevant for formats that store inline codes as markup themselves, such as XLIFF or TMX. If no separate outer data is available, this returns the same value as <code>DATA</code>. | |||
|- | |||
| <code>ORIGINAL_ID</code> | |||
| The original inline-code ID from the filtered/input format, if available. If no original ID is available, it behaves like an empty string. | |||
|- | |||
| <code>TYPE</code> | |||
| The abstract type of the inline code. This is often the tag name or a semantic type, for example <code>"a"</code>, <code>"bold"</code>, <code>"italic"</code>, <code>"underline"</code> or <code>"lb"</code>. Opening and closing codes that belong together normally have the same <code>TYPE</code>. | |||
|} | |||
The following field is used with unquoted tag-type literals: | |||
{| class="wikitable" | |||
! Field | |||
! Allowed values | |||
|- | |||
| <code>TAG_TYPE</code> | |||
| <code>OPENING</code>, <code>CLOSING</code> or <code>STANDALONE</code> | |||
|} | |||
Important: <code>TAG_TYPE</code> is written with an underscore. <code>TAGTYPE</code> is invalid. | |||
The tag-type literals are not strings and must not be quoted: | |||
<pre> | |||
if TAG_TYPE = OPENING; | |||
if TAG_TYPE = CLOSING; | |||
if TAG_TYPE = STANDALONE; | |||
</pre> | |||
This is invalid: | |||
<pre> | |||
if TAG_TYPE = "CLOSING"; | |||
</pre> | |||
The following boolean flag literals can be used without an operator: | |||
{| class="wikitable" | |||
! Literal | |||
! Meaning | |||
|- | |||
| <code>ADDABLE</code> | |||
| Matches codes for which <code>code.isAdded()</code> is true. Despite the rule name, this means that the code was added after extraction and was not found in the original source. | |||
|- | |||
| <code>DELETABLE</code> | |||
| Matches codes for which <code>code.isDeleteable()</code> is true. Such a code may be removed from the text, for example a formatting code, unlike a required placeholder such as <code>%s</code>. | |||
|- | |||
| <code>CLONEABLE</code> | |||
| Matches codes for which <code>code.isCloneable()</code> is true. Such a code may be duplicated in the text, for example a formatting code, unlike a placeholder that must occur exactly once. | |||
|} | |||
Since a matching rule means '''do not simplify this code''', a rule such as the following protects all codes that match any of these three flags from trimming and merging: | |||
<pre> | |||
if ADDABLE or DELETABLE or CLONEABLE; | |||
</pre> | |||
This rule is useful if the information expressed by these flags should be preserved and the code should not be merged with neighbouring codes or moved out of the text unit as a leading/trailing code. | |||
=== Operators === | |||
String fields such as <code>DATA</code>, <code>OUTER_DATA</code>, <code>ORIGINAL_ID</code> and <code>TYPE</code> can be matched with quoted string values: | |||
{| class="wikitable" | |||
! Operator | |||
! Meaning | |||
! Example | |||
|- | |||
| <code>=</code> | |||
| Exact string match | |||
| <code>if TYPE = "bold";</code> | |||
|- | |||
| <code>!=</code> | |||
| Negated exact string match | |||
| <code>if TYPE != "bold";</code> | |||
|- | |||
| <code>~</code> | |||
| Regular expression match | |||
| <code>if DATA ~ ".*meta-ref.*";</code> | |||
|- | |||
| <code>!~</code> | |||
| Negated regular expression match | |||
| <code>if DATA !~ ".*meta-ref.*";</code> | |||
|} | |||
< | <code>TAG_TYPE</code> is matched against the unquoted tag-type literals: | ||
<pre> | |||
if TAG_TYPE = OPENING; | |||
if TAG_TYPE != CLOSING; | |||
</pre> | |||
Boolean flag literals are used directly: | |||
<pre> | |||
if DELETABLE; | |||
if DELETABLE or CLONEABLE; | |||
</pre> | |||
=== Rule Examples === | |||
If a code has any of these flags, then do not simplify it: | |||
<pre>if | <pre> | ||
if DELETABLE or ADDABLE or CLONEABLE; | |||
</pre> | |||
Match | Match an opening <code>a</code> code by abstract type: | ||
<pre> | <pre> | ||
if TYPE = "a" and TAG_TYPE = OPENING; | |||
</pre> | |||
Match a closing <code>a</code> code by abstract type: | |||
<pre>if TYPE = " | <pre> | ||
if TYPE = "a" and TAG_TYPE = CLOSING; | |||
</pre> | |||
Match a standalone <code>br</code> code by abstract type: | |||
<pre>if TYPE = " | <pre> | ||
if TYPE = "br" and TAG_TYPE = STANDALONE; | |||
</pre> | |||
Match by raw data. For XML/HTML-like tags, <code>DATA</code> usually contains the complete raw tag, not just the tag name: | |||
<pre> | |||
if DATA = "<a>" and TAG_TYPE = OPENING; | |||
if DATA = "</a>" and TAG_TYPE = CLOSING; | |||
if DATA = "<br/>" and TAG_TYPE = STANDALONE; | |||
</pre> | |||
Depending on the filter and the configuration format in which the rule is stored, XML/HTML characters may need to be escaped. For example, if the literal <code>DATA</code> value is <code>&lt;/a&gt;</code>, use: | |||
<pre> | |||
if DATA = "&lt;/a&gt;" and TAG_TYPE = CLOSING; | |||
</pre> | |||
Use <code>TYPE</code> when you want to match the abstract code type or tag name. Use <code>DATA</code> when you want to match the actual raw/native inline-code content. | |||
Regular expression match: | |||
<pre> | |||
if DATA ~ ".*meta-ref.*"; | |||
</pre> | |||
Regular expression match for opening <code>a</code> tags where the raw tag may contain attributes: | |||
<pre> | |||
if DATA ~ "<a[ >].*" and TAG_TYPE = OPENING; | |||
</pre> | |||
Negated regular expression match: | |||
<pre> | |||
if DATA !~ ".*meta-ref.*"; | |||
</pre> | |||
Match on code type; in this case, do not simplify line break codes: | |||
<pre> | |||
if TYPE = "lb"; | |||
</pre> | |||
Do not simplify rich text types: | |||
<pre> | |||
if TYPE = "bold" or TYPE = "italic" or TYPE = "underline"; | |||
</pre> | |||
Expressions can be recursive and support parentheses: | |||
<pre> | |||
if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline")); | |||
</pre> | |||
Match by original ID, if the filter provides one: | |||
<pre> | |||
if ORIGINAL_ID = "1"; | |||
if ORIGINAL_ID ~ "b[0-9]+"; | |||
</pre> | |||
Match by outer data, for formats that store inline codes as markup themselves, such as XLIFF or TMX: | |||
<pre> | |||
if OUTER_DATA ~ ".*<ph.*"; | |||
</pre> | |||
Match flag literals individually: | |||
<pre> | |||
# Do not simplify codes added after extraction | |||
if ADDABLE; | |||
# Do not simplify codes that may be removed | |||
if DELETABLE; | |||
# Do not simplify codes that may be duplicated | |||
if CLONEABLE; | |||
</pre> | |||
Or combine them in one rule: | |||
<pre> | |||
if ADDABLE or DELETABLE or CLONEABLE; | |||
</pre> | |||
=== Filter Config Examples === | |||
Examples of using simplifier rules within the filter configuration formats used by Okapi. | |||
==== YAML ==== | |||
<pre> | <pre> | ||
simplifierRules: | | simplifierRules: | | ||
if ADDABLE or DELETABLE or CLONEABLE; | if ADDABLE or DELETABLE or CLONEABLE; | ||
if DATA = " | if TYPE = "a" and TAG_TYPE = OPENING; | ||
if DATA ~ "\\ | if TYPE = "a" and TAG_TYPE = CLOSING; | ||
if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>"; | |||
if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+"; | |||
</pre> | </pre> | ||
==== ITS ==== | |||
ITS rules are XML. Therefore, XML entity escaping happens before the simplifier rule parser sees the rules. | |||
* To pass a literal <code><</code> or <code>></code> character to the rule parser from an ITS file, write <code>&lt;</code> or <code>&gt;</code> in the ITS file. | |||
* To compare against a <code>DATA</code> value that itself contains the literal characters <code>&lt;</code> and <code>&gt;</code>, write <code>&amp;lt;</code> and <code>&amp;gt;</code> in the ITS file. | |||
* The same escaping rules apply to <code>OUTER_DATA</code>. For example, to let the rule parser see a regular expression containing <code><ph</code>, write <code>&lt;ph</code> in the ITS file. | |||
<pre> | <pre> | ||
<?xml version="1.0" encoding="UTF-8"?> | |||
<its:rules | |||
xmlns:its="http://www.w3.org/2005/11/its" | |||
version="1.0" | |||
xmlns:itsx="http://www.w3.org/2008/12/its-extensions" | |||
xmlns:okp="okapi-framework:xmlfilter-options"> | |||
<its:translateRule selector="//*" translate="yes"/> | |||
<its:withinTextRule selector="//codeph" withinText="yes"/> | |||
<its:withinTextRule selector="//ph" withinText="yes"/> | |||
<okp:simplifierRules | |||
moveLeadingAndTrailingCodesToSkeleton="yes" | |||
mergeAdjacentCodes="yes"> | |||
if ADDABLE or DELETABLE or CLONEABLE; | |||
if TYPE = "a" and TAG_TYPE = OPENING; | |||
if TYPE = "a" and TAG_TYPE = CLOSING; | |||
# DATA example: matches a code whose DATA value literally contains &lt;/a&gt;. | |||
# Because this rule is inside ITS/XML, the ampersands are escaped as &amp;. | |||
if DATA = "&amp;lt;/a&amp;gt;" and TAG_TYPE = CLOSING; | |||
# OUTER_DATA example: matches outer inline-code markup such as <ph id="1">code</ph>. | |||
# Because this rule is inside ITS/XML, the < and > characters in the regex are escaped. | |||
if OUTER_DATA ~ ".*&lt;ph[^&gt;]*&gt;.*&lt;/ph&gt;.*"; | |||
</okp:simplifierRules> | |||
</its:rules> | |||
</pre> | </pre> | ||
The <code>DATA</code> example above is for the case where the value stored in <code>DATA</code> is the literal string <code>&lt;/a&gt;</code>. If the value stored in <code>DATA</code> is the raw string <code></a></code> instead, then the ITS file only needs one XML escaping level: | |||
<pre> | |||
<okp:simplifierRules | |||
moveLeadingAndTrailingCodesToSkeleton="yes" | |||
mergeAdjacentCodes="yes"> | |||
if DATA = "&lt;/a&gt;" and TAG_TYPE = CLOSING; | |||
</okp:simplifierRules> | |||
</pre> | |||
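The two escaping levels described above can be checked with any XML parser. The following is a minimal Python sketch, not Okapi code: the <code>rules</code> wrapper element and the helper name <code>rules_seen_by_parser</code> are hypothetical, and the sketch only shows what a standard XML parser hands on after decoding entities once.

```python
import xml.etree.ElementTree as ET


def rules_seen_by_parser(xml_fragment: str) -> str:
    """Return the element text after the XML parser has decoded entities once.

    Hypothetical helper: it stands in for whatever reads the ITS file and
    passes the rule text on to the simplifier rule parser.
    """
    return ET.fromstring(xml_fragment).text


# One escaping level: the rule parser sees the raw characters < and >.
one_level = rules_seen_by_parser('<rules>if DATA = "&lt;/a&gt;";</rules>')

# Two escaping levels: the rule parser sees the literal text &lt;/a&gt;.
two_levels = rules_seen_by_parser('<rules>if DATA = "&amp;lt;/a&amp;gt;";</rules>')

print(one_level)
print(two_levels)
```

Which of the two forms is needed depends on what the filter actually stores in <code>DATA</code>, as described above.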
==== FPRM / Parameters ==== | |||
<pre> | <pre> | ||
#v1 | #v1 | ||
extractNotes.b=true | extractNotes.b=true | ||
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if TYPE = "a" and TAG_TYPE = OPENING; if TYPE = "a" and TAG_TYPE = CLOSING; | |||
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if | </pre> | ||
==Font Mapping== | |||
Font mapping is a filter's ability to substitute font information in the target document automatically, according to a provided configuration. This helps reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters. | |||
The following font mapping configuration options are available: | |||
* The source locale regular expression pattern: <code>.*</code>, <code>en.*</code>, <code>en-UK</code>, etc. It can be omitted to apply the mapping to any source locale. | |||
* The target locale regular expression pattern: <code>.*</code>, <code>ru.*</code>, <code>ru-RU</code>, etc. It can be omitted to apply the mapping to any target locale. | |||
* The source font name regular expression pattern: <code>.*</code>, <code>Arial.*</code>, <code>Times New Roman</code>, etc. It can be omitted to apply the mapping to any source font name found. | |||
* The target font name: <code>Arial</code>, <code>Times New Roman</code>, etc. It must not be empty; if it is, the mapping configuration is ignored. | |||
The configured font mappings are applied in the order in which they are listed, and the final target font is determined by substituting the font value sequentially. For example, if there are two mappings: | |||
# <code>Arial</code> -> <code>Times New Roman</code> | |||
# <code>Times New Roman</code> -> <code>Sans Serif</code> | |||
then the first mapping replaces <code>Arial</code> with <code>Times New Roman</code>, and the second one is applied to this new value, ending up with <code>Sans Serif</code>. | |||
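The sequential substitution described above can be sketched in Python as follows. This is an illustration only, not the filters' implementation: the <code>FontMapping</code> class and the <code>map_font</code> function are hypothetical names, and matching each pattern against the whole value with <code>re.fullmatch</code> is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class FontMapping:
    """Hypothetical model of one mapping; omitted patterns default to '.*'."""
    target_font: str
    source_locale_pattern: str = ".*"
    target_locale_pattern: str = ".*"
    source_font_pattern: str = ".*"


def map_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous one."""
    for m in mappings:
        if not m.target_font:
            continue  # an empty target font means the mapping is ignored
        # Assumption: every pattern must match the whole value.
        if (re.fullmatch(m.source_locale_pattern, source_locale)
                and re.fullmatch(m.target_locale_pattern, target_locale)
                and re.fullmatch(m.source_font_pattern, font)):
            font = m.target_font
    return font


mappings = [
    FontMapping("Times New Roman", source_font_pattern="Arial"),
    FontMapping("Sans Serif", source_font_pattern="Times New Roman"),
]

# Arial becomes Times New Roman, which the second mapping turns into Sans Serif.
result = map_font("Arial", "en-US", "ru-RU", mappings)
print(result)
```

A font that matches neither source pattern, such as <code>Courier</code> here, is returned unchanged.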
The serialised parameters can look like this: | |||
<pre> | |||
fontMappings.0.sourceLocalePattern=en.* | |||
fontMappings.0.targetLocalePattern=ru.* | |||
fontMappings.0.sourceFontPattern=Times.* | |||
fontMappings.0.targetFont=Arial Unicode MS | |||
fontMappings.1.sourceLocalePattern=ru | |||
fontMappings.1.targetLocalePattern=fr | |||
fontMappings.1.sourceFontPattern=The Sims Sans | |||
fontMappings.1.targetFont=Arial Unicode MS | |||
fontMappings.number.i=2 | |||
</pre> | |||
When source locale, target locale and source font are omitted: | |||
<pre> | |||
fontMappings.0.targetFont=Arial Unicode MS | |||
fontMappings.number.i=1 | |||
</pre> | |||
This is equivalent to: | |||
<pre> | |||
fontMappings.0.sourceLocalePattern=.* | |||
fontMappings.0.targetLocalePattern=.* | |||
fontMappings.0.sourceFontPattern=.* | |||
fontMappings.0.targetFont=Arial Unicode MS | |||
fontMappings.number.i=1 | |||
</pre> | </pre> | ||
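To make the equivalence of the last two serialisations concrete, here is a small Python sketch that groups <code>fontMappings.N.*</code> entries and fills in <code>.*</code> for omitted patterns. It is not Okapi's parameter reader: the function name <code>parse_font_mappings</code> and the returned dictionary layout are assumptions made for illustration.

```python
def parse_font_mappings(text):
    """Group 'fontMappings.N.key=value' lines into a list of dictionaries.

    Hypothetical helper; omitted patterns default to '.*', as described above.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments such as '#v1', and malformed lines
        key, value = line.split("=", 1)
        props[key] = value

    count = int(props.get("fontMappings.number.i", "0"))
    mappings = []
    for i in range(count):
        prefix = f"fontMappings.{i}."
        mappings.append({
            "sourceLocalePattern": props.get(prefix + "sourceLocalePattern", ".*"),
            "targetLocalePattern": props.get(prefix + "targetLocalePattern", ".*"),
            "sourceFontPattern": props.get(prefix + "sourceFontPattern", ".*"),
            "targetFont": props.get(prefix + "targetFont", ""),
        })
    return mappings


short_form = """fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1"""

long_form = """fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1"""

print(parse_font_mappings(short_form) == parse_font_mappings(long_form))
```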
[[Category:Filters]] | [[Category:Filters]] | ||
Latest revision as of 17:09, 29 May 2026
Filters are the components that convert input documents from their native file format into a common internal set of resources that all Okapi components use. The extracted content can be re-written into the original file format. When using the steps, the extraction is done by the Raw Document to Filter Events Step and the re-writing by the Filter Events to Raw Document Step.
Note: The Okapi Filters Plugin for OmegaT allows you to use some of the filters directly from OmegaT.
List of the Filters
The framework distribution comes with the following filters:
Supported File Formats
The following is a list of some of the file formats supported by the distribution through pre-defined configurations:
| Format | Extensions | Pre-Defined Configuration | Filter | Notes |
|---|---|---|---|---|
| Android Strings | .xml | okf_xml-AndroidStrings | XML Filter | |
| Apple Stringsdict | .stringsdict | okf_xml-AppleStringsdict | XML Filter | |
| Archive | .zip | okf_archive | Archive Filter | Meta filter that processes zip files with various formats as one file. |
| Auto Xliff | .xlf, .xliff | okf_autoxliff | Auto Xliff Filter | Detects the version of an XLIFF file and then hands parsing off to the appropriate filter |
| AutoCAD DXF | .dxf | okf_dxf | DXF Filter | Only supports textual DXF, not binary DXF |
| CSV (Comma-separated values files) | .csv, .txt | okf_table_csv | Table Filter | |
| CSV (Multiple complex sub-formats) | .csv | okf_multiparsers | Multi-Parsers Filter | |
| DITA | .dita, .ditamap, .xml | okf_xmlstream-dita | XML Stream Filter | |
| DocBook v5.0 | .xml | okf_xml-docbook | XML Filter | Since Okapi 1.42. <footnote> is not handled properly. |
| DokuWiki pages | .txt | okf_wiki | Wiki Filter | |
| Doxygen-commented files | .c, .h, .cpp | okf_doxygen | Doxygen Filter | |
| DTD | .dtd | okf_dtd | DTD Filter | |
| EPUB | .epub | okf_epub | EPUB Filter | |
| Fixed-Width Columns Table | .txt | okf_table_fwc | Table Filter | |
| Idiom WorldServer XLIFF | .xlf | okf_xliff-iws | XLIFF Filter | |
| InCopy ICML | .wcml | okf_icml | ICML Filter | |
| InDesign IDML | .idml | okf_idml | IDML Filter | |
| iOS/Mac Strings | .strings | okf_regex-macStrings | Regex Filter | |
| Java Properties | .properties | okf_properties | Properties Filter | |
| Java Properties (Output not escaped) | .properties | okf_properties-outputNotEscaped | Properties Filter | |
| Java XML Properties | .xml | okf_xml-JavaProperties | XML Filter | |
| Java XML Properties (HTML strings) | .xml | okf_xmlstream-JavaPropertiesHTML | XML Stream Filter | |
| JSON | .json | okf_json | JSON Filter | |
| Haiku CatKeys | .catkeys | okf_table_catkeys | Table Filter | |
| HTML (any) | .html, .htm | okf_html | HTML Filter | |
| HTML (Well-formed, and XHTML) | .html, .htm | okf_html-wellFormed | HTML Filter | |
| HTML5 (and XHTML5) | .html, .htm | okf_itshtml5 | HTML5-ITS Filter | |
| Markdown | .md | okf_markdown | Markdown Filter | |
| Microsoft Excel 2007/2010 | .xlsx, .xlsm, .xltx, .xltm | okf_openxml | OpenXML Filter | |
| Microsoft PowerPoint 2007/2010 | .pptx, .pptm, .potx, .potm, .ppsx, .ppsm | okf_openxml | OpenXML Filter | |
| Microsoft Visio | .vsdx, .vsdm | okf_openxml | OpenXML Filter | |
| Microsoft Word 2007/2010 | .docx, .docm, .dotx, .dotm | okf_openxml | OpenXML Filter | |
| MIF | .mif | okf_mif | MIF Filter | |
| Moses Text | .txt | okf_mosestext | Moses Text Filter | |
| OpenOffice.org Calc | .ods, .ots | okf_odf | OpenOffice Filter | |
| OpenOffice.org Draw | .odg, .otg | okf_odf | OpenOffice Filter | |
| OpenOffice.org Impress | .odp, .otp | okf_odf | OpenOffice Filter | |
| OpenOffice.org Writer | .odt, .ott | okf_odf | OpenOffice Filter | |
| | | okf_pdf | PDF Filter | |
| Pensieve TM | .pentm | okf_pensieve | Pensieve TM Filter | |
| PHP Content | .php | okf_phpcontent | PHP Content Filter | Can be used as a subfilter only |
| Plain Text (Line = text unit) | .txt | okf_plaintext | Plain Text Filter | |
| Plain Text (Paragraph = text unit) | .txt | okf_plaintext_paragraphs | Plain Text Filter | |
| PO | .po | okf_po | PO Filter | |
| PO (Monolingual style) | .po | okf_po-monolingual | PO Filter | |
| Rainbow Translation Kit manifests | .rkm | okf_rainbowkit | Rainbow Translation Kit Filter | Used as a tkit reader only |
| Regex (Any text-based format) | .txt | okf_regex | Regex Filter | |
| RDF (Mozilla RDF) | .rdf | okf_xml-MozillaRDF | XML Filter | |
| RESX | .resx | okf_xml-resx | XML Filter | |
| SDLPPX | .sdlppx | okf_sdlpackage | SDL Trados Package Filter | |
| SDLRPX | .sdlrpx | okf_sdlpackage | SDL Trados Package Filter | |
| SDLXLIFF | .sdlxlf | okf_xliff-sdl | XLIFF Filter | |
| Skype Language Files | .lang | okf_properties-skypeLang | Properties Filter | |
| SRT (Sub-Rip Text, sub-titles files) | .srt | okf_regex-srt | Regex Filter | |
| Tab-Delimiter files | .tsv, .txt | okf_table_tsv | Table Filter | |
| Tex files | .tex | okf_tex | TEX Filter | |
| TMX | .tmx | okf_tmx | TMX Filter | |
| Transifex project | .txp | okf_transifex | Transifex Filter | |
| Trados-Tagged RTF | .rtf | okf_tradosrtf | Trados-Tagged RTF Filter | |
| TS - Qt TS files | .ts | okf_ts | TS Filter | |
| TTX - Trados TagEditor TTX files | .ttx | okf_ttx | TTX Filter | |
| TXML - Wordfast Pro TXML files | .txml | okf_txml | TXML Filter | |
| Vignette Export/Import Content | .xml | okf_vignette | Vignette Filter | |
| WSXZ Package Filter | .wsxz | okf_wsxzpackage | WSXZ Package Filter | |
| XHTML | .html, .htm | okf_html-wellFormed | HTML Filter | |
| WIX (Windows Installer XML) localization files | .wix | okf_xml-WixLocalization | XML Filter | |
| XLIFF v1.2 | .xlf, .xliff | okf_xliff | XLIFF Filter | |
| XLIFF v2 | .xlf | okf_xliff2 | XLIFF-2 Filter | |
| XML (Generic, using ITS defaults) | .xml | okf_xml | XML Filter | |
| XML (Generic, using stream reader) | .xml | okf_xmlstream | XML Stream Filter | |
| YAML (Generic YAML filter) | .yml, .yaml | okf_yaml | YAML Filter | |
| Message Format (ICU Message Format Filter) | Any container format that supports subfilters | okf_messageformat | Message Format Filter | |
Note that most filters allow you to create your own configurations to support more file formats.
Code Simplification Rules
There are two levels of code simplification: filter and step (the Inline Codes Simplifier Step and the Post-segmentation Inline Codes Removal Step). It can be configured in several ways:
Firstly, the extraction pipeline can contain just:
* Raw Document to Filter Events Step
At the moment, only IDML Filter, XML Filter and Simplification Filter support this. Note that the last one acts as a wrapper around another filter.
Secondly, the extraction pipeline can look like this:
* Raw Document to Filter Events Step
* Inline Codes Simplifier Step
This is the only option for filters that do not support their own code simplification, and it should be used with care because the final merge may not always handle this correctly. The aforementioned IDML Filter and XML Filter can perform their own simplification, and the added Inline Codes Simplifier Step should not affect the events produced.
Thirdly, the extraction pipeline can consist of:
* Raw Document to Filter Events Step
* Segmentation Step
* Post-segmentation Inline Codes Removal Step
Here, the Post-segmentation Inline Codes Removal Step performs code simplification after segmentation rules are applied, and it may be useful for skipping extra codes between segments.
By default, the Inline Codes Simplifier Step and Post-segmentation Inline Codes Removal Step maximise the trimming and merging (aka simplification) of inline codes. This can be tuned via the following string parameters:
* removeLeadingTrailingCodes - true by default
* mergeCodes - true by default
* rules - empty by default
Only the Inline Codes Simplifier Step configuration can be overridden by a filter, through the following optional filter parameters:
* moveLeadingAndTrailingCodesToSkeleton - maps to the removeLeadingTrailingCodes parameter
* mergeAdjacentCodes - maps to the mergeCodes parameter
* simplifierRules - maps to the rules parameter
Simplification rules prevent specific codes from being trimmed or merged. When a rule matches a code, it means: do not simplify this code.
General Syntax
The rules parser ignores irrelevant whitespace. Rules can be separated by spaces, newlines or nothing. This makes it easier to accommodate various container formats and their whitespace normalization rules. When a rule applies, it means "do not simplify the matched code". Uppercase tokens are constants and predefined by the rule parser. Multiple rules are always OR'ed together.
For more details, see the JavaCC grammar: ../okapi/core/src/main/javacc/SimplifierRules.jj
Each rule starts with if and ends with ;.
if TYPE = "bold";
Multiple rules are always OR'ed together. In other words, if any rule matches a code, that code is not simplified.
Within one rule, expressions can be combined with and, or and parentheses.
if TYPE = "bold" or (DATA = "<br/>" and TAG_TYPE = STANDALONE);
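The "any rule matches" semantics can be modelled in a few lines of Python. This is a conceptual sketch only, not the JavaCC-based parser: the <code>Code</code> class, the two hand-written predicates and <code>is_protected</code> are hypothetical stand-ins for parsed rules and Okapi's inline code objects.

```python
from dataclasses import dataclass


@dataclass
class Code:
    """Hypothetical stand-in for an inline code with the fields used in rules."""
    type: str
    data: str
    tag_type: str  # "OPENING", "CLOSING" or "STANDALONE"


def rule_bold(code):
    # Models: if TYPE = "bold";
    return code.type == "bold"


def rule_standalone_br(code):
    # Models: if DATA = "<br/>" and TAG_TYPE = STANDALONE;
    return code.data == "<br/>" and code.tag_type == "STANDALONE"


RULES = [rule_bold, rule_standalone_br]


def is_protected(code, rules=RULES):
    """A code is left alone (not simplified) as soon as any rule matches it."""
    return any(rule(code) for rule in rules)


print(is_protected(Code("bold", "<b>", "OPENING")))      # matched by the first rule
print(is_protected(Code("lb", "<br/>", "STANDALONE")))   # matched by the second rule
print(is_protected(Code("italic", "<i>", "OPENING")))    # matched by no rule
```

Writing the two conditions as separate rules, as here, gives the same result as joining them with or inside a single rule.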
The parser supports comments:
# This is a line comment
if TYPE = "bold";
/*
This is a block comment
*/
if TYPE = "italic";
Available Fields and Literals
The following string fields can be used with quoted string values:
| Field | Meaning |
|---|---|
| DATA | The raw/native data of the inline code. For XML/HTML-like tags this is normally the complete raw code, for example <a>, </a> or <br/>. Depending on the filter and the configuration format, the literal value may also contain escaped markup, for example &lt;/a&gt;. DATA is not just the tag name. |
| OUTER_DATA | The complete outer data of the inline code, if available. This is mainly relevant for formats that store inline codes as markup themselves, such as XLIFF or TMX. If no separate outer data is available, this returns the same value as DATA. |
| ORIGINAL_ID | The original inline-code ID from the filtered/input format, if available. If no original ID is available, it behaves like an empty string. |
| TYPE | The abstract type of the inline code. This is often the tag name or a semantic type, for example "a", "bold", "italic", "underline" or "lb". Opening and closing codes that belong together normally have the same TYPE. |
The following field is used with unquoted tag-type literals:
| Field | Allowed values |
|---|---|
| TAG_TYPE | OPENING, CLOSING or STANDALONE |
Important: TAG_TYPE is written with an underscore. TAGTYPE is invalid.
The tag-type literals are not strings and must not be quoted:
if TAG_TYPE = OPENING; if TAG_TYPE = CLOSING; if TAG_TYPE = STANDALONE;
This is invalid:
if TAG_TYPE = "CLOSING";
The following boolean flag literals can be used without an operator:
| Literal | Meaning |
|---|---|
| ADDABLE | Matches codes for which code.isAdded() is true. Despite the rule name, this means that the code was added after extraction and was not found in the original source. |
| DELETABLE | Matches codes for which code.isDeleteable() is true. Such a code may be removed from the text, for example a formatting code, unlike a required placeholder such as %s. |
| CLONEABLE | Matches codes for which code.isCloneable() is true. Such a code may be duplicated in the text, for example a formatting code, unlike a placeholder that must occur exactly once. |
Since a matching rule means do not simplify this code, a rule such as the following protects all codes that match any of these three flags from trimming and merging:
if ADDABLE or DELETABLE or CLONEABLE;
This rule is useful if the information expressed by these flags should be preserved and the code should not be merged with neighbouring codes or moved out of the text unit as a leading/trailing code.
Operators
String fields such as DATA, OUTER_DATA, ORIGINAL_ID and TYPE can be matched with quoted string values:
| Operator | Meaning | Example |
|---|---|---|
| = | Exact string match | if TYPE = "bold"; |
| != | Negated exact string match | if TYPE != "bold"; |
| ~ | Regular expression match | if DATA ~ ".*meta-ref.*"; |
| !~ | Negated regular expression match | if DATA !~ ".*meta-ref.*"; |
TAG_TYPE is matched against the unquoted tag-type literals:
if TAG_TYPE = OPENING;
if TAG_TYPE != CLOSING;
Boolean flag literals are used directly:
if DELETABLE;
if DELETABLE or CLONEABLE;
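As a rough illustration of the four string operators, the following Python sketch evaluates a single comparison. The `match` helper is hypothetical. It assumes that ~ must match the whole field value, which is inferred from the .* wrapping used in the examples rather than confirmed from the grammar, and it uses Python regular expressions, which differ from Java's in some details.

```python
import re


def match(op, field_value, literal):
    """Evaluate one string comparison.

    Assumption: '~' and '!~' test the whole value (full match),
    not a substring search.
    """
    if op == "=":
        return field_value == literal
    if op == "!=":
        return field_value != literal
    if op == "~":
        return re.fullmatch(literal, field_value) is not None
    if op == "!~":
        return re.fullmatch(literal, field_value) is None
    raise ValueError(f"unknown operator: {op}")
```

Under this assumption, `match("~", '<a href="x">', "<a[ >].*")` is true, while the same pattern does not match `<abbr>` because the character after `<a` must be a space or `>`.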
Rule Examples
If a code has any of these flags, then do not simplify it:
if DELETABLE or ADDABLE or CLONEABLE;
Match an opening "a" code by abstract type:
if TYPE = "a" and TAG_TYPE = OPENING;
Match a closing "a" code by abstract type:
if TYPE = "a" and TAG_TYPE = CLOSING;
Match a standalone "br" code by abstract type:
if TYPE = "br" and TAG_TYPE = STANDALONE;
Match by raw data. For XML/HTML-like tags, DATA usually contains the complete raw tag, not just the tag name:
if DATA = "<a>" and TAG_TYPE = OPENING;
if DATA = "</a>" and TAG_TYPE = CLOSING;
if DATA = "<br/>" and TAG_TYPE = STANDALONE;
Depending on the filter and the configuration format in which the rule is stored, XML/HTML characters may need to be escaped. For example, if the literal DATA value is &lt;/a&gt;, use:
if DATA = "&lt;/a&gt;" and TAG_TYPE = CLOSING;
Use TYPE when you want to match the abstract code type or tag name. Use DATA when you want to match the actual raw/native inline-code content.
Regular expression match:
if DATA ~ ".*meta-ref.*";
Regular expression match for opening "a" tags where the raw tag may contain attributes:
if DATA ~ "<a[ >].*" and TAG_TYPE = OPENING;
Negated regular expression match:
if DATA !~ ".*meta-ref.*";
Match on code type; in this case, do not simplify line break codes:
if TYPE = "lb";
Do not simplify rich text types:
if TYPE = "bold" or TYPE = "italic" or TYPE = "underline";
Expressions can be recursive and support parentheses:
if TYPE = "bold" or (DATA = "bar" or (DATA = "foo" and TYPE = "underline"));
Match by original ID, if the filter provides one:
if ORIGINAL_ID = "1";
if ORIGINAL_ID ~ "b[0-9]+";
Match by outer data, for formats that store inline codes as markup themselves, such as XLIFF or TMX:
if OUTER_DATA ~ ".*<ph.*";
Match flag literals individually:
# Do not simplify codes added after extraction
if ADDABLE;
# Do not simplify codes that may be removed
if DELETABLE;
# Do not simplify codes that may be duplicated
if CLONEABLE;
Or combine them in one rule:
if ADDABLE or DELETABLE or CLONEABLE;
Filter Config Examples
Examples of using simplifier rules within the filter configuration formats used by Okapi.
YAML
simplifierRules: |
  if ADDABLE or DELETABLE or CLONEABLE;
  if TYPE = "a" and TAG_TYPE = OPENING;
  if TYPE = "a" and TAG_TYPE = CLOSING;
  if DATA = "<br/>" or DATA = "<font>" or DATA = "</font>" or DATA = "</a>";
  if DATA ~ "\\<font.+" or DATA ~ "\\<img.+" or DATA ~ "\\<a.+";
ITS
ITS rules are XML. Therefore, XML entity escaping happens before the simplifier rule parser sees the rules.
- To pass a literal < or > character to the rule parser from an ITS file, write &lt; or &gt; in the ITS file.
- To compare against a DATA value that itself contains the literal characters &lt; and &gt;, write &amp;lt; and &amp;gt; in the ITS file.
- The same escaping rules apply to OUTER_DATA. For example, to let the rule parser see a regular expression containing <ph, write &lt;ph in the ITS file.
<?xml version="1.0" encoding="UTF-8"?>
<its:rules
xmlns:its="http://www.w3.org/2005/11/its"
version="1.0"
xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
xmlns:okp="okapi-framework:xmlfilter-options">
<its:translateRule selector="//*" translate="yes"/>
<its:withinTextRule selector="//codeph" withinText="yes"/>
<its:withinTextRule selector="//ph" withinText="yes"/>
<okp:simplifierRules
moveLeadingAndTrailingCodesToSkeleton="yes"
mergeAdjacentCodes="yes">
if ADDABLE or DELETABLE or CLONEABLE;
if TYPE = "a" and TAG_TYPE = OPENING;
if TYPE = "a" and TAG_TYPE = CLOSING;
# DATA example: matches a code whose DATA value literally contains &lt;/a&gt;.
# Because this rule is inside ITS/XML, the ampersands are escaped as &amp;.
if DATA = "&amp;lt;/a&amp;gt;" and TAG_TYPE = CLOSING;
# OUTER_DATA example: matches outer inline-code markup such as &lt;ph id="1"&gt;code&lt;/ph&gt;.
# Because this rule is inside ITS/XML, the &lt; and &gt; characters in the regex are escaped.
if OUTER_DATA ~ ".*&lt;ph[^&gt;]*&gt;.*&lt;/ph&gt;.*";
</okp:simplifierRules>
</its:rules>
The DATA example above is for the case where the value stored in DATA is the literal string &lt;/a&gt;. If the value stored in DATA is the raw string </a> instead, then the ITS file only needs one XML escaping level:
<okp:simplifierRules
moveLeadingAndTrailingCodesToSkeleton="yes"
mergeAdjacentCodes="yes">
if DATA = "&lt;/a&gt;" and TAG_TYPE = CLOSING;
</okp:simplifierRules>
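The two escaping levels can be checked with any XML parser. The following Python sketch uses the standard library's `xml.etree.ElementTree` to show what text reaches a rule parser after entity resolution. The element name `r` and the helper name are placeholders for illustration only, not part of ITS or Okapi.

```python
import xml.etree.ElementTree as ET


def rule_text_seen_by_parser(xml_fragment):
    """Return the element text after the XML parser has resolved entities."""
    return ET.fromstring(xml_fragment).text


# One escaping level: the rule parser receives raw angle brackets.
one_level = '<r>if DATA = "&lt;/a&gt;";</r>'

# Two escaping levels: the rule parser receives the entity text itself.
two_levels = '<r>if DATA = "&amp;lt;/a&amp;gt;";</r>'

print(rule_text_seen_by_parser(one_level))
print(rule_text_seen_by_parser(two_levels))
```

The first call yields a rule that compares DATA against the raw closing tag; the second yields a rule that compares DATA against the escaped form, which is what is needed when the filter stores escaped markup in DATA.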
FPRM / Parameters
#v1
extractNotes.b=true
simplifierRules=if ADDABLE or DELETABLE or CLONEABLE; if TYPE = "a" and TAG_TYPE = OPENING; if TYPE = "a" and TAG_TYPE = CLOSING;
Font Mapping
Font mapping is a filter's ability to automatically substitute font information in the target document according to a provided configuration. This helps to reduce the amount of reformatting and post-translation DTP. It is currently supported by the IDML and OpenXML (DOCX, PPTX and XLSX documents) filters.
The following font mapping configuration options are available:
- The source locale regular expression pattern: .*, en.*, en-UK, etc. It can be omitted to apply the mapping to any source locale.
- The target locale regular expression pattern: .*, ru.*, ru-RU, etc. It can be omitted to apply the mapping to any target locale.
- The source font name regular expression pattern: .*, Arial.*, Times New Roman, etc. It can be omitted to apply the mapping to any source font name found.
- The target font name: Arial, Times New Roman, etc. It must not be empty; if it is, the mapping configuration is ignored.
The configured font mappings are applied in the order in which they are listed, and the final target font is determined by substituting the source font values sequentially. For example, if there is more than one mapping:
- Arial -> Times New Roman
- Times New Roman -> Sans Serif
then the first mapping replaces Arial with Times New Roman, and the second one is applied to this new value, so the final font is Sans Serif.
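The sequential substitution can be sketched in Python as follows. This is a simplified model, not the filters' actual implementation: the `FontMapping` class and `map_font` function are hypothetical, and the sketch assumes that each pattern must match the whole locale or font name and that matching is case-sensitive.

```python
import re
from dataclasses import dataclass


@dataclass
class FontMapping:
    """Hypothetical model of one mapping; omitted patterns default to '.*'."""
    target_font: str
    source_locale_pattern: str = ".*"
    target_locale_pattern: str = ".*"
    source_font_pattern: str = ".*"


def map_font(font, source_locale, target_locale, mappings):
    """Apply the mappings in order; each one sees the result of the previous."""
    for m in mappings:
        if not m.target_font:
            continue  # an empty target font means the mapping is ignored
        if (re.fullmatch(m.source_locale_pattern, source_locale)
                and re.fullmatch(m.target_locale_pattern, target_locale)
                and re.fullmatch(m.source_font_pattern, font)):
            font = m.target_font
    return font


mappings = [
    FontMapping("Times New Roman", source_font_pattern="Arial"),
    FontMapping("Sans Serif", source_font_pattern="Times New Roman"),
]
```

With these two mappings, `map_font("Arial", "en", "ru", mappings)` passes through Times New Roman and ends at Sans Serif, while a font that matches neither pattern is returned unchanged.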
The parameters serialisation format can look like this:
fontMappings.0.sourceLocalePattern=en.*
fontMappings.0.targetLocalePattern=ru.*
fontMappings.0.sourceFontPattern=Times.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.1.sourceLocalePattern=ru
fontMappings.1.targetLocalePattern=fr
fontMappings.1.sourceFontPattern=The Sims Sans
fontMappings.1.targetFont=Arial Unicode MS
fontMappings.number.i=2
When source locale, target locale and source font are omitted:
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
This is equivalent to:
fontMappings.0.sourceLocalePattern=.*
fontMappings.0.targetLocalePattern=.*
fontMappings.0.sourceFontPattern=.*
fontMappings.0.targetFont=Arial Unicode MS
fontMappings.number.i=1
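The equivalence of the short and the explicit form can be illustrated with a small reader for these key=value lines. This is a minimal Python sketch under the assumption that the serialized form is one key=value pair per line; `parse_font_mappings` is a hypothetical helper and does not reflect how Okapi reads its parameters.

```python
def parse_font_mappings(text):
    """Collect fontMappings.<index>.<key>=value lines into a list of dicts."""
    entries = {}
    count = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".")
        if parts[0] != "fontMappings":
            continue
        if parts[1:] == ["number", "i"]:
            count = int(value)
        elif len(parts) == 3 and parts[1].isdigit():
            entries.setdefault(int(parts[1]), {})[parts[2]] = value
    # Omitted patterns fall back to ".*", i.e. "match anything".
    defaults = {
        "sourceLocalePattern": ".*",
        "targetLocalePattern": ".*",
        "sourceFontPattern": ".*",
    }
    return [{**defaults, **entries.get(i, {})} for i in range(count)]


short_form = "fontMappings.0.targetFont=Arial Unicode MS\nfontMappings.number.i=1"
explicit_form = (
    "fontMappings.0.sourceLocalePattern=.*\n"
    "fontMappings.0.targetLocalePattern=.*\n"
    "fontMappings.0.sourceFontPattern=.*\n"
    "fontMappings.0.targetFont=Arial Unicode MS\n"
    "fontMappings.number.i=1"
)
```

In this model both forms parse to the same single mapping, because the three omitted patterns are filled in with .* by default.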