Okapi Framework - Filters

HTML Filter

- Overview
- Processing Details
- Parameters

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=HTML_Filter

Overview

The HTML Filter is an Okapi component that implements the IFilter interface for HTML and XHTML documents. The filter is implemented in the class net.sf.okapi.filters.html.HtmlFilter of the Okapi library.

In the current version of the filter, the content of <style> and <script> elements is not extracted. Tags from server-side scripts such as PHP, ASPX, or JSP are not formally supported and are treated as non-translatable.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

Output Encoding

If the output encoding is UTF-8:

Line-Breaks

The output uses the same type of line-break as the original input.
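
The behavior above can be sketched as follows. This is only an illustration, not the filter's actual code: the class LineBreakSketch and both method names are hypothetical, and it assumes that the first line-break found in the input determines the type used for the output.

```java
public class LineBreakSketch {

    /** Returns the first line-break sequence found in the text, or "\n" if there is none. */
    public static String detectLineBreak(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return "\n"; // assumption: fall back to LF when the input has no line-break
    }

    /** Normalizes all line-breaks to "\n", then rewrites them with the given sequence. */
    public static String withLineBreak(String text, String lineBreak) {
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        return normalized.replace("\n", lineBreak);
    }

    public static void main(String[] args) {
        String input = "<p>one</p>\r\n<p>two</p>\r\n";
        String lineBreak = detectLineBreak(input);
        // Text handled internally with "\n" is written back with the input's line-break type.
        String output = withLineBreak("<p>un</p>\n<p>deux</p>\n", lineBreak);
        System.out.println(output.equals("<p>un</p>\r\n<p>deux</p>\r\n"));
    }
}
```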

Entities

Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.
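
As a rough illustration of this rule, the sketch below converts numeric references and a few named entities to Unicode and leaves every other entity untouched. It is not the filter's implementation: the class EntitySketch, the decode method, and the small table of named entities are assumptions made for the example, and it relies on Java 9 or later for Matcher.appendReplacement with a StringBuilder.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntitySketch {

    // Tiny illustrative subset of the HTML named character entities.
    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
        "nbsp", "\u00A0", "eacute", "\u00E9");

    // Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) forms.
    private static final Pattern ENTITY =
        Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: pass the entity through unchanged
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = hex ? Integer.parseInt(body.substring(2), 16)
                                    : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else {
                // Unknown names (for example, entities declared in a DTD) stay as they are.
                replacement = NAMED.getOrDefault(body, replacement);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&eacute; &#233; &#xE9; &custom;"));
    }
}
```

In this sketch &custom; stands for an entity declared elsewhere, so it is written back exactly as it was read.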

Parameters

Built-in Configuration Files

The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group is never created. A pre-defined maximalist configuration file can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well-formed, that is, they must have both a start tag and an end tag. Otherwise the filter returns an error.

HTML Configuration Syntax

If you are truly brave, you can create your own HTML configuration files. See defaultConfiguration.yml and maximalistConfiguration.yml for examples.

HTML tags are associated with rules. These rules are used by the filter to process the input document.

HTML Rule Types
INLINE A tag which may occur inside a text run. For example <b>, <i>, and <u>.
GROUP Defines a group of elements that are structurally bound. For example <table>, <div> and <menu>.
EXCLUDE Prevents extraction of any text until an end element of the same tag is found. For example, if the content of a <script> element should not be extracted, define <script> as EXCLUDE.
INCLUDE Overrides any current exclusions. This allows exceptions for children of EXCLUDEd tags.
TEXTUNIT A tag that starts a complex text unit. Examples include <p>, <title>, <h1>. Complex text units carry their surrounding tags along with any extracted text.
PRESERVE_WHITESPACE A tag that must preserve its whitespace and newlines as-is. For example <pre>.
ATTRIBUTES_ONLY A tag that has localizable or translatable attributes and does not.
ATTRIBUTE_TRANS A translatable attribute.
ATTRIBUTE_WRITABLE A writable or modifiable attribute, but not translatable.
ATTRIBUTE_READONLY A read-only attribute, extracted but may not be modified.
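
The association between tags and rules can be pictured as a lookup table. The sketch below is a hypothetical model, not the filter's API: the class RuleTypeSketch, the RuleType enum, and the rulesFor method are invented for illustration, and the mappings only repeat the example tags named in the list above.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class RuleTypeSketch {

    // The rule type names listed above, modeled as an enum for this example.
    public enum RuleType {
        INLINE, GROUP, EXCLUDE, INCLUDE, TEXTUNIT, PRESERVE_WHITESPACE,
        ATTRIBUTES_ONLY, ATTRIBUTE_TRANS, ATTRIBUTE_WRITABLE, ATTRIBUTE_READONLY
    }

    // Example tag-to-rule associations taken from the descriptions above.
    private static final Map<String, Set<RuleType>> ELEMENT_RULES = Map.of(
        "b", EnumSet.of(RuleType.INLINE),
        "i", EnumSet.of(RuleType.INLINE),
        "table", EnumSet.of(RuleType.GROUP),
        "div", EnumSet.of(RuleType.GROUP),
        "script", EnumSet.of(RuleType.EXCLUDE),
        "p", EnumSet.of(RuleType.TEXTUNIT),
        "h1", EnumSet.of(RuleType.TEXTUNIT),
        "pre", EnumSet.of(RuleType.PRESERVE_WHITESPACE));

    /** Returns the rules associated with a tag name, or an empty set if it has none. */
    public static Set<RuleType> rulesFor(String tag) {
        return ELEMENT_RULES.getOrDefault(
            tag.toLowerCase(Locale.ROOT), EnumSet.noneOf(RuleType.class));
    }

    public static void main(String[] args) {
        System.out.println(rulesFor("table")); // the GROUP rule
        System.out.println(rulesFor("span"));  // no rule in this sketch
    }
}
```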

Inline Code Finder

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables that need to be protected from modification and treated as codes. Use the useCodeFinder and codeFinderRules options for this.

useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"

The options above will set the text "VAR1" as an inline code in the following HTML:

<p>Number of files = VAR1</p>

Note that the regular expression is "\bVAR\d\b", but each backslash must itself be escaped in the YAML notation.
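
To see what the rule matches, the following sketch applies it with the standard java.util.regex classes. It does not use the Okapi API: the class CodeFinderSketch and its methods are hypothetical, and the way the "ruleN=" lines are read is an assumption based only on the option value shown above. Java string literals happen to use the same escapes as the YAML value, so the literal below decodes to three lines, the last being rule0=\bVAR\d\b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeFinderSketch {

    /** Collects the regular expressions from the "ruleN=..." lines of the decoded value. */
    public static List<Pattern> parseRules(String decodedRules) {
        List<Pattern> rules = new ArrayList<>();
        for (String line : decodedRules.split("\n")) {
            if (line.matches("rule\\d+=.*")) {
                rules.add(Pattern.compile(line.substring(line.indexOf('=') + 1)));
            }
        }
        return rules;
    }

    /** Returns every span of the text matched by one of the rules. */
    public static List<String> findCodes(String text, List<Pattern> rules) {
        List<String> codes = new ArrayList<>();
        for (Pattern rule : rules) {
            Matcher m = rule.matcher(text);
            while (m.find()) {
                codes.add(m.group());
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Same characters as the codeFinderRules value once the escapes are decoded.
        String decoded = "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b";
        System.out.println(findCodes("Number of files = VAR1", parseRules(decoded))); // prints [VAR1]
    }
}
```

Because the expression ends with \d\b, it matches "VAR" followed by exactly one digit: "VAR1" is captured, but "VAR12" and "VARX" are not.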