Okapi Framework - Filters
HTML Filter
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=HTML_Filter
The HTML Filter is an Okapi component that implements the IFilter interface for
HTML and XHTML documents. The filter is implemented in the class
net.sf.okapi.filters.html.HtmlFilter of the Okapi library.
In the current version of the filter, the content of <style> and
<script> elements is not extracted. Tags from server-side scripts such
as PHP, ASPX, and JSP are not formally supported and are treated as non-translatable.
The filter decides which encoding to use for the input document using the following logic:
If the output encoding is UTF-8:
The output uses the same type of line-break as the original input.
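The line-break rule can be pictured with a small Python sketch. This is only an illustration under assumptions, not the filter's own code (the filter is a Java class): the helper names `detect_line_break` and `rewrite_with_line_break` are hypothetical, and the sketch assumes the type of the first line-break found represents the whole input.

```python
def detect_line_break(text):
    # Return the first line-break sequence found in the text.
    # Falls back to a line feed when the text has no line-break at all.
    for i, ch in enumerate(text):
        if ch == "\r":
            # A carriage return followed by a line feed is a Windows break.
            return "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"


def rewrite_with_line_break(text, line_break):
    # Re-join the lines of the text with the requested line-break type.
    return line_break.join(text.splitlines())


source = "<p>one</p>\r\n<p>two</p>"
original_break = detect_line_break(source)

# Suppose processing normalized everything to line feeds internally.
internal = source.replace("\r\n", "\n")
output = rewrite_with_line_break(internal, original_break)
```

Here `output` ends up with the same Windows-style breaks as `source`, even though the intermediate text used line feeds only.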
Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.
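The effect on entities can be approximated with Python's standard `html.unescape` function. This is a sketch for illustration, not the filter's implementation: the sample string and the entity name `&custom;` are made up, and `&custom;` merely stands in for an entity declared in a DTD.

```python
import html

# One named entity, one decimal and one hexadecimal numeric entity,
# plus a name that is not a standard HTML entity (hypothetical DTD entity).
source = "caf&eacute; &#169; &#xA9; &custom;"

# Standard named and numeric entities become Unicode characters;
# the unknown name is left exactly as written.
converted = html.unescape(source)
```

After the call, `converted` contains the characters é and © in place of the first three entities, while `&custom;` is passed through unchanged.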
The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group will never be created. There is a pre-defined maximalist configuration file that can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well formed, that is, they must have a start and end tag. Otherwise the filter returns an error.
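To check a document before switching to a configuration that creates groups, a pre-check along the following lines may help. It is a minimal Python sketch built on the standard `html.parser` module, not part of Okapi: the set `GROUP_TAGS` and the function `check_groups` are hypothetical, and the real list of group tags depends on the configuration file in use.

```python
from html.parser import HTMLParser

# Hypothetical set of tags mapped to groups; adjust to the configuration used.
GROUP_TAGS = {"table", "ul", "ol", "div"}


class GroupChecker(HTMLParser):
    """Records group tags that are not properly paired."""

    def __init__(self, group_tags):
        super().__init__()
        self.group_tags = group_tags
        self.open_groups = []  # group tags opened and not yet closed
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag in self.group_tags:
            self.open_groups.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.group_tags:
            return
        if self.open_groups and self.open_groups[-1] == tag:
            self.open_groups.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def check_groups(html_text, group_tags=GROUP_TAGS):
    # Return a list of problems; an empty list means every group tag is paired.
    checker = GroupChecker(group_tags)
    checker.feed(html_text)
    checker.close()
    unclosed = [f"<{tag}> is never closed" for tag in checker.open_groups]
    return checker.errors + unclosed


problems = check_groups("<table><tr><td>x</td></tr>")
```

In this example `problems` reports the missing end tag for the table, which is the kind of input that would make the filter fail with a group-based configuration.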
HTML tags are associated with rules. These rules are used by the filter to process the input document.
| HTML Rule Type | Description |
|---|---|
| INLINE | A tag that may occur inside a text run. For example <b>, <i>, and <u>. |
| GROUP | Defines a group of elements that are structurally bound. For example <table>, <div> and <menu>. |
| EXCLUDE | Prevents extraction of any text until an end element of the same tag is found. For example, if the content of a <script> element should not be extracted, define <script> as EXCLUDE. |
| INCLUDE | Overrides any current exclusions. This allows exceptions for children of EXCLUDEd tags. |
| TEXTUNIT | A tag that starts a complex text unit. Examples include <p>, <title>, <h1>. Complex text units carry their surrounding tags along with any extracted text. |
| PRESERVE_WHITESPACE | A tag that must preserve its whitespace and newlines as-is. For example <pre>. |
| ATTRIBUTES_ONLY | A tag whose only extractable content is in its localizable or translatable attributes. |
| ATTRIBUTE_TRANS | A translatable attribute. |
| ATTRIBUTE_WRITABLE | A writable or modifiable attribute, but not translatable. |
| ATTRIBUTE_READONLY | A read-only attribute, extracted but may not be modified. |
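The interaction between EXCLUDE and INCLUDE can be sketched as follows. This is a simplified Python illustration using the standard `html.parser` module, not the filter's actual algorithm: the `RULES` table is hypothetical (in particular, treating <aside> as EXCLUDE and <q> as INCLUDE is only for the example), and the other rule types are ignored.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-rule assignments, for illustration only.
RULES = {
    "script": "EXCLUDE",
    "aside": "EXCLUDE",
    "q": "INCLUDE",
}


class RuleAwareExtractor(HTMLParser):
    """Collects text, honoring EXCLUDE and INCLUDE rules."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.state = []      # stack of (tag, extracting) pairs
        self.extracted = []

    def handle_starttag(self, tag, attrs):
        rule = self.rules.get(tag)
        if rule == "EXCLUDE":
            self.state.append((tag, False))
        elif rule == "INCLUDE":
            self.state.append((tag, True))

    def handle_endtag(self, tag):
        # The end tag of the same element restores the previous state.
        if self.state and self.state[-1][0] == tag:
            self.state.pop()

    def handle_data(self, data):
        extracting = not self.state or self.state[-1][1]
        if extracting and data.strip():
            self.extracted.append(data.strip())


def extract(html_text, rules=RULES):
    parser = RuleAwareExtractor(rules)
    parser.feed(html_text)
    parser.close()
    return parser.extracted


sample = (
    "<p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<aside>skip <q>keep this</q> skip</aside>"
    "<p>Bye</p>"
)
texts = extract(sample)
```

With this sample, the script content and the text directly inside <aside> are skipped, while the text inside the nested <q> is extracted because INCLUDE overrides the surrounding exclusion.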
You can define a set of regular expressions to capture spans of extracted text
that should be treated as inline codes. For example, some element content may
contain variables that need to be protected from modification and
treated as codes. Use the useCodeFinder and codeFinderRules
options for this.
useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"
The options above will set the text "VAR1" as an inline code in
the following HTML:
<p>Number of files = VAR1</p>
Note that the regular expression is "\bVAR\d\b", but each backslash
must also be escaped (doubled) in the YAML notation.
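The Python sketch below shows what the decoded rules string contains and which span the expression matches. It is an approximation, not the filter's code: the helpers `parse_rules` and `find_codes` are hypothetical, the rules format is inferred only from the example above, and Python's `re` module is assumed to behave like the filter's regular expression engine for this simple pattern.

```python
import re

# Decoded value of codeFinderRules. Python's string escapes happen to match
# YAML's double-quoted escapes here: \n is a newline, \\ is one backslash.
CODE_FINDER_RULES = "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"


def parse_rules(rules_text):
    # Read key=value lines and return the expressions rule0, rule1, ...
    entries = {}
    for line in rules_text.splitlines():
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key] = value
    count = int(entries.get("count.i", "0"))
    return [entries[f"rule{i}"] for i in range(count)]


def find_codes(text, patterns):
    # Return (start, end, matched_text) for every span matched by a pattern.
    spans = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), match.group()))
    return sorted(spans)


patterns = parse_rules(CODE_FINDER_RULES)
codes = find_codes("Number of files = VAR1", patterns)
```

After decoding, the single rule is the expression with single backslashes, and it matches only "VAR1" in the sample text. A string such as "VAR12" would not match, because the expression requires a word boundary right after the digit.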