HTML Filter
Overview
The HTML Filter is an Okapi component that implements the IFilter interface for HTML and XHTML documents.
Processing Details
Input Encoding
The filter decides which encoding to use for the input document using the following logic:
- If the document has an encoding declaration it is used.
- Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
Output Encoding
If the output encoding is UTF-8:
- If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
- If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
If the input file has no declared encoding, the filter tries to add one in the output: a <meta> tag for HTML files, or a <meta /> tag for XHTML files. The declaration is added only if the file contains a <head> element.
Line-Breaks
The output uses the same type of line-break as the original input.
Entities
Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.
Note that text entity declarations can be processed by the DTD Filter.
Parameters
Built-in Configuration
The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group will never be created.
There is a predefined maximalist configuration (okf_html-wellFormed) that can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well formed, that is, they must have both a start tag and an end tag. Otherwise the filter returns an error.
HTML Configuration Syntax
For the truly brave, you can create your own HTML configuration files. These configurations are written in YAML. See wellformedConfiguration.yml and nonwellformedConfiguration.yml for examples.
HTML tags are associated with rules. These rules are used by the filter to process the input document.
Notes:
- All attribute and element names should be in lowercase in the configuration file, regardless of their casing in the document.
- Elements or attributes with a prefix should be declared with the prefix, between single quotes, in the configuration (e.g. 'xml:lang').
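As an illustration of the two notes above, here is a minimal sketch of an elements section. The choice of elements and the use of 'xml:id' as an ID attribute are assumptions made for this example; they are not taken from the built-in configurations.

```yaml
elements:
  # Declared in lowercase, even if the document writes <IMG ALT="...">.
  img:
    ruleTypes: [ATTRIBUTES_ONLY]
    translatableAttributes: [alt, title]
  # Hypothetical use of a prefixed attribute: the name keeps its prefix
  # and is wrapped in single quotes.
  h1:
    ruleTypes: [TEXTUNIT]
    idAttributes: ['xml:id']
```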
Configuring Element Rules
The elements section of the configuration consists of a set of key-value pairs. Each key is an element name, and the value is the set of rules for that element, represented as another set of key-value pairs. An element declaration should include one or more of the available element rules:
ruleTypes | Basic description of how the filter treats this tag. See #Rule Types.
idAttributes | A list of attributes that may provide the segment ID for text contained within this element.
conditions | A condition that further restricts this rule. For example, to indicate that the element should only be handled if it contains an attribute with a certain value. See #Condition Syntax.
translatableAttributes | Contains information about translatable attributes in this element. See #Configuring Translatable Attributes.
elementType | Indicates the corresponding XLIFF 1.2 type value for this element.
writableLocalizationAttributes | Specifies attributes which are writable, but not translatable. (TODO)
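The following sketch shows how several of these element rules might be combined. It is an illustration only: the elementType value "bold" and the class value "code" are assumptions chosen for the example, not values confirmed by the built-in configurations.

```yaml
elements:
  # Inline formatting tag, mapped to an XLIFF 1.2 type (assumed value).
  b:
    ruleTypes: [INLINE]
    elementType: bold
  # Paragraphs become text units; the id attribute may supply the segment ID.
  p:
    ruleTypes: [TEXTUNIT]
    idAttributes: [id]
  # Hypothetical: the EXCLUDE rule applies only to <div class="code">.
  div:
    ruleTypes: [EXCLUDE]
    conditions: [class, EQUALS, code]
```

Here the conditions entry narrows the <div> rule, so a <div> without that class value would not be affected by it.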
Rule Types
The rule types are the following:
INLINE | A tag which may occur inside a text run. For example <b>, <i>, and <u>.
GROUP | Defines a group of elements that are structurally bound. For example <table>, <div> and <menu>.
EXCLUDE | Prevents extraction of any text until the end tag of the same element is found. For example, if the content of a <script> element should not be extracted, define <script> as EXCLUDE.
INCLUDE | Overrides any current exclusions. This allows exceptions for children of excluded (EXCLUDE) elements.
TEXTUNIT | A tag that starts a complex text unit. Examples include <p>, <title>, <h1>. Complex text units carry their surrounding tags along with any extracted text.
PRESERVE_WHITESPACE | A tag that must preserve its white spaces as-is. For example <pre>.
ATTRIBUTES_ONLY | A tag that has localizable or translatable attributes but does not have translatable content.
ATTRIBUTE_TRANS | A translatable attribute.
ATTRIBUTE_WRITABLE | A writable or modifiable attribute, but not translatable.
ATTRIBUTE_READONLY | A read-only attribute, extracted but that cannot be modified.
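As a rough sketch, the element-level rule types above could be assigned like this, using the example tags named in the table. The selection is illustrative and is not a copy of either built-in configuration. Note that GROUP rules rely on well-formed start and end tags, as described under Built-in Configuration.

```yaml
elements:
  i:
    ruleTypes: [INLINE]
  table:
    ruleTypes: [GROUP]
  script:
    ruleTypes: [EXCLUDE]
  h1:
    ruleTypes: [TEXTUNIT]
  pre:
    ruleTypes: [PRESERVE_WHITESPACE]
  # Assumed example: only the alt attribute of <img> is exposed.
  img:
    ruleTypes: [ATTRIBUTES_ONLY]
    translatableAttributes: [alt]
```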
Configuring Translatable Attributes
Translatable attributes may be specified in two ways, depending on the level of complexity needed.
If all the specified attributes should always be translated, they can be exposed as a simple list. For example, the definition for the <area> element specifies that the accesskey, area, and alt attributes are translatable:
area:
  ruleTypes: [ATTRIBUTES_ONLY]
  translatableAttributes: [accesskey, area, alt]
However, if additional restrictions on translatable attributes are present, the translatableAttributes rule may be specified as a set of key-value pairs, with each key being a translatable attribute and each value being an (optional) list of conditions, using the #Condition Syntax. For example, this snippet defines the handling of the <input> element in the built-in configurations:
input:
  ruleTypes: [INLINE]
  translatableAttributes:
    alt: [type, NOT_EQUALS, [file, hidden, image, password]]
    value: [type, NOT_EQUALS, [file, hidden, image, password]]
    accesskey: [type, NOT_EQUALS, [file, hidden, image, password]]
    title: [type, NOT_EQUALS, [file, hidden, image, password]]
This specifies that four attributes (alt, value, accesskey, and title) are translatable. The translatability of each of these attributes is conditional on the <input> element not having particular type values.
Condition Syntax
Rule conditions are expressed as a list of the form:
[attribute, operation, value]
attribute | The name of the attribute which the condition applies to.
operation | Available operations are EQUALS, NOT_EQUALS, and MATCHES. EQUALS and NOT_EQUALS test for (case-insensitive) string matches, while MATCHES uses a regular expression.
value | The value of the attribute to be compared using the operation.
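To make the three operations concrete, here is a hedged sketch with one condition of each kind. The elements, attribute values, and the regular expression are assumptions chosen for illustration. In particular, reading a list value with EQUALS as "any of these values" is inferred from the NOT_EQUALS list in the <input> example, and whether a MATCHES pattern must cover the whole attribute value is not stated here.

```yaml
elements:
  # EQUALS: content is translatable only for the listed name values (assumed).
  meta:
    ruleTypes: [ATTRIBUTES_ONLY]
    translatableAttributes:
      content: [name, EQUALS, [keywords, description]]
  # NOT_EQUALS: title is translatable unless type is hidden.
  input:
    ruleTypes: [INLINE]
    translatableAttributes:
      title: [type, NOT_EQUALS, hidden]
  # MATCHES: hypothetical rule excluding spans whose class starts with "notranslate".
  span:
    ruleTypes: [EXCLUDE]
    conditions: [class, MATCHES, 'notranslate.*']
```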
Inline Code Finder
You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables that need to be protected from modification and treated as codes. Use the useCodeFinder and codeFinderRules options for this.
useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"
Note that the regular expression is "\bVAR\d\b", but you must escape each backslash in the double-quoted YAML notation as well.
You can also use this alternate syntax, which is slightly easier to read:
useCodeFinder: true
codeFinderRules: |-
  #v1
  count.i=1
  rule0=\bVAR\d\b
The options above will set the text "VAR1" as inline code in the following HTML:
<p>Number of files = VAR1</p>
To facilitate the creation of code finder rules Rainbow provides the Code Finder Editor.
Character Entity References in Output
By default, extended characters are not output as character entity references (e.g. &copy; for the character '©').
You can change this by specifying the escapeCharacters rule with a string of all the characters you wish to see output as character entity references. Any specified character that is not extended or has no HTML character entity defined is processed like a normal character.
For example, given the following rule:
escapeCharacters: "© €µÆĄ"
The output of <p>© €µÆĄ</p> (assuming the output encoding is UTF-8) will be:
<p>&copy; &euro;&micro;&AElig;Ą</p>
Only the character Ą (U+0104) is not represented as an entity reference because there is no HTML character entity defined for it.
Inline CDATA
For formats that use CDATA in ways that undesirably break the flow of text, you can set the filter to treat CDATA as if it were an inline element:
inlineCdata: true
Then markup such as <p>Text with <![CDATA[inline]]> CDATA</p> will be extracted as if <![CDATA[ were a regular inline opening tag and ]]> were a regular inline closing tag.
Excluding By Default
Normally, there is an implicit "default rule" to include elements. If the filter configuration contained no tag information at all, the default behavior of the filter would be to expose all PCDATA for translation. Sometimes it is useful to change this behavior in order to make your configuration more concise. This can be done by setting the exclude_by_default option in your configuration.
For example, to expose only the <title> element for translation and nothing else, you could specify:
exclude_by_default: true
# ... other configuration
elements:
  title:
    ruleTypes: [TEXTUNIT]
Quote Mode
Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:
quoteModeDefined: true
quoteMode: 3
Current quote modes:
- Do not escape single or double quotes: UNESCAPED = 0
- Escape single and double quotes to a named entity: ALL = 1
- Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
- Escape double quotes only: DOUBLE_QUOTES_ONLY = 3
Miscellaneous Options
- cleanupHtml: false - turns off the post-processing cleanup of the input file. By default, the filter attempts to clean up common syntax errors, such as unquoted attributes.
Limitations
- In the current version of the filter, the content of <style> and <script> elements is not extracted.
- Tags from server-side scripts such as PHP, ASPX, JSP, etc. are not formally supported and will be treated as non-translatable.