HTML Filter

Overview

The HTML Filter is an Okapi component that implements the IFilter interface for HTML and XHTML documents.

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the document has an encoding declaration, that encoding is used.
  • Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.

If the input file has no declared encoding, the filter tries to add one in the output: a <meta> tag for HTML files, or a <meta /> tag for XHTML files. The declaration is added only if the file contains a <head> element.

Line-Breaks

The output uses the same type of line-break as the original input.

Entities

Character and numeric entities are converted to Unicode. Entities defined in a DTD or schema are passed through without change.

Note that text entity declarations can be processed by the DTD Filter.

Parameters

Built-in Configuration

The HTML filter does not currently have a user interface to modify its configuration files. By default the HTML filter uses a minimalist configuration file that does not create structural groupings. For example, a table group or list group will never be created.

There is a predefined maximalist configuration (okf_html-wellFormed) that can be used if structural groupings are needed. The caveat is that any structural tags that map to groups must be well formed, that is, they must have both a start tag and an end tag. Otherwise the filter returns an error.

HTML Configuration Syntax

For the truly brave, you can create your own HTML configuration files. These configurations are written in YAML. See the wellformedConfiguration.yml and nonwellformedConfiguration.yml for examples.

HTML tags are associated with rules. These rules are used by the filter to process the input document.

Notes:

  • All attribute and element names should be in lowercase in the configuration file, regardless of their casing in the document.
  • Elements or attributes with a prefix should be declared with the prefix, and between single quotes, in the configuration (e.g. 'xml:lang').
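
The two notes above can be illustrated with a small fragment. This is only a sketch: the choice of elements and attributes is an assumption made for illustration (including 'xml:id' as an ID attribute), not a copy of the built-in configurations.

```yaml
elements:
  # Written in lowercase even if the document uses <IMG> or <Img>.
  img:
    ruleTypes: [INLINE]
    translatableAttributes: [alt, title]
  p:
    ruleTypes: [TEXTUNIT]
    # A prefixed attribute keeps its prefix and is placed between single quotes.
    idAttributes: [id, 'xml:id']
```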

Configuring Element Rules

The elements section of the configuration consists of a set of key-value pairs. Each key is an element name, and the value is the rules for that element, represented as another set of key-value pairs. An element declaration should include one or more of the available element rules:

  • ruleTypes: Basic description of how the filter treats this tag. See Rule Types.
  • idAttributes: A list containing attributes which may provide the segment ID for text contained within this element.
  • conditions: A condition that further restricts this rule. For example, to indicate that the element should only be handled if it contains an attribute with a certain value. See Condition Syntax.
  • translatableAttributes: Contains information about translatable attributes in this element. See Configuring Translatable Attributes.
  • elementType: Indicates the corresponding XLIFF 1.2 type value for this element.
  • writableLocalizationAttributes: Specifies attributes which are writable, but not translatable. (TODO)
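
As a rough sketch of how several of these rules might be combined in one declaration, the fragment below describes a link element and a heading. The specific attribute choices and the elementType value (link, one of the XLIFF 1.2 inline type values) are assumptions for illustration; check the shipped configuration files for the actual definitions.

```yaml
elements:
  a:
    ruleTypes: [INLINE]
    # Assumed mapping to the XLIFF 1.2 "link" type.
    elementType: link
    # The title text is exposed for translation.
    translatableAttributes: [title]
    # The href can be changed during localization but is not translated.
    writableLocalizationAttributes: [href]
  h1:
    ruleTypes: [TEXTUNIT]
    # The id attribute, when present, may provide the segment ID.
    idAttributes: [id]
```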

Rule Types

The rule types are the following:

  • INLINE: A tag which may occur inside a text run. For example <b>, <i>, and <u>.
  • GROUP: Defines a group of elements that are structurally bound. For example <table>, <div> and <menu>.
  • EXCLUDE: Prevents extraction of any text until the end tag of the same element is found. For example, if the content of a <script> element should not be extracted, define <script> as EXCLUDE.
  • INCLUDE: Overrides any current exclusions. This allows exceptions for children of EXCLUDEd elements.
  • TEXTUNIT: A tag that starts a complex text unit. Examples include <p>, <title>, <h1>. Complex text units carry their surrounding tags along with any extracted text.
  • PRESERVE_WHITESPACE: A tag that must preserve its white spaces as-is. For example <pre>.
  • ATTRIBUTES_ONLY: A tag that has localizable or translatable attributes but does not have translatable content.
  • ATTRIBUTE_TRANS: A translatable attribute.
  • ATTRIBUTE_WRITABLE: A writable or modifiable attribute, but not translatable.
  • ATTRIBUTE_READONLY: A read-only attribute, extracted but that cannot be modified.
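
A minimal sketch of an elements section that assigns one rule type to each of the example tags mentioned above. It is meant only to show the shape of the declarations; the built-in configurations may declare these elements differently.

```yaml
elements:
  b:
    ruleTypes: [INLINE]               # may occur inside a text run
  table:
    ruleTypes: [GROUP]                # structural grouping
  script:
    ruleTypes: [EXCLUDE]              # nothing is extracted until </script>
  title:
    ruleTypes: [TEXTUNIT]             # starts a complex text unit
  pre:
    ruleTypes: [PRESERVE_WHITESPACE]  # white space kept as-is
```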

Configuring Translatable Attributes

Translatable attributes may be specified in two ways, depending on the level of complexity needed.

If all the specified attributes should always be translated, they can be exposed as a simple list. For example, the definition for the <area> element specifies that accesskey, area, and alt attributes are translatable:

  area:
    ruleTypes: [ATTRIBUTES_ONLY]
    translatableAttributes: [accesskey, area, alt]

However, if some attributes are translatable only under certain conditions, the translatableAttributes rule may be specified as a set of key-value pairs, with each key being a translatable attribute and each value being an (optional) list of conditions, written as described under Condition Syntax. For example, this snippet defines the handling of the <input> element in the built-in configurations:

  input:
    ruleTypes: [INLINE]
    translatableAttributes:
      alt: [type, NOT_EQUALS, [file, hidden, image, password]]
      value: [type, NOT_EQUALS, [file, hidden, image, password]]
      accesskey: [type, NOT_EQUALS, [file, hidden, image, password]]
      title: [type, NOT_EQUALS, [file, hidden, image, password]]

This specifies that there are four attributes (alt, value, accesskey, and title) that are translatable. The translatability of each of these attributes is conditional on the <input> element not having particular type values.

Condition Syntax

Rule conditions are expressed as a list of the form:

  [attribute, operation, value]

  • attribute: The name of the attribute to which the condition applies.
  • operation: Available operations are EQUALS, NOT_EQUALS, and MATCHES. EQUALS and NOT_EQUALS test for (case-insensitive) string matches, while MATCHES uses a regular expression.
  • value: The value of the attribute to be compared using the operation.
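
The sketch below shows conditions attached to element rules rather than to translatable attributes. Both rules are hypothetical: the class values are invented for illustration, and how MATCHES anchors its regular expression is not stated here, so the pattern is written to cover the whole attribute value.

```yaml
elements:
  # Hypothetical: exclude the content of any <div> whose class is "notranslate".
  div:
    ruleTypes: [EXCLUDE]
    conditions: [class, EQUALS, notranslate]
  # Hypothetical: make an exception for a <span> whose class matches the pattern.
  span:
    ruleTypes: [INCLUDE]
    conditions: [class, MATCHES, 'ui-.*']
```

As in the <input> example earlier, the value may also be a list of alternatives, for instance [type, NOT_EQUALS, [file, hidden]].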

Inline Code Finder

You can define a set of regular expressions to capture spans of extracted text that should be treated as inline codes. For example, some element content may have variables that need to be protected from modification and treated as codes. Use the useCodeFinder and codeFinderRules options for this.

useCodeFinder: true
codeFinderRules: "#v1\ncount.i=1\nrule0=\\bVAR\\d\\b"

Note that the regular expression is "\bVAR\d\b" but you must escape the backslash in the YAML notation as well.

You can also use this alternate syntax, which is slightly easier to read:

useCodeFinder: true
codeFinderRules: |-
   #v1
   count.i=1
   rule0=\bVAR\d\b

The options above will set the text "VAR1" as in-line code in the following HTML:

<p>Number of files = VAR1</p>
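
A sketch of how more than one rule might be declared, assuming that count.i holds the number of rules and that rules are numbered consecutively from rule0. That reading is inferred from the single-rule example above and is not confirmed here; the second pattern is a hypothetical one for placeholders such as {0}.

```yaml
useCodeFinder: true
codeFinderRules: |-
   #v1
   count.i=2
   rule0=\bVAR\d\b
   rule1=\{\d+\}
```

Under those assumptions, both "VAR1" and "{0}" in <p>Copied VAR1 of {0} files</p> should be treated as inline codes.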

To facilitate the creation of code finder rules Rainbow provides the Code Finder Editor.

Character Entity References in Output

By default, extended characters are not written as character entity references in the output (e.g. &copy; for the character '©').

You can change this by specifying the escapeCharacters rule with a string of all the characters you wish to see output as character entity references. Any specified character that is not extended or has no HTML character entity defined is processed like a normal character.

For example, given the following rule:

escapeCharacters: "© €µÆĄ"

The output of <p>© €µÆĄ</p> (assuming the output encoding is UTF-8) will be:

<p>&copy;&nbsp;&euro;&micro;&AElig;Ą</p> 

Only the character Ą (U+0104) is not represented as an entity reference because there is no HTML character entity defined for it.

Inline CDATA

For formats that use CDATA in ways that undesirably break the flow of text, you can set the filter to treat CDATA as if it were an inline element, like so:

 inlineCdata: true

Then markup such as <p>Text with <![CDATA[inline]]> CDATA</p> will be extracted as if <![CDATA[ was a regular inline opening tag and ]]> was a regular inline closing tag.

Excluding By Default

Normally, there is an implicit "default rule" to include elements. If the filter configuration contained no tag information at all, the default behavior of the filter would be to expose all PCDATA for translation. Sometimes it is useful to change this behavior in order to make your configuration more concise. This can be done by setting the exclude_by_default option in your config.

For example, if you wanted a custom configuration that exposed the <title> element for translation but nothing else, you could specify it as:

exclude_by_default: true
# .... other configuration
elements:
   title:
     ruleTypes: [TEXTUNIT]

Quote Mode

Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:

quoteModeDefined: true
quoteMode: 3

Current quote modes:

  • Do not escape single or double quotes: UNESCAPED = 0
  • Escape single and double quotes to a named entity: ALL = 1
  • Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
  • Escape double quotes only: DOUBLE_QUOTES_ONLY = 3

Miscellaneous Options

  • cleanupHtml: false - turns off the post-processing cleanup of the input file. By default, the filter attempts to clean up common syntax errors such as unquoted attributes; setting this option to false disables that behavior.


Limitations

  • In the current version of the filter the content of <style> and <script> elements is not extracted.
  • Tags from server-side scripts such as PHP, ASPX, JSP, etc. are not formally supported and will be treated as non-translatable.