HTML5-ITS Filter

From Okapi Framework
Revision as of 19:12, 4 June 2016 by Ysavourel (talk | contribs) (1 revision imported)


This filter allows you to process HTML5 documents. Those documents may contain ITS 2.0 markup.

The input documents are expected to be valid HTML5.

Processing Details

Input Encoding

The filter determines the encoding of the input document using the encoding detection mechanism defined by the HTML5 specification.
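To give a feel for what such detection involves, here is a much simplified sketch of two steps of the HTML5 algorithm: checking for a byte order mark, then scanning the first 1024 bytes for a meta charset declaration. The function name sniff_encoding and the fallback parameter are hypothetical, the regular expression is a shortcut rather than the prescan defined by the specification, and nothing here is taken from the filter's own code.

```python
import re

# Byte order marks are checked first; a BOM takes precedence over any declaration.
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Simplified pattern covering <meta charset="..."> and the http-equiv form.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Guess the encoding of an HTML5 document (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Only the start of the document is inspected for a declaration.
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Assumption: fall back to a caller-supplied default when nothing is found.
    return fallback
```

The full algorithm has more steps (for example, information from the transport layer), so treat this only as an illustration of the general idea.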

Output Encoding

The output encoding is the same as the input encoding, except if defined otherwise by the calling tool.


The output uses the same type of line break as the original input.

Quote Mode

Escaping of quote and apostrophe (single quote) characters can be changed by adding these lines to the config file:


Current quote modes:

  • Do not escape single or double quotes: UNESCAPED = 0
  • Escape single and double quotes to a named entity: ALL = 1
  • Escape double quotes to a named entity, and single quotes to a numeric entity: NUMERIC_SINGLE_QUOTES = 2
  • Escape double quotes only: DOUBLE_QUOTES_ONLY = 3
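
The effect of the four modes listed above can be sketched as follows. This is an illustration only, not the filter's code: the QuoteMode enum and escape_quotes function are hypothetical, and the exact entities written (&quot;, &apos; and &#39;) are assumptions about what "named" and "numeric" entity mean here.

```python
from enum import IntEnum


class QuoteMode(IntEnum):
    """Names and values as listed in the documentation."""
    UNESCAPED = 0
    ALL = 1
    NUMERIC_SINGLE_QUOTES = 2
    DOUBLE_QUOTES_ONLY = 3


def escape_quotes(text: str, mode: QuoteMode) -> str:
    """Escape quotes according to the given mode (hypothetical helper)."""
    if mode == QuoteMode.UNESCAPED:
        return text
    # Modes 1, 2 and 3 all escape the double quote to a named entity.
    out = text.replace('"', "&quot;")
    if mode == QuoteMode.ALL:
        # Assumption: the named entity for the apostrophe is &apos;.
        out = out.replace("'", "&apos;")
    elif mode == QuoteMode.NUMERIC_SINGLE_QUOTES:
        # Assumption: the numeric entity for the apostrophe is &#39;.
        out = out.replace("'", "&#39;")
    return out
```

For example, calling escape_quotes with QuoteMode.DOUBLE_QUOTES_ONLY leaves apostrophes untouched while still escaping double quotes.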


ITS Support

By default, the filter processes HTML5 documents based on the ITS defaults. That is:

  • The lang attribute is used as the local markup for the Language Information data category.
  • The id attribute is used as the local markup for the Id Value data category.
  • Most of the phrasing content elements are interpreted as withinText="yes" for the Element Within Text data category.
  • The translate attribute is used as the local markup for the Translate data category, and the behavior for that data category is different from the one in XML. See the HTML5 definition for details.
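
As a rough illustration of how the lang and translate attributes act as local markup, the sketch below walks a small HTML5 fragment and records, for each run of text, the language and translate state it inherits. The class LocalMarkupScanner and the function scan are hypothetical names, the sketch assumes well-formed input with explicit end tags, and it does not reproduce the filter's actual logic (for instance, it ignores translatable attributes and the id attribute).

```python
from html.parser import HTMLParser

# Elements with no end tag; they never enclose text, so they are not tracked.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class LocalMarkupScanner(HTMLParser):
    """Collect (text, lang, translate) tuples from an HTML5 fragment."""

    def __init__(self):
        super().__init__()
        # Each entry is (tag, lang, translate); the first is the document default.
        self.stack = [(None, None, True)]
        self.segments = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        _, lang, translate = self.stack[-1]
        lang = attributes.get("lang", lang)
        if "translate" in attributes:
            value = (attributes["translate"] or "").strip().lower()
            if value in ("", "yes"):
                translate = True
            elif value == "no":
                translate = False
            # Any other value is invalid and keeps the inherited state.
        if tag not in VOID_ELEMENTS:
            self.stack.append((tag, lang, translate))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            _, lang, translate = self.stack[-1]
            self.segments.append((text, lang, translate))


def scan(html):
    scanner = LocalMarkupScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.segments
```

Calling scan on a paragraph that contains a code element marked translate="no" yields separate entries for the surrounding text and for the protected content, each with the language inherited from the nearest lang attribute.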

Default behavior can be overridden when the input document contains ITS markup, or if a filter parameters file is specified. The parameters file used by the filter is an ITS document.
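
To make the idea of an ITS parameters file more concrete, the sketch below embeds a small ITS 2.0 global rules document in a string and lists the rules it contains. The two rules and their selectors are made up for illustration, and the use of the h prefix bound to the XHTML namespace follows the examples in the ITS 2.0 specification; whether a given selector matches in this filter is an assumption to verify against your own documents.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical parameters file: do not translate <code>, treat <dfn> as a term.
PARAMETERS = """<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:h="http://www.w3.org/1999/xhtml"
           version="2.0">
  <its:translateRule selector="//h:code" translate="no"/>
  <its:termRule selector="//h:dfn" term="yes"/>
</its:rules>
"""


def list_rules(its_document: str):
    """Return (rule name, attributes) pairs for each global rule in the document."""
    root = ET.fromstring(its_document)
    if root.tag != f"{{{ITS_NS}}}rules":
        raise ValueError("not an ITS rules document")
    # Strip the namespace part of each tag to get the bare rule name.
    return [(child.tag.split("}", 1)[1], dict(child.attrib)) for child in root]


rules = list_rules(PARAMETERS)
```

Saving the content of PARAMETERS to a file gives a document of the kind that could be passed as the filter parameters.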

The Internationalization Tag Set (ITS) is a W3C specification that defines a set of elements and attributes you can use to specify different internationalization- and localization-related aspects of your XML and HTML5 documents. For instance, ITS allows you to define which elements should be treated as a nested sub-flow of text, which elements denote a term, how to identify the language of the content, and much more.

The filter supports ITS 2.0.

See the "ITS" page for more details on the format.

The filter supports global and local rules and most data categories. See the ITS Components page for a detailed list of how the data categories are supported and other information on the implementation.


  • This filter is BETA.