Tokenization Step

From Okapi Framework

Overview

This step creates annotations containing different tokenized forms of the text units.

Takes: Filter events. Sends: Filter events.

The following token types are recognized:

Token Type     Description
WORD           A run of characters constituting a word of the given language.
NUMBER         Numbers, including any comma or point symbols.
WHITESPACE     Whitespace characters as defined by the Unicode Consortium standards.
PUNCTUATION    Punctuation characters as defined by the Unicode Consortium standards.
DATE           Dates in the format MM/DD/YYYY.
TIME           Times separated by either ":" or "." (24-hour, or 12-hour with AM or PM).
CURRENCY       Sums in US dollars.
ABBREVIATION   Abbreviations like pct in 3.3pct, U.S., USD.
MARKUP         A run that begins with "<" and ends with ">", as in HTML and XML.
E-MAIL         E-mail addresses.
INTERNET       An Internet address (URI or IP): http://www.somesite.org/foo/index.html, 192.168.0.5.
COMPANY        Company names like AT&T, P&G, Johnson&Johnson.
EMOTICON       Emoticon sequences like ":-)".
IDEOGRAM       Ideograms as defined by the Unicode Consortium standards.
KANA           Hiragana, Katakana (Japanese).
STOPWORD       Stop words (frequent words of little significance, often filtered out by NLP tools).
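
To make the token types above more concrete, here is a minimal Python sketch of a rule-based tokenizer that covers a handful of them. It is not the implementation used by this step: the patterns, the TOKEN_PATTERNS table, the tokenize function, and the UNKNOWN fallback are all hypothetical simplifications, and types such as TIME, CURRENCY, or STOPWORD are left out on purpose.

```python
import re

# Hypothetical, simplified patterns for a few of the token types listed above.
# Order matters: more specific patterns are tried before more general ones.
TOKEN_PATTERNS = [
    ("DATE", re.compile(r"\d{2}/\d{2}/\d{4}")),               # MM/DD/YYYY
    ("E-MAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)*")),               # digits with , or .
    ("WORD", re.compile(r"[^\W\d_]+")),                       # letters only
    ("WHITESPACE", re.compile(r"\s+")),
    ("PUNCTUATION", re.compile(r"[^\w\s]|_")),
]


def tokenize(text):
    """Return a list of (token_type, value) pairs covering the whole text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # Fallback for characters not covered by any pattern (assumption).
            tokens.append(("UNKNOWN", text[pos]))
            pos += 1
    return tokens


sample = "Mail bob@example.com on 01/31/2024, 3.5 pct."
for token_type, value in tokenize(sample):
    if token_type != "WHITESPACE":
        print(token_type, repr(value))
```

Trying the specific patterns first is what keeps 01/31/2024 from being split into three NUMBER tokens; a real tokenizer would also need language-aware word segmentation rather than a single letter class.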

Parameters

Tokenize source — Set this option to tokenize the content of the source.

Tokenize targets — Set this option to tokenize the content of the different selected targets.

Select languages, or specify a locale filter — Specify what locales should be tokenized. Leave the field empty to select all locales. Press the Select button to choose what languages should be tokenized, or type in a locale filter string following these syntactic rules:

  • The locale filter string can list one or more locale descriptors.
  • Locale descriptors should be separated with a space or comma.
  • The locale tag has one to three fields: language, region, and user data, separated with a dash. Example:
en, en-us, en-us-win
  • The mask contains the * wildcard in any or all of the three fields. Examples:
en-*, *-us, *-*-win
  • The regular expression (in Java format) describes a set of allowed locale tags. Example: this regular expression allows the en-* and es-* masks, and en-us, en-gb, es-us locale tags.
e[ns]-.+
  • To specify that the given locale descriptor should be excluded, prepend it with !. Example: all English locales except English-New Zealand.
en !en-nz
  • A regular expression should be prepended with @. Example:
@e[ns]-.+
  • To provide flags for a regular expression, prepend the decimal flags value with ^. Example:
@e[ns]-.+ ^8
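
The rules above can be sketched as a small matcher. This is only an illustration in Python, not the parser used by the step, and it rests on several assumptions that the text does not confirm: fields missing from a tag or mask are treated as wildcards (inferred from the "en !en-nz" example), tags and masks are compared case-insensitively, a regular expression must match the whole locale tag, an empty include list means "all locales", and the decimal flags are java.util.regex.Pattern bit values, of which only CASE_INSENSITIVE (2), MULTILINE (8), and DOTALL (32) are mapped. The names parse_locale_filter, _mask_matcher, and _java_flags are hypothetical.

```python
import re

# Assumed mapping from java.util.regex.Pattern flag bits to Python re flags.
JAVA_TO_PY_FLAGS = {2: re.IGNORECASE, 8: re.MULTILINE, 32: re.DOTALL}


def _java_flags(value):
    """Translate a decimal Java flags value into Python re flags (partial)."""
    flags = 0
    for bit, py_flag in JAVA_TO_PY_FLAGS.items():
        if value & bit:
            flags |= py_flag
    return flags


def _mask_matcher(mask):
    """Build a matcher for a locale tag or mask such as en, en-us, *-*-win."""
    # Pad to three fields; missing fields are assumed to match anything.
    fields = (mask.lower().split("-") + ["*"] * 3)[:3]

    def matches(tag):
        parts = (tag.lower().split("-") + [""] * 3)[:3]
        return all(f == "*" or f == p for f, p in zip(fields, parts))

    return matches


def parse_locale_filter(text):
    """Return a function telling whether a locale tag passes the filter."""
    includes, excludes = [], []
    tokens = [t for t in re.split(r"[\s,]+", text.strip()) if t]
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        target = includes
        if token.startswith("!"):          # exclusion descriptor
            target, token = excludes, token[1:]
        if token.startswith("@"):          # regular expression descriptor
            flags = 0
            # An optional "^<decimal>" token after the regex carries its flags.
            if i < len(tokens) and tokens[i].startswith("^"):
                flags = _java_flags(int(tokens[i][1:]))
                i += 1
            pattern = re.compile(token[1:], flags)
            target.append(lambda tag, p=pattern: p.fullmatch(tag) is not None)
        else:                              # plain locale tag or mask
            target.append(_mask_matcher(token))

    def accepts(tag):
        included = not includes or any(m(tag) for m in includes)
        return included and not any(m(tag) for m in excludes)

    return accepts


english_except_nz = parse_locale_filter("en !en-nz")
print(english_except_nz("en-us"), english_except_nz("en-nz"))
```

Because descriptors are split on spaces and commas, this sketch cannot handle a regular expression that itself contains either character.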

Select tokens to extract — Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the Select button to choose the token types to generate, or type the token types as a comma-separated list. (The field automatically converts the text to uppercase.)
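
As an illustration of how such a field could be interpreted, the following Python sketch normalizes a comma-separated list of token types. The function parse_token_types and the ALL_TOKEN_TYPES tuple are hypothetical names, and rejecting unknown names with an error is an assumption of this sketch; the step itself may handle them differently.

```python
# Token type names as listed in the table of recognized token types.
ALL_TOKEN_TYPES = (
    "WORD", "NUMBER", "WHITESPACE", "PUNCTUATION", "DATE", "TIME",
    "CURRENCY", "ABBREVIATION", "MARKUP", "E-MAIL", "INTERNET",
    "COMPANY", "EMOTICON", "IDEOGRAM", "KANA", "STOPWORD",
)


def parse_token_types(field):
    """Turn the text of the field into a list of token type names."""
    # Split on commas, trim spaces, and uppercase, as the field does.
    names = [name.strip().upper() for name in field.split(",") if name.strip()]
    if not names:
        # An empty field selects every available token type.
        return list(ALL_TOKEN_TYPES)
    unknown = [name for name in names if name not in ALL_TOKEN_TYPES]
    if unknown:
        # Assumption: unknown names are reported rather than silently ignored.
        raise ValueError("Unknown token types: " + ", ".join(unknown))
    return names


print(parse_token_types("word, number"))
print(len(parse_token_types("")))
```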

Limitations

None known.