Okapi Framework - Steps

Tokenization Step

- Overview
- Parameters

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=Tokenization_Step

Overview

This step creates annotations containing different tokenized forms of the text units.

Takes: Filter events
Sends: Filter events

The following token types are recognized:

Token Type    Description
WORD          A run of characters constituting a word in the given language.
NUMBER        Numbers, including any comma or point symbols.
WHITESPACE    Whitespace characters as defined by the Unicode Consortium standards.
PUNCTUATION   Punctuation characters as defined by the Unicode Consortium standards.
DATE          Dates in the MM/DD/YYYY format.
TIME          Times with fields separated by either : or . (24-hour, or 12-hour with AM or PM).
CURRENCY      Sums in US dollars.
ABBREVIATION  Abbreviations like pct in 3.3pct, U.S., USD.
MARKUP        A run that begins with < and ends with >, as in HTML and XML.
E-MAIL        E-mail addresses.
INTERNET      An Internet address (URI or IP): http://www.somesite.org/foo/index.html, 192.168.0.5.
COMPANY       Company names like AT&T, P&G, Johnson&Johnson.
EMOTICON      Emoticon sequences like :-).
IDEOGRAM      Ideograms as defined by the Unicode Consortium standards.
KANA          Hiragana, Katakana (Japanese).
STOPWORD      Stop words (frequent words of little significance, often filtered out by NLP tools).
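
To make a few of these token types concrete, here is a minimal sketch of a regex-based classifier for DATE, TIME, NUMBER, WORD, WHITESPACE and PUNCTUATION. It is not the step's actual implementation and does not use the Okapi API; the names TOKEN_PATTERNS and tokenize are hypothetical, the patterns are simplified assumptions, and TIME is matched with the : separator only, so that decimals such as 3.14 are not mistaken for times.

```python
import re

# Illustrative patterns only (assumptions, not Okapi's rules).
# Order matters: at each position the first alternative that matches wins,
# so the more specific types (DATE, TIME) come before NUMBER.
TOKEN_PATTERNS = [
    ("DATE", r"\d{2}/\d{2}/\d{4}"),                    # MM/DD/YYYY
    ("TIME", r"\d{1,2}:\d{2}(?:\s?(?:AM|PM))?"),       # colon separator only
    ("NUMBER", r"\d+(?:[.,]\d+)*"),                    # digits with commas or points
    ("WORD", r"[^\W\d_]+"),                            # a run of letters
    ("WHITESPACE", r"\s+"),
    ("PUNCTUATION", r"[^\w\s]|_"),
]

# One alternation with a named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Return a list of (token_type, lexeme) pairs found in text."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


if __name__ == "__main__":
    for token_type, lexeme in tokenize("Due 12/31/2024 at 9:30 PM, total 1,250.75."):
        if token_type != "WHITESPACE":
            print(token_type, repr(lexeme))
```

A real tokenizer for the full list above would also need language-specific word rules and lookups (for example for STOPWORD and COMPANY), which a handful of regular expressions cannot cover.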

The created token annotations can be retrieved by other steps down the pipeline, or by external tools.

Parameters

Tokenize source -- Set this option to tokenize the content of the source.

Tokenize targets -- Set this option to tokenize the content of the selected targets.

Select languages, or specify a locale filter -- Specify which locales should be tokenized. Leave the field empty to select all locales. Press the Select button to choose the languages to tokenize, or type in a locale filter string following these syntactic rules:

Select tokens to extract -- Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the Select button to choose the token types to generate, or type a list of token types separated by commas. (The field automatically converts the text to uppercase.)
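
The behavior of this field (comma-separated list, automatic uppercasing, empty meaning all types) can be sketched as follows. This is only an illustration under stated assumptions, not the step's own code: KNOWN_TYPES and parse_token_types are hypothetical names, the type names are copied from the table in the Overview, and rejecting unknown names with an error is an assumed policy.

```python
# Token type names as listed in the Overview table (assumed to be the
# exact spellings accepted by the field).
KNOWN_TYPES = [
    "WORD", "NUMBER", "WHITESPACE", "PUNCTUATION", "DATE", "TIME",
    "CURRENCY", "ABBREVIATION", "MARKUP", "E-MAIL", "INTERNET",
    "COMPANY", "EMOTICON", "IDEOGRAM", "KANA", "STOPWORD",
]


def parse_token_types(field):
    """Turn the text of the field into a list of token type names."""
    # Split on commas, trim spaces, uppercase, and drop empty entries.
    names = [part.strip().upper() for part in field.split(",") if part.strip()]
    if not names:
        # An empty field selects every available token type.
        return list(KNOWN_TYPES)
    unknown = [name for name in names if name not in KNOWN_TYPES]
    if unknown:
        # Assumed policy: fail loudly instead of silently ignoring typos.
        raise ValueError("Unknown token type(s): " + ", ".join(unknown))
    return names


if __name__ == "__main__":
    print(parse_token_types("word, number,date"))
    print(len(parse_token_types("")))
```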