Okapi Framework - Steps: Tokenization Step
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=Tokenization_Step
This step creates annotations containing different tokenized forms of the text units.
Takes: Filter events
Sends: Filter events
The following token types are recognized:
| Token Type | Description |
|---|---|
| WORD | A run of characters constituting a word of the given language. |
| NUMBER | Numbers, including any comma or point symbols. |
| WHITESPACE | Whitespace characters as defined by the Unicode Consortium standards. |
| PUNCTUATION | Punctuation characters as defined by the Unicode Consortium standards. |
| DATE | Dates in the MM/DD/YYYY format. |
| TIME | Times whose parts are separated by either : or . (24-hour, or 12-hour with AM or PM). |
| CURRENCY | Sums in US dollars. |
| ABBREVIATION | Abbreviations like pct in 3.3pct, U.S., USD. |
| MARKUP | A run that begins with < and ends with > like in HTML and XML. |
| E-MAIL | E-mail addresses. |
| INTERNET | An Internet address (URI or IP): http://www.somesite.org/foo/index.html, 192.168.0.5. |
| COMPANY | Company names like AT&T, P&G, Johnson&Johnson. |
| EMOTICON | Emoticon sequences like :-). |
| IDEOGRAM | Ideograms as defined by the Unicode Consortium standards. |
| KANA | Hiragana, Katakana (Japanese). |
| STOPWORD | Stop words (frequent words that carry little meaning and are often filtered out by NLP tools). |
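To make the idea of typed tokens concrete, here is a minimal Python sketch that classifies runs of text into a few of the token types listed above. It is not the step's implementation and does not use the Okapi API: the regular expressions, the `TOKEN_PATTERNS` table, and the `tokenize` function are all assumptions made for illustration, and they cover only a subset of the types.

```python
import re

# Illustrative patterns only (assumptions); the step's real rules are not
# described on this page. Order matters: earlier entries win at a position.
TOKEN_PATTERNS = [
    ("DATE", r"\d{2}/\d{2}/\d{4}"),                  # MM/DD/YYYY
    ("E-MAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),                  # digits with commas or points
    ("WORD", r"[^\W\d_]+"),                          # run of letters
    ("WHITESPACE", r"\s+"),
    ("PUNCTUATION", r"[^\w\s]"),
]

# Group names cannot contain '-', so each pattern gets an index-based name.
SCANNER = re.compile(
    "|".join(f"(?P<g{i}>{p})" for i, (_, p) in enumerate(TOKEN_PATTERNS))
)


def tokenize(text):
    """Return (token_type, lexeme) pairs; unmatched characters are skipped."""
    tokens = []
    for match in SCANNER.finditer(text):
        index = int(match.lastgroup[1:])
        tokens.append((TOKEN_PATTERNS[index][0], match.group()))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Due 12/31/2024, cost 1,250.50"):
        print(token)
```

Because the alternation tries patterns in order, a date such as 12/31/2024 is reported as one DATE token rather than as several NUMBER and PUNCTUATION tokens.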
The created token annotations can be retrieved by other steps down the pipeline, or by external tools.
Tokenize source -- Set this option to tokenize the content of the source.
Tokenize targets -- Set this option to tokenize the content of the different selected targets.
Select languages, or specify a locale filter -- Specify which locales should be tokenized. Leave the field empty to select all locales. Press the Select button to choose the languages to tokenize, or type in a locale filter string that follows the syntax illustrated by these examples:
en, en-us, en-us-win
en-*, *-us, *-*-win
e[ns]-.+
en !en-nz
@e[ns]-.+
@e[ns]-.+ ^8
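As a rough illustration of the first two example lines (plain tags and `*` masks), the following Python sketch filters a list of locale tags. It rests on assumptions that this page does not confirm: that `*` stands for exactly one subtag, that a mask matches only tags with the same number of subtags, and that matching is case-insensitive. The function names `mask_matches` and `select_locales` are hypothetical, and the forms using `[...]`, `!`, `@`, and `^` are not handled.

```python
def mask_matches(mask, locale):
    """Return True if each subtag of `mask` is '*' or equals the locale's subtag.

    Assumption: mask and locale must have the same number of subtags.
    """
    mask_parts = mask.lower().split("-")
    locale_parts = locale.lower().split("-")
    if len(mask_parts) != len(locale_parts):
        return False
    return all(m == "*" or m == part for m, part in zip(mask_parts, locale_parts))


def select_locales(filter_string, locales):
    """Keep the locales that match any mask; an empty filter keeps them all."""
    masks = filter_string.replace(",", " ").split()
    if not masks:
        return list(locales)
    return [loc for loc in locales if any(mask_matches(m, loc) for m in masks)]


if __name__ == "__main__":
    available = ["en-us", "fr-fr", "de-de-win", "en"]
    print(select_locales("en-*, *-*-win", available))
    print(select_locales("", available))
```

The empty-filter branch mirrors the documented behavior that an empty field selects all locales; everything else in the sketch should be checked against the actual filter before relying on it.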
Select tokens to extract -- Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the Select button to choose the token types to generate, or list the token types, separated by commas. (The field automatically converts its content to uppercase.)
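The handling of this field can be sketched in Python as follows. This is only an illustration of the documented behavior (comma-separated names, uppercase conversion, empty meaning all types); the `parse_token_types` function is hypothetical, and rejecting unknown names is an assumption, since this page does not say how the step treats them.

```python
# Token types as listed in the table on this page.
TOKEN_TYPES = [
    "WORD", "NUMBER", "WHITESPACE", "PUNCTUATION", "DATE", "TIME",
    "CURRENCY", "ABBREVIATION", "MARKUP", "E-MAIL", "INTERNET",
    "COMPANY", "EMOTICON", "IDEOGRAM", "KANA", "STOPWORD",
]


def parse_token_types(field):
    """Turn the field's text into a list of uppercase token type names."""
    names = [name.strip().upper() for name in field.split(",") if name.strip()]
    if not names:
        # An empty field selects every available token type.
        return list(TOKEN_TYPES)
    unknown = [name for name in names if name not in TOKEN_TYPES]
    if unknown:
        # Assumption: unknown names are treated as an error in this sketch.
        raise ValueError(f"Unknown token type(s): {', '.join(unknown)}")
    return names


if __name__ == "__main__":
    print(parse_token_types("word, number , e-mail"))
    print(len(parse_token_types("")))
```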