Tokenization Step

Overview

This step creates annotations containing different tokenized forms of the text units.

Takes: Filter events. Sends: Filter events.

The following token types are recognized:


Token Type	Description
`WORD`	A run of characters constituting a word of the given language.
`NUMBER`	Numbers, including any comma or period separators.
`WHITESPACE`	Whitespace characters as defined by the Unicode Consortium standards.
`PUNCTUATION`	Punctuation characters as defined by the Unicode Consortium standards.
`DATE`	Dates in the format MM/DD/YYYY.
`TIME`	Times whose fields are separated by either "`:`" or "`.`" (24-hour, or 12-hour with AM or PM).
`CURRENCY`	Sums in US dollars.
`ABBREVIATION`	Abbreviations such as pct (in 3.3pct), U.S., and USD.
`MARKUP`	A run that begins with "`<`" and ends with "`>`" like in HTML and XML.
`E-MAIL`	E-mail addresses.
`INTERNET`	An Internet address (URI or IP): `http://www.somesite.org/foo/index.html`, `192.168.0.5`.
`COMPANY`	Company names like AT&T, P&G, Johnson&Johnson.
`EMOTICON`	Emoticon sequences like "`:-)`".
`IDEOGRAM`	Ideograms as defined by the Unicode Consortium standards.
`KANA`	Hiragana, Katakana (Japanese).
`STOPWORD`	Stop words (frequent words that carry little meaning and are often filtered out by NLP tools).
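
To give a feel for how a few of these token types differ, here is a minimal sketch of a regex-based tokenizer. It is not the step's actual implementation: the patterns, the `TOKEN_PATTERNS` table, and the `tokenize` function are hypothetical, cover only six of the types, and simplify `TIME` to the "`:`" separator so that it does not collide with decimal numbers.

```python
import re

# Hypothetical patterns for illustration only; the step's real rules are not
# given in this document. Order matters: specific types are tried before
# generic ones, so "03/15/2024" becomes one DATE rather than three NUMBERs.
TOKEN_PATTERNS = [
    ("DATE", r"\d{2}/\d{2}/\d{4}"),              # MM/DD/YYYY
    ("TIME", r"\d{1,2}:\d{2}(?:\s?[AP]M)?"),     # simplified: ":" separator only
    ("NUMBER", r"\d+(?:[.,]\d+)*"),              # digits with comma/period groups
    ("WORD", r"[^\W\d_]+"),                      # run of letters
    ("WHITESPACE", r"\s+"),
    ("PUNCTUATION", r"[^\w\s]|_"),               # any single remaining character
]

# One alternation with a named group per token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS))


def tokenize(text):
    """Return a list of (token_type, token_text) pairs covering the input."""
    return [(match.lastgroup, match.group()) for match in MASTER.finditer(text)]


if __name__ == "__main__":
    for token_type, token_text in tokenize("Due 03/15/2024 at 9:30 PM, cost 1,234.50."):
        if token_type != "WHITESPACE":
            print(token_type, repr(token_text))
```

Trying the specific patterns first is one simple way to resolve overlaps between types; the real step may resolve them differently.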

Parameters

Tokenize source — Set this option to tokenize the content of the source.

Tokenize targets — Set this option to tokenize the content of the different selected targets.

Select languages, or specify a locale filter — Specify what locales should be tokenized. Leave the field empty to select all locales. Press the Select button to choose what languages should be tokenized, or type in a locale filter string following these syntactic rules:

The locale filter string can list one or more locale descriptors.

Locale descriptors should be separated with a space or comma.

A locale descriptor represents a locale tag, mask, or regular expression.

The locale tag has one to three fields: language, region, and user data, separated with a dash. Examples:

en, en-us, en-us-win

The mask contains the * wildcard in any or all of the three fields. Examples:

en-*, *-us, *-*-win

The regular expression (in Java format) describes a set of allowed locale tags. Example: the following regular expression matches the locale tags covered by the en-* and es-* masks, such as en-us, en-gb, and es-us.

e[ns]-.+

To specify that the given locale descriptor should be excluded, prepend it with !. Example: all English locales except English-New Zealand.

en !en-nz

A regular expression should be prepended with @. Example:

@e[ns]-.+

To provide flags for a regular expression, follow the expression with the decimal flags value prefixed with ^. Example:

@e[ns]-.+ ^8
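
The rules above can be sketched as a small matcher. This is an illustration under stated assumptions, not the step's real code. The names `parse_descriptors` and `locale_allowed` are hypothetical. Comparison is case-insensitive. Fields omitted from a tag or mask are unconstrained, and `*` matches any value. A regular expression must match the whole locale tag. A locale is accepted when no exclusion matches and either some inclusion matches or no inclusion is listed. Because the regular expressions are in Java format, the sketch translates a few `java.util.regex.Pattern` flag values (2 is `CASE_INSENSITIVE`, 8 is `MULTILINE`) to their Python counterparts.

```python
import re

# Assumed mapping from java.util.regex.Pattern flag values to Python flags (subset).
JAVA_TO_PY_FLAGS = {2: re.IGNORECASE, 4: re.VERBOSE, 8: re.MULTILINE, 32: re.DOTALL}


def parse_descriptors(filter_text):
    """Split a filter string into [excluded, kind, body, java_flags] entries.

    Limitation of this sketch: a regex containing a space or comma is split apart.
    """
    entries = []
    for part in re.split(r"[,\s]+", filter_text.strip()):
        if not part:
            continue
        if part.startswith("^") and entries and entries[-1][1] == "regex":
            entries[-1][3] = int(part[1:])  # flags belong to the preceding regex
            continue
        excluded = part.startswith("!")
        if excluded:
            part = part[1:]
        if part.startswith("@"):
            entries.append([excluded, "regex", part[1:], 0])
        else:
            entries.append([excluded, "fields", part.lower(), 0])
    return entries


def _matches(entry, locale):
    _, kind, body, java_flags = entry
    locale = locale.lower()
    if kind == "regex":
        py_flags = 0
        for java_value, py_flag in JAVA_TO_PY_FLAGS.items():
            if java_flags & java_value:
                py_flags |= py_flag
        return re.fullmatch(body, locale, py_flags) is not None
    actual = locale.split("-")
    for index, wanted in enumerate(body.split("-")):
        if wanted == "*":
            continue  # wildcard: any value accepted in this field
        if index >= len(actual) or actual[index] != wanted:
            return False
    return True


def locale_allowed(filter_text, locale):
    """Return True if the locale passes the filter; an empty filter allows all."""
    entries = parse_descriptors(filter_text)
    includes = [e for e in entries if not e[0]]
    excludes = [e for e in entries if e[0]]
    if any(_matches(e, locale) for e in excludes):
        return False
    return not includes or any(_matches(e, locale) for e in includes)


if __name__ == "__main__":
    print(locale_allowed("en !en-nz", "en-us"))   # True
    print(locale_allowed("en !en-nz", "en-nz"))   # False
    print(locale_allowed("@e[ns]-.+", "fr-fr"))   # False
```

How the real step combines inclusions and exclusions, and how it treats a mask such as en-* against a bare en tag, is not stated here, so the choices above are only one plausible reading.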

Select tokens to extract — Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the Select button to choose which token types should be generated, or list the token types, separating them with a comma. (The field automatically converts its content to uppercase.)
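
As a rough sketch of how such a field value could be interpreted, the hypothetical helper below splits the list on commas, uppercases each name, and falls back to every type from the table above when the field is empty. Raising an error on an unknown name is an assumption of this sketch; the document does not say how the step reacts.

```python
# Token type names taken from the table of recognized token types.
ALL_TOKEN_TYPES = [
    "WORD", "NUMBER", "WHITESPACE", "PUNCTUATION", "DATE", "TIME", "CURRENCY",
    "ABBREVIATION", "MARKUP", "E-MAIL", "INTERNET", "COMPANY", "EMOTICON",
    "IDEOGRAM", "KANA", "STOPWORD",
]


def parse_token_types(field_value):
    """Turn a comma-separated field value into a list of token type names."""
    names = [name.strip().upper() for name in field_value.split(",") if name.strip()]
    if not names:
        return list(ALL_TOKEN_TYPES)  # empty field: generate all types
    unknown = [name for name in names if name not in ALL_TOKEN_TYPES]
    if unknown:
        # Assumed behavior for illustration; the real step may simply ignore these.
        raise ValueError(f"Unknown token types: {', '.join(unknown)}")
    return names


if __name__ == "__main__":
    print(parse_token_types("word, number,e-mail"))
```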

Limitations

None known.
