Ysavourel: 1 revision imported

2016-06-04T23:20:00Z


Ysavourel at 12:06, 20 September 2010

2010-09-20T12:06:58Z

New page

{{Steps Header}}
__TOC__
==Overview==

This step creates annotations containing different tokenized forms of the text units.

Takes: Filter events. Sends: Filter events.

The following token types are recognized:

{| border="1" cellpadding="5" cellspacing="0"
|+
| '''Token Type'''
| '''Description'''
|- valign="top"
| <code>WORD</code>
| A run of characters constituting a word of the given language.
|-
| <code>NUMBER</code>
| Numbers, including any comma or decimal point separators.
|-
| <code>WHITESPACE</code>
| Whitespace characters as defined by the Unicode Consortium standards.
|-
| <code>PUNCTUATION</code>
| Punctuation characters as defined by the Unicode Consortium standards.
|-
| <code>DATE</code>
| Dates in the format MM/DD/YYYY.
|-
| <code>TIME</code>
| Times whose fields are separated by either "<code>:</code>" or "<code>.</code>" (24-hour, or 12-hour with AM or PM).
|-
| <code>CURRENCY</code>
| Sums in US dollars.
|-
| <code>ABBREVIATION</code>
| Abbreviations such as <code>pct</code> in <code>3.3pct</code>, <code>U.S.</code>, and <code>USD</code>.
|-
| <code>MARKUP</code>
| A run that begins with "<code><</code>" and ends with "<code>></code>" like in HTML and XML.
|-
| <code>E-MAIL</code>
| E-mail addresses.
|-
| <code>INTERNET</code>
| An Internet address (URI or IP): <code><nowiki>http://www.somesite.org/foo/index.html</nowiki></code>, <code>192.168.0.5</code>.
|-
| <code>COMPANY</code>
| Company names like AT&T, P&G, Johnson&Johnson.
|-
| <code>EMOTICON</code>
| Emoticon sequences like "<code>:-)</code>".
|-
| <code>IDEOGRAM</code>
| Ideograms as defined by the Unicode Consortium standards.
|-
| <code>KANA</code>
| Hiragana, Katakana (Japanese).
|-
| <code>STOPWORD</code>
| Stop words (frequent words that carry little meaning and are often filtered out by NLP tools).
|}
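
To make the table more concrete, here is a minimal Python sketch of how a few of these token types could be recognized with regular expressions. It is not the step's actual implementation: the names <code>TOKEN_PATTERNS</code> and <code>tokenize</code> are hypothetical, the patterns are simplified assumptions, and the sketch assigns exactly one type to each run of text, which the real step may not do.

```python
import re

# Illustrative patterns only. These are assumptions made for this sketch,
# not the rules used by the Tokenization Step itself.
TOKEN_PATTERNS = [
    ("DATE", re.compile(r"\d{2}/\d{2}/\d{4}")),               # MM/DD/YYYY
    ("TIME", re.compile(r"\d{1,2}[:.]\d{2}(?:\s?[AP]M)?")),   # 14:30, 2.30 PM
    ("E-MAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("EMOTICON", re.compile(r"[:;]-?[()DP]")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)*")),
    ("WORD", re.compile(r"[^\W\d_]+")),                       # letters only
    ("WHITESPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Return (token_type, lexeme) pairs, trying the patterns in list order."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # In this sketch, any unmatched character becomes PUNCTUATION.
            tokens.append(("PUNCTUATION", text[pos]))
            pos += 1
    return tokens
```

The order of the list matters in this sketch: more specific types such as <code>DATE</code> and <code>TIME</code> are tried before <code>NUMBER</code>, so that <code>06/04/2016</code> is not split into three numbers.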

==Parameters==

<cite>Tokenize source</cite> — Set this option to tokenize the content of the source.

<cite>Tokenize targets</cite> — Set this option to tokenize the content of the different selected targets.

<cite>Select languages, or specify a locale filter</cite> — Specify which locales should be tokenized. Leave the field empty to select all locales. Press the <cite>Select</cite> button to choose which languages should be tokenized, or type a locale filter string that follows these syntax rules:

* The locale filter string can list one or more locale descriptors.

* Locale descriptors should be separated with a space or comma.

* A locale descriptor represents a locale tag, mask, or [[Regular Expressions|regular expression]].

* The locale tag has one to three fields: language, region, and user data, separated by dashes. Examples:

en, en-us, en-us-win

* The mask contains the <code>*</code> wildcard in any or all of the three fields. Examples:

en-*, *-us, *-*-win

* The regular expression (in Java format) describes a set of allowed locale tags. Example: the following regular expression allows the same locales as the en-* and es-* masks, including the en-us, en-gb, and es-us locale tags.

e[ns]-.+

* To exclude a locale descriptor, prefix it with <code>!</code>. Example: all English locales except English (New Zealand).

en !en-nz

* A regular expression should be prefixed with <code>@</code>. Example:

@e[ns]-.+

* To provide flags for a regular expression, follow the expression with the decimal flags value prefixed with <code>^</code>. Example:

@e[ns]-.+ ^8
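
The rules above can be sketched as a small matcher. The following Python code is a hedged illustration, not the step's own parser: the function names <code>parse_filter</code> and <code>accepts</code> are hypothetical, and several behaviors are assumptions inferred from the examples. In particular, it assumes that fields missing from a descriptor act as wildcards (so that <code>en</code> covers <code>en-nz</code>), that comparison is case-insensitive for tags and masks, that a regular expression must match the whole locale tag, and that exclusions take priority over inclusions. The flags value is passed to Python's <code>re</code> module as is, and its numeric flag values do not all coincide with Java's.

```python
import re


def parse_filter(filter_string):
    """Parse a locale filter string into entries [excluded, kind, value, flags]."""
    entries = []
    # Descriptors are separated by spaces or commas.
    for token in re.split(r"[\s,]+", filter_string.strip()):
        if not token:
            continue
        # A ^N token supplies decimal flags for the preceding regular expression.
        if token.startswith("^") and entries and entries[-1][1] == "regex":
            entries[-1][3] = int(token[1:])
            continue
        excluded = token.startswith("!")
        if excluded:
            token = token[1:]
        if token.startswith("@"):
            entries.append([excluded, "regex", token[1:], 0])
        else:
            # Tags and masks are stored as lowercase field lists.
            entries.append([excluded, "mask", token.lower().split("-"), 0])
    return entries


def _matches(entry, locale):
    _, kind, value, flags = entry
    if kind == "regex":
        # Assumption: the expression must match the entire locale tag.
        return re.fullmatch(value, locale, flags) is not None
    fields = locale.lower().split("-")
    # Assumption: fields absent from the descriptor match anything.
    for index, wanted in enumerate(value):
        if wanted == "*":
            continue
        if index >= len(fields) or fields[index] != wanted:
            return False
    return True


def accepts(filter_string, locale):
    """Return True if the locale passes the filter (empty filter accepts all)."""
    entries = parse_filter(filter_string)
    included = [entry for entry in entries if not entry[0]]
    excluded = [entry for entry in entries if entry[0]]
    if any(_matches(entry, locale) for entry in excluded):
        return False
    return not included or any(_matches(entry, locale) for entry in included)
```

With these assumptions, <code>accepts("en !en-nz", "en-us")</code> is true and <code>accepts("en !en-nz", "en-nz")</code> is false. The sketch splits on whitespace and commas before looking at the descriptors, so a regular expression that itself contains a space or comma would not survive parsing here.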

<cite>Select tokens to extract</cite> — Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the <cite>Select</cite> button to choose which token types should be generated, or type the token types, separated by commas. (The field automatically converts the text to uppercase.)
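
As an illustration of how such a field value could be interpreted, here is a minimal Python sketch. The names <code>KNOWN_TOKEN_TYPES</code> and <code>parse_token_types</code> are hypothetical, and rejecting unknown names with an error is an assumption of this sketch rather than documented behavior of the step.

```python
# Token type names taken from the table in the Overview section.
KNOWN_TOKEN_TYPES = (
    "WORD", "NUMBER", "WHITESPACE", "PUNCTUATION", "DATE", "TIME",
    "CURRENCY", "ABBREVIATION", "MARKUP", "E-MAIL", "INTERNET",
    "COMPANY", "EMOTICON", "IDEOGRAM", "KANA", "STOPWORD",
)


def parse_token_types(field_value):
    """Turn a comma-separated field value into a list of token type names."""
    # Mirror the field's behavior of converting the entry to uppercase.
    names = [part.strip().upper() for part in field_value.split(",") if part.strip()]
    if not names:
        # An empty field selects every available token type.
        return list(KNOWN_TOKEN_TYPES)
    # Assumption: unknown names are reported instead of being ignored.
    unknown = [name for name in names if name not in KNOWN_TOKEN_TYPES]
    if unknown:
        raise ValueError("Unknown token types: " + ", ".join(unknown))
    return names
```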

==Limitations==

None known.

[[Category:Steps]]

