<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://okapiframework.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Tokenization_Step</id>
	<title>Tokenization Step - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://okapiframework.org/wiki/index.php?action=history&amp;feed=atom&amp;title=Tokenization_Step"/>
	<link rel="alternate" type="text/html" href="http://okapiframework.org/wiki/index.php?title=Tokenization_Step&amp;action=history"/>
	<updated>2026-04-17T15:46:16Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.38.2</generator>
	<entry>
		<id>http://okapiframework.org/wiki/index.php?title=Tokenization_Step&amp;diff=359&amp;oldid=prev</id>
		<title>Ysavourel: 1 revision imported</title>
		<link rel="alternate" type="text/html" href="http://okapiframework.org/wiki/index.php?title=Tokenization_Step&amp;diff=359&amp;oldid=prev"/>
		<updated>2016-06-04T23:20:00Z</updated>

		<summary type="html">&lt;p&gt;1 revision imported&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;1&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 19:20, 4 June 2016&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-notice&quot; lang=&quot;en&quot;&gt;&lt;div class=&quot;mw-diff-empty&quot;&gt;(No difference)&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</summary>
		<author><name>Ysavourel</name></author>
	</entry>
	<entry>
		<id>http://okapiframework.org/wiki/index.php?title=Tokenization_Step&amp;diff=358&amp;oldid=prev</id>
		<title>Ysavourel at 12:06, 20 September 2010</title>
		<link rel="alternate" type="text/html" href="http://okapiframework.org/wiki/index.php?title=Tokenization_Step&amp;diff=358&amp;oldid=prev"/>
		<updated>2010-09-20T12:06:58Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{Steps Header}}&lt;br /&gt;
__TOC__&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
This step creates annotations containing different tokenized forms of the text units.&lt;br /&gt;
&lt;br /&gt;
Takes: Filter events. Sends: Filter events.&lt;br /&gt;
&lt;br /&gt;
The following token types are recognized:&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellpadding=&amp;quot;5&amp;quot; cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
| '''Token Type'''&lt;br /&gt;
| '''Description'''&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| &amp;lt;code&amp;gt;WORD&amp;lt;/code&amp;gt;&lt;br /&gt;
| A run of characters constituting a word of the given language.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;NUMBER&amp;lt;/code&amp;gt;&lt;br /&gt;
| Numbers, including any comma or point separators.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;WHITESPACE&amp;lt;/code&amp;gt;&lt;br /&gt;
| Whitespace characters as defined by the Unicode Consortium standards.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;PUNCTUATION&amp;lt;/code&amp;gt;&lt;br /&gt;
| Punctuation characters as defined by the Unicode Consortium standards.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;DATE&amp;lt;/code&amp;gt;&lt;br /&gt;
| Dates in the format MM/DD/YYYY.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;TIME&amp;lt;/code&amp;gt;&lt;br /&gt;
| Times, with fields separated by either &amp;quot;&amp;lt;code&amp;gt;:&amp;lt;/code&amp;gt;&amp;quot; or &amp;quot;&amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt;&amp;quot; (24-hour, or 12-hour with AM or PM).&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;CURRENCY&amp;lt;/code&amp;gt;&lt;br /&gt;
| Sums in US dollars.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;ABBREVIATION&amp;lt;/code&amp;gt;&lt;br /&gt;
| Abbreviations, such as pct in 3.3pct, U.S., and USD.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;MARKUP&amp;lt;/code&amp;gt;&lt;br /&gt;
| A run that begins with &amp;quot;&amp;lt;code&amp;gt;&amp;amp;lt;&amp;lt;/code&amp;gt;&amp;quot; and ends with &amp;quot;&amp;lt;code&amp;gt;&amp;gt;&amp;lt;/code&amp;gt;&amp;quot;, as in HTML and XML.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;E-MAIL&amp;lt;/code&amp;gt;&lt;br /&gt;
| E-mail addresses.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;INTERNET&amp;lt;/code&amp;gt;&lt;br /&gt;
| An Internet address (URI or IP): &amp;lt;code&amp;gt;&amp;lt;nowiki&amp;gt;http://www.somesite.org/foo/index.html&amp;lt;/nowiki&amp;gt;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;192.168.0.5&amp;lt;/code&amp;gt;.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;COMPANY&amp;lt;/code&amp;gt;&lt;br /&gt;
| Company names like AT&amp;amp;T, P&amp;amp;G, Johnson&amp;amp;Johnson.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;EMOTICON&amp;lt;/code&amp;gt;&lt;br /&gt;
| Emoticon sequences like &amp;quot;&amp;lt;code&amp;gt;:-)&amp;lt;/code&amp;gt;&amp;quot;.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;IDEOGRAM&amp;lt;/code&amp;gt;&lt;br /&gt;
| Ideograms as defined by the Unicode Consortium standards.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;KANA&amp;lt;/code&amp;gt;&lt;br /&gt;
| Hiragana, Katakana (Japanese).&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;STOPWORD&amp;lt;/code&amp;gt;&lt;br /&gt;
| Stop words (frequent words that carry little meaning and are often filtered out by NLP tools).&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Parameters==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;cite&amp;gt;Tokenize source&amp;lt;/cite&amp;gt; &amp;amp;mdash; Set this option to tokenize the content of the source.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;cite&amp;gt;Tokenize targets&amp;lt;/cite&amp;gt; &amp;amp;mdash; Set this option to tokenize the content of the selected targets.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;cite&amp;gt;Select languages, or specify a locale filter&amp;lt;/cite&amp;gt; &amp;amp;mdash; Specify which locales should be tokenized. Leave the field empty to select all locales. Press the &amp;lt;cite&amp;gt;Select&amp;lt;/cite&amp;gt; button to choose which languages should be tokenized, or type in a locale filter string that follows these syntactic rules:&lt;br /&gt;
&lt;br /&gt;
* The locale filter string can list one or more locale descriptors.&lt;br /&gt;
&lt;br /&gt;
* Locale descriptors should be separated with a space or comma.&lt;br /&gt;
&lt;br /&gt;
* A locale descriptor represents a locale tag, mask, or [[Regular Expressions|regular expression]].&lt;br /&gt;
&lt;br /&gt;
* The locale tag has one to three fields: language, region, and user data, separated with a dash. Examples:&lt;br /&gt;
&lt;br /&gt;
 en, en-us, en-us-win&lt;br /&gt;
&lt;br /&gt;
* The mask contains the * wildcard in any or all of the three fields. Examples:&lt;br /&gt;
&lt;br /&gt;
 en-*, *-us, *-*-win&lt;br /&gt;
&lt;br /&gt;
* The regular expression (in Java format) describes a set of allowed locale tags. Example: the following regular expression accepts the locale tags covered by the en-* and es-* masks, such as en-us, en-gb, and es-us.&lt;br /&gt;
&lt;br /&gt;
 e[ns]-.+&lt;br /&gt;
&lt;br /&gt;
* To exclude a locale descriptor, prefix it with !. Example: all English locales except New Zealand English.&lt;br /&gt;
&lt;br /&gt;
 en !en-nz&lt;br /&gt;
&lt;br /&gt;
* A regular expression should be prefixed with @. Example:&lt;br /&gt;
&lt;br /&gt;
 @e[ns]-.+&lt;br /&gt;
&lt;br /&gt;
* To set flags for a regular expression, add the decimal flags value prefixed with ^. Example:&lt;br /&gt;
&lt;br /&gt;
 @e[ns]-.+ ^8&lt;br /&gt;
&lt;br /&gt;
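The regular-expression descriptor described above can be tried out with a minimal Java sketch. It assumes, without confirmation from this page, that the step matches the whole locale tag and that the decimal value after ^ is passed as &amp;lt;code&amp;gt;java.util.regex.Pattern&amp;lt;/code&amp;gt; flags (8 would then correspond to &amp;lt;code&amp;gt;Pattern.MULTILINE&amp;lt;/code&amp;gt;). The class and method names are hypothetical and are not part of the Okapi API.&lt;br /&gt;
&lt;br /&gt;
```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the Okapi API.
class LocaleRegexSketch {

    // Returns true when the entire locale tag matches the expression.
    // Assumption: whole-tag matching and java.util.regex flag values.
    static boolean accepts(String regex, int flags, String localeTag) {
        return Pattern.compile(regex, flags).matcher(localeTag).matches();
    }

    public static void main(String[] args) {
        // The descriptor "@e[ns]-.+" with the leading @ marker removed.
        String regex = "e[ns]-.+";
        String[] tags = {"en-us", "en-gb", "es-us", "en", "fr-fr"};
        for (String tag : tags) {
            // Flags value 0 means no flags are set.
            System.out.println(tag + " : " + accepts(regex, 0, tag));
        }
    }
}
```
&lt;br /&gt;
Under these assumptions, en-us, en-gb, and es-us are accepted, while en (no region field) and fr-fr are rejected.&lt;br /&gt;
&lt;br /&gt;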
&amp;lt;cite&amp;gt;Select tokens to extract&amp;lt;/cite&amp;gt; &amp;amp;mdash; Specify which types of tokens should be generated. Leave the field empty to generate all available token types. Press the &amp;lt;cite&amp;gt;Select&amp;lt;/cite&amp;gt; button to choose which token types should be generated, or list the token types, separated by commas. (The field automatically converts its content to uppercase.)&lt;br /&gt;
&lt;br /&gt;
==Limitations==&lt;br /&gt;
&lt;br /&gt;
None known.&lt;br /&gt;
&lt;br /&gt;
[[Category:Steps]]&lt;/div&gt;</summary>
		<author><name>Ysavourel</name></author>
	</entry>
</feed>