Okapi Framework - Steps

Term Extraction Step

- Overview
- Parameters
- Credits

If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=Term_Extraction_Step

Overview

This step generates an output file that lists the candidate terms found in the source content of the text units it processes.

Takes: Filter events
Sends: Filter events

The filter events are sent to the next step unchanged.

No term extraction is done on the entries set as non-translatable.

The step processes all input documents before creating the list of the candidate terms.

Parameters

Output path -- Enter the full path of the tab-delimited file to generate. The file produced is always in UTF-8. Each line contains one candidate term: the number of occurrences in the first column and the term itself in the second column. You can use the variable ${rootDir} in the path.
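As an illustration of the file layout described above, here is a minimal Python sketch that reads such a list back into (count, term) pairs. The function name read_term_file and the sample lines are hypothetical; only the layout (count, tab, term, UTF-8) comes from the description.

```python
import io

def read_term_file(stream):
    """Return (count, term) pairs from a tab-delimited term list."""
    entries = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # First column: number of occurrences; second column: the term.
        count, term = line.split("\t", 1)
        entries.append((int(count), term))
    return entries

# Hypothetical sample content, used here instead of a real output file.
sample = "5\tgnu free document license\n1\tgnu free document\n"
entries = read_term_file(io.StringIO(sample))
```

For a real file, the stream could be opened with open(path, encoding="utf-8-sig"), which reads UTF-8 whether or not a byte-order mark is present. Whether the step writes one is not stated here.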

Open the result file after completion -- Set this option to automatically open the result file after the process is complete.

Minimum number of words per term -- Enter the minimum number of words a sequence of words must have to be retained as a term.

Maximum number of words per term -- Enter the maximum number of words a sequence of words may have to be retained as a term. Keep this value to a reasonable maximum (6 or 7, or less).

Minimum number of occurrences per term -- Enter the minimum number of times a given sequence of words must appear in the overall text (all input files) to be retained as a term.

Preserve case differences -- Set this option to preserve all case differences when computing the sequences of words that may become terms. If this option is set, the two words "Term" and "term" are seen as different words; if it is not set, they are seen as the same word.
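The interplay of the word-count limits, the occurrence threshold, and the case option can be sketched in Python as follows. This is an illustration only, not the step's actual implementation: the function name, the default values, and the whitespace-based word splitting are all assumptions.

```python
from collections import Counter

def count_candidates(segments, min_words=2, max_words=3,
                     min_occurrences=2, preserve_case=False):
    """Count word sequences across all segments and keep the frequent ones."""
    counts = Counter()
    for segment in segments:
        words = segment.split()  # assumption: words are separated by spaces
        if not preserve_case:
            words = [w.lower() for w in words]  # "Term" and "term" merge
        for start in range(len(words)):
            for length in range(min_words, max_words + 1):
                if start + length > len(words):
                    break
                counts[" ".join(words[start:start + length])] += 1
    # Only sequences seen often enough in the overall text are retained.
    return {t: n for t, n in counts.items() if n >= min_occurrences}

# Hypothetical input segments.
segments = ["Free Document License", "free document license terms"]
folded = count_candidates(segments)
exact = count_candidates(segments, preserve_case=True)
```

With case differences ignored, the two segments share three sequences that each occur twice. With case preserved, no sequence reaches two occurrences, so nothing is retained.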

Remove entries that seem to be sub-strings of longer entries -- Set this option to perform an extra process that removes any candidate term that seems to be a sub-sequence of a longer sequence of words.

Given the following list of possible terms and their occurrence counts:

The entries retained (and their updated occurrence count) will be:

The entry "gnu free document" is retained because it occurs one more time than "gnu free document license". This means that in one case it stands on its own and not as a sub-sequence of the longer sequence. The entry "gnu free" is removed because it occurs 6 times, which corresponds to the sum of the times it can be found as a sub-sequence of another sequence which itself is not a sub-sequence (5 times in "gnu free document license" and once in "gnu free document").

Keep in mind that, in some cases, setting this auto-cleaning option may lead to losing valid entries, when some valid entries occur only as sub-sequences of longer, less valid entries.
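The clean-up reasoning above can be sketched in Python. This is one possible way to reproduce the described outcome, not necessarily how the step is implemented. The function names are hypothetical, and the input counts (6, 6, and 5) are the ones implied by the explanation.

```python
def contiguous_subsequences(words):
    """Yield every contiguous run of words shorter than the full sequence."""
    n = len(words)
    for length in range(n - 1, 0, -1):
        for start in range(n - length + 1):
            yield tuple(words[start:start + length])

def remove_subsequences(candidates):
    """Drop entries fully explained by longer retained entries."""
    remaining = {tuple(term.split()): count for term, count in candidates.items()}
    retained = {}
    # Longest entries first, so shorter ones are adjusted before being judged.
    for words in sorted(remaining, key=len, reverse=True):
        count = remaining[words]
        if count <= 0:
            continue  # every occurrence was inside a longer retained entry
        retained[" ".join(words)] = count
        # Simplification: a sub-sequence is discounted once per longer entry,
        # even if it appears several times inside that entry.
        for sub in set(contiguous_subsequences(words)):
            if sub in remaining:
                remaining[sub] -= count
    return retained

candidates = {
    "gnu free": 6,
    "gnu free document": 6,
    "gnu free document license": 5,
}
result = remove_subsequences(candidates)
```

Here "gnu free document license" keeps its 5 occurrences, "gnu free document" is left with the single occurrence where it stands on its own, and "gnu free" drops to zero and is removed.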

Sort the results by the number of occurrences -- Set this option to sort the results by the number of times each term occurs in the input files. If this option is not set the results are listed alphabetically.

Path of the file with stop words (leave empty for default) -- Enter the path of the text file containing the list of stop words to use. Leave the path empty to use the default (a list for English). Stop words are words that end a sequence of words even if that sequence has not yet reached the maximum number of words allowed.

Path of the file with not-start words (leave empty for default) -- Enter the path of the text file containing the list of not-start words to use. Leave the path empty to use the default (a list for English). Not-start words are words that do not appear at the beginning of a term (but they can appear within a term or at its end).

Path of the file with not-end words (leave empty for default) -- Enter the path of the text file containing the list of not-end words to use. Leave the path empty to use the default (a list for English). Not-end words are words that do not appear at the end of a term (but they can appear within a term or at its beginning).
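The roles of the three word lists can be illustrated with a short Python sketch. The word lists below are hypothetical examples, not the defaults shipped for English, and the functions are an assumed way to apply the rules rather than the step's own code.

```python
def split_on_stop_words(words, stop_words):
    """Break the word list into runs wherever a stop word occurs."""
    runs, current = [], []
    for word in words:
        if word in stop_words:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(word)
    if current:
        runs.append(current)
    return runs

def candidate_sequences(words, stop_words, not_start, not_end,
                        min_words=2, max_words=4):
    """List sequences that respect the stop, not-start and not-end rules."""
    found = []
    for run in split_on_stop_words(words, stop_words):
        for start in range(len(run)):
            for length in range(min_words, max_words + 1):
                if start + length > len(run):
                    break
                seq = run[start:start + length]
                # Such words may appear inside a term, but not at this edge.
                if seq[0] in not_start or seq[-1] in not_end:
                    continue
                found.append(" ".join(seq))
    return found

# Hypothetical lists and input text.
words = "the terms of the license are clear".split()
found = candidate_sequences(
    words,
    stop_words={"are"},
    not_start={"the", "of"},
    not_end={"the", "of"},
)
```

In this example "are" cuts the text into two runs, and the only surviving candidate is "terms of the license": "of" and "the" are accepted inside the term but rejected at either end.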

Credits

Special thanks to Jean-Christophe Hélary, Frank Kuhnke, Jaroslaw Michalak, Alessandra Muzzi, and other people from the Okapi Users Group for helping with improving this utility.