Term Extraction Step

From Okapi Framework

Overview

This step generates an output file that contains a list of possible terms found in the source content of the text units it processes.

Takes: Filter events. Sends: Filter events.

The step processes all input documents before creating the list of candidate terms. The filter events received are sent to the next step unchanged.

No term extraction is done on the entries set as non-translatable.

Possible terms are found using any of several methods: looking at Terminology annotations, looking at Text Analysis annotations, or performing simple statistical analysis on groups of tokens.

Parameters

Output path — Enter the full path of the tab-delimited file to generate. The file produced is always in UTF-8. It contains one candidate term per line, with the number of occurrences in the first column and the term itself in the second column. You can use the variable ${rootDir} in the path.
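As an illustration of the output format described above, here is a minimal Python sketch that reads such a file into (count, term) pairs. The function name read_term_file and the sample content are hypothetical; only the layout (UTF-8, tab-delimited, count first, term second) comes from the description above.

```python
import csv
import io

def read_term_file(stream):
    """Parse lines of the form '<count><TAB><term>' into (count, term) pairs."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    terms = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        terms.append((int(row[0]), row[1]))
    return terms

# Hypothetical sample content standing in for a generated file.
sample = "6\tgnu free document\n5\tgnu free document license\n"
terms = read_term_file(io.StringIO(sample))
print(terms)
# → [(6, 'gnu free document'), (5, 'gnu free document license')]
```

To read a real file, pass the result of open(path, encoding="utf-8", newline="") instead of the in-memory sample.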

Open the result file after completion — Set this option to automatically open the result file after the process is complete.

Sort the results by the number of occurrences — Set this option to sort the results by the number of times each term occurs in the input files. If this option is not set, the results are listed alphabetically.

Use Terminology annotations — Set this option to take into account terms that are defined using the Terminology annotation. With this method, if a text unit source has a TermsAnnotation annotation created by a filter or another step, the terms in that annotation are added to the list of output terms. Such annotations are created, for example, by the ITS rules of the Terminology data category when running the XML Filter.

Use Text Analysis annotations — Set this option to take into account entries defined using the Text Analysis annotation. With this method, if the source content of a text unit has a span annotated with the Text Analysis annotation, the span is added to the list of output terms. Such annotations are created, for example, by the Enrycher Step.

Use tokens-grouping statistics — Set this option to find term candidates by using simple statistical analysis on groups of tokens. With this method, a term is defined as a given sequence of words. You can specify the minimum and maximum number of words a term can be made of, as well as how many times the same sequence of words must appear to be considered a term that needs to be extracted. Note that no linguistic processing is applied with this method: for example, words are not reduced to their stems.

Minimum number of words per term — Enter the minimum number of words a sequence of words must have to be retained as a term. (Used only by the tokens-grouping method).

Maximum number of words per term — Enter the maximum number of words a sequence of words can have to be retained as a term. Keep this value reasonably low (6 or 7, or less). (Used only by the tokens-grouping method).

Minimum number of occurrences per term — Enter the minimum number of times a given sequence of words must appear in the overall text (all input files) to be retained as a term. (Used only by the tokens-grouping method).

Preserve case differences — Set this option to preserve all case differences when computing the sequences of words that may become terms. If this option is set, the two words "Term" and "term" are seen as different words; if it is not set, they are seen as the same word. (Used only by the tokens-grouping method).
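The interplay of the four options above can be sketched in Python as follows. This is only an illustration, not the step's actual implementation: the function name count_word_sequences is hypothetical, and splitting on whitespace is an assumption made here for brevity (the step's real tokenization is not described on this page).

```python
from collections import Counter

def count_word_sequences(text, min_words=2, max_words=3,
                         min_occurrences=2, preserve_case=False):
    """Count every run of min_words..max_words consecutive words."""
    words = text.split()  # assumption: naive whitespace tokenization
    if not preserve_case:
        words = [w.lower() for w in words]  # "Term" and "term" become one word
    counts = Counter()
    for size in range(min_words, max_words + 1):
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    # Keep only the sequences that occur often enough.
    return {seq: n for seq, n in counts.items() if n >= min_occurrences}

text = "Term base and term base entries use the term base"
print(count_word_sequences(text, min_words=2, max_words=2))
# → {'term base': 3}
print(count_word_sequences(text, min_words=2, max_words=2, preserve_case=True))
# → {'term base': 2}
```

With case differences preserved, "Term base" is counted separately from "term base" and falls below the minimum of two occurrences.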

Remove entries that seem to be sub-strings of longer entries — Set this option to perform an extra process that removes any candidate term that seems to be the sub-sequence of another longer sequence of words. (Used only by the tokens-grouping method).

Given the following list of possible terms and their occurrence counts:

  • gnu free = 6
  • gnu free document = 6
  • gnu free document license = 5

The entries retained (and their updated occurrence count) will be:

  • gnu free document = 1
  • gnu free document license = 5

The entry "gnu free document" is retained because it occurs one more time than "gnu free document license". This means that in one case it stands on its own and not as a sub-sequence of the longer sequence. The entry "gnu free" is removed because it occurs 6 times, which corresponds to the sum of the times it can be found as a sub-sequence of a retained longer sequence (5 times in "gnu free document license" and once in the stand-alone "gnu free document").

Keep in mind that, in some cases, setting this auto-cleaning option may lead to losing valid entries, when some valid entries occur only as sub-sequences of longer, less valid entries.
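The clean-up described in this example can be approximated with the following Python sketch. It is one way to reproduce the counts shown above, not necessarily the algorithm the step uses; the function names are hypothetical, and the sketch assumes a shorter entry occurs once in each occurrence of a longer entry that contains it.

```python
def is_sub_sequence(short, long):
    """True if the words of `short` appear contiguously inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def remove_sub_sequences(counts):
    """Subtract from each entry the occurrences already explained by longer entries."""
    # Visit longer entries first so their adjusted counts are known.
    ordered = sorted(counts, key=lambda term: len(term.split()), reverse=True)
    adjusted = {}
    for term in ordered:
        words = term.split()
        absorbed = sum(
            n for other, n in adjusted.items()
            if is_sub_sequence(words, other.split())
        )
        remaining = counts[term] - absorbed
        if remaining > 0:  # entries fully explained by longer ones are dropped
            adjusted[term] = remaining
    return adjusted

candidates = {
    "gnu free": 6,
    "gnu free document": 6,
    "gnu free document license": 5,
}
print(remove_sub_sequences(candidates))
# → {'gnu free document license': 5, 'gnu free document': 1}
```

The sketch also shows the risk mentioned above: "gnu free" disappears entirely, even if it would have been a valid term on its own.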

Path of the file with stop words (leave empty for default) — Enter the path of the text file containing the list of stop words to use. Leave the path empty to use the default (a list for English). Stop words are words that end a sequence of words even if that sequence has not yet reached the maximum number of words allowed. (Used only by the tokens-grouping method).

Path of the file with not-start words (leave empty for default) — Enter the path of the text file containing the list of not-start words to use. Leave the path empty to use the default (a list for English). Not-start words are words that do not appear at the beginning of a term (but they can appear within a term or at its end). (Used only by the tokens-grouping method).

Path of the file with not-end words (leave empty for default) — Enter the path of the text file containing the list of not-end words to use. Leave the path empty to use the default (a list for English). Not-end words are words that do not appear at the end of a term (but they can appear within a term or at its beginning). (Used only by the tokens-grouping method).
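To make the roles of the three lists concrete, here is a small Python sketch of how they could constrain candidate sequences. It is an interpretation of the definitions above rather than the step's actual code: the function name, the tiny word lists and the whitespace tokenization are all assumptions made for the example.

```python
def candidate_sequences(words, min_words, max_words,
                        stop_words, not_start, not_end):
    """List the word sequences allowed by the three word lists."""
    candidates = []
    for start, first in enumerate(words):
        if first in stop_words or first in not_start:
            continue  # a term cannot begin with this word
        for end in range(start, min(start + max_words, len(words))):
            word = words[end]
            if word in stop_words:
                break  # a stop word ends the sequence before max_words is reached
            size = end - start + 1
            if size >= min_words and word not in not_end:
                candidates.append(" ".join(words[start:end + 1]))
    return candidates

# Hypothetical lists, much shorter than any real default list.
STOP_WORDS = {"is"}
NOT_START = {"of", "the"}
NOT_END = {"of", "the"}

words = "the end of the file is reached".split()
print(candidate_sequences(words, 2, 4, STOP_WORDS, NOT_START, NOT_END))
# → ['end of the file']
```

Here "of" and "the" are allowed inside the term but not at its edges, and "is" cuts the sequence short, so "file is reached" is never considered.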

Note: The default English lists are currently embedded in the compiled resources of the step. In normal distributions, they are in the JAR file okapi-lib-VERSION.jar, located in the lib sub-directory of the distribution. The path of the lists in that JAR file is: net\sf\okapi\steps\termextraction. You can also see the lists in the Git repository.

Limitations

  • None known.

Credits

Special thanks to Jean-Christophe Hélary, Frank Kuhnke, Jaroslaw Michalak, Alessandra Muzzi, and other people from the Okapi Users Group for helping to improve this step.