Plain Text Filter

Overview

The Plain Text filter is an Okapi component that implements the IFilter interface for plain text documents. It is implemented in the class net.sf.okapi.filters.plaintext.PlainTextFilter of the Okapi library.

The filter processes text files in ANSI and Unicode encodings (such as UTF-8 and UTF-16) and detects a byte-order mark when one is present.

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

  • If the file has a Unicode Byte-Order-Mark:
    • Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
  • Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
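
The decision above can be sketched as follows. This is a minimal illustration, not the filter's actual code: the class BomSniffer and the method chooseEncoding are hypothetical names, and only the UTF-8 and UTF-16 byte-order marks are checked.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the Okapi API.
class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or defaultEncoding if no BOM is found. */
    static Charset chooseEncoding(byte[] head, Charset defaultEncoding) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // No BOM: fall back to the default encoding set in the filter options.
        return defaultEncoding;
    }

    /** Compares the first bytes of data, read as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Passing the first few bytes of the file and the configured default encoding returns the encoding to read the file with.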

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
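
Expressed as code, the two rules for UTF-8 output reduce to a single condition. The sketch below assumes the output encoding is already known to be UTF-8; OutputBom and utf8OutputNeedsBom are hypothetical names, not Okapi API.

```java
// Hypothetical helper for illustration; not part of the Okapi API.
class OutputBom {
    /**
     * Assumes the output encoding is UTF-8.
     * A BOM is written only when the input was UTF-8 and carried a BOM itself.
     */
    static boolean utf8OutputNeedsBom(String inputEncoding, boolean inputHadBom) {
        boolean inputIsUtf8 = "UTF-8".equalsIgnoreCase(inputEncoding);
        return inputIsUtf8 && inputHadBom;
    }
}
```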

Line-Breaks

The output uses the same type of line break as the original input.

Parameters

General Tab

Extraction Mode

The filter provides three ways to extract translatable text:

Extract by paragraphs — If the input text contains paragraphs (groups of lines separated by one or more empty lines), the filter extracts the lines of each paragraph as a single text unit. Line breaks inside a paragraph are treated according to the settings in the Multi-Line Text Units group of the Options tab.
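
As a rough sketch of paragraph grouping, the following splits a text into paragraphs at runs of empty lines. ParagraphSplitter is a hypothetical name, and two choices are assumptions made for the example: only zero-length lines count as empty, and the lines of a paragraph are joined with \n.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Okapi API.
class ParagraphSplitter {
    /** Groups consecutive non-empty lines into paragraphs; empty lines act as separators. */
    static List<String> paragraphs(String text) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n", -1)) {
            if (line.isEmpty()) {
                // An empty line closes the paragraph in progress, if there is one.
                if (current.length() > 0) {
                    units.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) current.append('\n');
                current.append(line);
            }
        }
        if (current.length() > 0) units.add(current.toString());
        return units;
    }
}
```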

Extract by lines — Every line of text is extracted as a separate text unit.

Extract with a rule — Text is extracted and text units are created according to the regular expression provided in the Extraction Rule group.

Spliced Lines

This group is active only in the Extract by lines mode.

If the input text contains spliced lines (i.e. lines that continue on the next line, as in the source code of some programming languages), the filter merges such lines, optionally replacing each splicer character with a newly created inline code.

Splicer — The character that, when placed at the end of a line, signals that the text of the line continues after the line break. You can choose backslash or underscore, or specify a custom splicer.

Create inline codes for splicers — create an inline code in place of every splicer in the text.
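
To illustrate line merging, the sketch below joins every line that ends with the splicer to the line that follows it. LineSplicer is a hypothetical name. The example simply drops the splicer and does not model the inline code option, and it assumes the splicer must be the very last character of the line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Okapi API.
class LineSplicer {
    /** Merges each line ending with the splicer into the following line. */
    static List<String> merge(List<String> lines, char splicer) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String line : lines) {
            boolean spliced = !line.isEmpty() && line.charAt(line.length() - 1) == splicer;
            if (spliced) {
                // Keep the text without the trailing splicer and wait for the continuation.
                pending.append(line, 0, line.length() - 1);
            } else {
                pending.append(line);
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        // A trailing spliced line with no continuation is kept as it stands.
        if (pending.length() > 0) merged.add(pending.toString());
        return merged;
    }
}
```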

Extraction Rule

Regular expression — The Java regular expression used to extract translatable text. You can also set regex flags with the three check boxes.

Source group — If the regular expression contains groups, the spinner sets the index of the group whose match contains the text to be extracted.

Sample — Enter any text to test the regular expression. The extracted text is shown enclosed in square brackets below the sample.
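
The interaction between the regular expression, its flags, and the source group can be sketched with the standard java.util.regex classes. RuleExtractor is a hypothetical name. Which flags the three check boxes correspond to is not stated here, so the flags are passed in as a plain int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Okapi API.
class RuleExtractor {
    /** Returns the text captured by the given group for every match of the rule. */
    static List<String> extract(String text, String regex, int flags, int sourceGroup) {
        List<String> units = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            String captured = m.group(sourceGroup);
            // A group that did not take part in the match yields null and is skipped.
            if (captured != null) units.add(captured);
        }
        return units;
    }
}
```

For instance, with the text "key1=Hello" and "key2=World" on two lines, the expression ^\w+=(.*)$ compiled with Pattern.MULTILINE and source group 1 yields the two values Hello and World.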

Options Tab

Text Unit Processing

Allow trimming — For the CSV table type, this option works together with the CSV actions option Exclude leading/trailing white spaces from extracted text.

  • Trim leading spaces and tabs - if selected, spaces and tabs at the start of the extracted text are removed.
  • Trim trailing spaces and tabs - if selected, spaces and tabs at the end of the extracted text are removed.

Convert \t \n \\ \uXXXX into characters — If selected, escape sequences are converted to regular characters.
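
A conversion of this kind can be sketched as a single pass over the text. Unescaper is a hypothetical name, and leaving unrecognized or incomplete sequences unchanged is an assumption of the example, not documented filter behavior.

```java
// Hypothetical helper for illustration; not part of the Okapi API.
class Unescaper {
    /** Converts the tab, line feed, backslash, and four-hex-digit Unicode escapes. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == 't') { out.append('\t'); i += 2; continue; }
                if (next == 'n') { out.append('\n'); i += 2; continue; }
                if (next == '\\') { out.append('\\'); i += 2; continue; }
                if (next == 'u' && i + 6 <= s.length() && isHex4(s, i + 2)) {
                    out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
            }
            // Anything else, including an unrecognized escape, is copied unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** True if the four characters starting at start are all hexadecimal digits. */
    private static boolean isHex4(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }
}
```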

Multi-Line Text Units

When the extracted text spans several lines, this group controls how the lines are combined into a single text unit:

Separate lines with line feeds — Multiple lines are extracted as a single text run, with \n separating the original lines.

Unwrap lines — Multiple lines are merged into a single text run, with a space inserted between the original lines.

Create inline codes for line breaks — Multiple lines are extracted as a single text run, with an inline code that contains the original line break separating the lines.
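
The first two options differ only in the separator placed between the original lines, as the sketch below shows. LineJoiner and its Mode enum are hypothetical names. The inline code option is left out because it needs an inline code object rather than a plain string.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of the Okapi API.
class LineJoiner {
    enum Mode { LINE_FEEDS, UNWRAP }

    /** Combines the lines of one text unit according to the chosen mode. */
    static String join(List<String> lines, Mode mode) {
        // LINE_FEEDS keeps the line structure; UNWRAP flattens it with spaces.
        String separator = (mode == Mode.LINE_FEEDS) ? "\n" : " ";
        return String.join(separator, lines);
    }
}
```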

Inline Codes

Has inline codes as defined below — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code.

Add — Click this button to add a new rule.

Remove — Click this button to remove the current rule.

Move Up — Click this button to move the current rule upward.

Move Down — Click this button to move the current rule downward.

[Top-right text box] — Enter the regular expression for the current rule. Use the Modify button to enter the edit mode. The expression must be a valid regular expression. Its syntax and effect are checked automatically: the expression is tested against the test data in the text box below, and the result is shown in the bottom-right text box.

Modify — Click this button to edit the expression of the current rule. This button is labeled Accept when you are in edit mode.

Accept — Click this button to save any changes you have made to the expression and leave the edit mode. This button is labeled Modify when you are not in edit mode.

Discard — Click this button to leave the edit mode and revert the current rule to the expression it had before you started the edit mode.

Patterns — Click this button to display some help on regular expression patterns.

Test using all rules — Set this option to test all the rules of the set at the same time; if it is not set, only the current rule is tested. The syntax of the current rule is checked automatically. The result of the test is displayed in the bottom-right text box, where the parts of the sample text that match the expressions are shown in <> brackets.

[Middle-right text box] — Optional test data to test the regular expression for the current rule or all rules depending on the Test using all rules option.

[Bottom-right text box] — Shows the result of the regular expression applied to the test data.
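
The test preview described in this section can be approximated as follows. InlineCodePreview is a hypothetical name, and combining the rules by alternation is an assumption of the sketch; the source does not say how the filter applies several rules together.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Okapi API.
class InlineCodePreview {
    /** Wraps every match of any rule in angle brackets, as in the result box. */
    static String mark(String text, List<String> rules) {
        if (rules.isEmpty()) return text;
        // Assumption: the rules are tried together as alternatives, in list order.
        StringBuilder combined = new StringBuilder();
        for (String rule : rules) {
            if (combined.length() > 0) combined.append('|');
            combined.append("(?:").append(rule).append(')');
        }
        Matcher m = Pattern.compile(combined.toString()).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start()).append('<').append(m.group()).append('>');
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }
}
```

Passing a one-element list corresponds to testing only the current rule; passing the full list corresponds to the Test using all rules option.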

Limitations

None known.