Table Filter

From Okapi Framework

Overview

The Table Filter is an Okapi component that implements the IFilter interface for table-like plain text documents, such as CSV, TSV, and fixed-width column files. The filter is implemented in the class net.sf.okapi.filters.table.TableFilter of the Okapi library.

Processing Details

Input Encoding

The filter decides which encoding to use for the input file using the following logic:

  • If the file has a Unicode Byte-Order-Mark, the corresponding encoding (e.g. UTF-8, UTF-16) is used.
  • Otherwise, the input encoding used is the default encoding that was specified when setting the filter options.
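The decision above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the filter's actual code: the function name pick_input_encoding and the returned encoding labels are assumptions made for the example.

```python
import codecs

# Byte-order marks, checked longest-first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_input_encoding(head: bytes, default_encoding: str) -> str:
    """Return the encoding implied by a BOM, else the configured default.

    `head` is the first few bytes of the input file (hypothetical helper).
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default_encoding
```

For a file without a Byte-Order-Mark, the sketch simply returns whatever default was passed in, which mirrors the second bullet above.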

Output Encoding

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
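The two rules above reduce to a small decision function. The sketch below is illustrative only; the function name and the None return value for output encodings other than UTF-8 (which this page does not describe) are assumptions.

```python
from typing import Optional


def _is_utf8(name: str) -> bool:
    # Accept common spellings such as "UTF-8", "utf8", "utf_8".
    return name.lower().replace("-", "").replace("_", "") == "utf8"


def utf8_output_needs_bom(
    input_encoding: str, output_encoding: str, input_had_bom: bool
) -> Optional[bool]:
    """Decide whether a BOM is written, following the two rules above."""
    if not _is_utf8(output_encoding):
        return None  # behaviour for other output encodings is not described here
    if _is_utf8(input_encoding):
        return input_had_bom  # keep a BOM only if the input had one
    return False  # non-UTF-8 input: no BOM in UTF-8 output
```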

Line-Breaks

The output uses the same type of line-break as the original input.

Parameters

Table Tab

Table Type

CSV — Select this option to work with formats where the columns are separated by a single character such as a comma, a semi-colon, a tab, etc.

TSV — Select this option to work with formats where the columns are separated by one or more tabs (i.e. two consecutive tabs do not mark an empty column). Note that for formats where the column separator is a single tab you should select CSV with a tab as the separator.
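The difference between CSV with a tab separator and TSV shows up when two tabs follow each other. The following Python sketch illustrates the distinction under simplifying assumptions: text qualifiers are ignored, and the function names are invented for the example.

```python
import re


def split_csv_row(line: str, delimiter: str = ",") -> list:
    # CSV: every single delimiter ends a field, so two consecutive
    # delimiters enclose an empty field.
    return line.split(delimiter)


def split_tsv_row(line: str) -> list:
    # TSV: a run of one or more tabs counts as a single separator,
    # so two consecutive tabs do not produce an empty column.
    return re.split(r"\t+", line)
```

With the input "a", two tabs, "c", the CSV-style split yields three fields (the middle one empty), while the TSV-style split yields two.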

Fixed-width columns — Select this option to work with formats where each column has a fixed width.

Table Properties

When the table file contains a header with column names and optionally other information, you can specify which line contains the column names and at which line the table data start.

Values start at line — Specify the line number of the first table row (default: 1, i.e. the data start at the beginning of the file and there is no header).

Line with column names — Specify the number of the line containing column names (default 0, i.e. no line with column names).

Lines are numbered from 1. The default settings (1 and 0) describe a table that has no header and no line with column names, where the data start at the beginning of the file.

If the first line of the table contains column names and the following lines (from line 2 on) contain table data, as in most CSV files, set Values start at line to 2 and Line with column names to 1.
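As a rough illustration of how the two 1-based line numbers partition a file, consider the following Python sketch. The function and parameter names are hypothetical and only mirror the option labels.

```python
def split_header_and_data(lines, values_start_line=1, column_names_line=0):
    """Partition lines using 1-based line numbers, as in the two options.

    Returns (header_lines, column_names_line_text_or_None, data_lines).
    """
    header = lines[: values_start_line - 1]
    names = None
    if 0 < column_names_line <= len(lines):
        names = lines[column_names_line - 1]
    data = lines[values_start_line - 1 :]
    return header, names, data


sample = ["id,text", "1,Hello", "2,World"]
# Typical CSV file: column names on line 1, values from line 2.
header, names, data = split_header_and_data(sample, 2, 1)
```

With the defaults (1 and 0), the header is empty, there is no column names line, and every line is treated as data.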

CSV Options

Field delimiter — Character separating fields in a row. Default is comma (,).

Text qualifier — Character placed before and after a field value so that the value can contain field delimiters. For instance, this field is not split into parts even though the comma is the field delimiter: ["Field, containing comma"]. Default is the quotation mark (").

CSV Escaping Mode

If a field contains the active text qualifier (e.g. a quotation mark), every occurrence of that qualifier inside the field must be escaped. For instance, ["Text, ""quoted text"""] or ["Text, \"quoted text\""].

Duplicate qualifier — Escaping is performed by duplication of the active qualifier set in the CSV options group, e.g. ["Text, ""quoted text"""].

Backslash — Escaping is performed by prefixing all occurrences of the active qualifier with the backslash character (\), e.g. ["Text, \"quoted text\""].
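The two escaping modes can be compared side by side with a short Python sketch. The function names and the mode strings "duplicate" and "backslash" are assumptions chosen for the example; they are not the filter's parameter names.

```python
def escape_qualifier(value: str, qualifier: str = '"', mode: str = "duplicate") -> str:
    """Escape every occurrence of the qualifier inside a field value."""
    if mode == "duplicate":
        return value.replace(qualifier, qualifier * 2)
    if mode == "backslash":
        return value.replace(qualifier, "\\" + qualifier)
    raise ValueError("unknown escaping mode: " + mode)


def qualify(value: str, qualifier: str = '"', mode: str = "duplicate") -> str:
    """Escape the value, then wrap it in qualifiers."""
    return qualifier + escape_qualifier(value, qualifier, mode) + qualifier
```

Applied to the value Text, "quoted text", the sketch reproduces the two bracketed examples given above, one per mode.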

CSV Actions

Exclude qualifiers from extracted text — If selected, qualifiers are removed from the extracted text and stored in the text unit (TU) skeleton.

Exclude leading/trailing white spaces from extracted text — If selected, leading and trailing white space is trimmed according to the trimming mode:

  • Only entries without qualifiers — only non-qualified field values are trimmed; leading and trailing spaces remain in qualified fields (e.g. [" text "] becomes [ text ], and [ non-qualified ] becomes [non-qualified]).
  • All — both non-qualified and qualified field values are trimmed of leading and trailing spaces (e.g. [" text "] becomes [text], and [ non-qualified ] becomes [non-qualified]).
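The two trimming modes can be sketched as follows. This example assumes that qualifiers are also excluded from the extracted text (as in the bracketed examples above) and that the qualifier is a non-empty string; the function name and the mode strings are invented for the illustration.

```python
def extract_field(raw: str, qualifier: str = '"', trim_mode: str = "unqualified") -> str:
    """Strip qualifiers, then trim according to the trimming mode.

    trim_mode "unqualified": trim only values that had no qualifiers.
    trim_mode "all": trim qualified and non-qualified values alike.
    """
    if trim_mode not in ("unqualified", "all"):
        raise ValueError("unknown trimming mode: " + trim_mode)
    q = len(qualifier)  # assumed to be at least 1
    qualified = (
        len(raw) >= 2 * q and raw.startswith(qualifier) and raw.endswith(qualifier)
    )
    text = raw[q:-q] if qualified else raw
    if trim_mode == "all" or not qualified:
        text = text.strip()
    return text
```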

Add qualifiers to output when appropriate — If selected, qualifiers are added on output (if not already present) to any value whose text contains a field delimiter or a line break.
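A minimal sketch of this output rule, in Python, might look like the following. The function name is hypothetical, and the sketch deliberately leaves out escaping of qualifiers inside the value, which the escaping mode would handle.

```python
def qualify_if_needed(value: str, delimiter: str = ",", qualifier: str = '"') -> str:
    """Wrap the value in qualifiers only when its text requires it."""
    already_qualified = (
        len(value) >= 2 * len(qualifier)
        and value.startswith(qualifier)
        and value.endswith(qualifier)
    )
    needs_qualifiers = delimiter in value or "\n" in value or "\r" in value
    if needs_qualifiers and not already_qualified:
        return qualifier + value + qualifier
    return value
```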

Extraction Mode

If the table contains a header (i.e. one or more lines at the beginning of the file containing a description of the data, names of fields, etc.), you can specify whether you want to extract the header data and/or data from the table body.

Extract header lines — When selected, you can choose among these options:

  • Column names only — only column names will be sent as separate TextUnits, one for every column name.
  • All — all header lines will be sent as TUs (the column names line will be sent as a series of TUs for every column name, other lines will be sent as one TU for every line).

Extract table data — When selected, TUs will be created for the table data (values in the table body), one TU for every row/column value.

Columns Tab

Extraction Mode

The extraction mode tells the filter which columns contain translatable text to be extracted and placed in text units. Text in the other columns is placed in the skeleton.

Extract from all columns — All columns contain translatable text.

Extract by column definitions — The filter detects the translatable text based on column definitions provided in the Column definitions table (see below).

Number of Columns

This group tells the filter how to detect the number of columns in a table.

Defined by values — Number of columns is detected for every individual row, not for the whole table. If rows contain different numbers of values, different numbers of TUs are sent for those rows.

Defined by column names — Number of columns in the table is determined by the number of column names. If the number of actual values in a row exceeds the number of column names, values in extra columns are dropped. If expected data are missing in a row, empty TUs are created for the missing columns.

Fixed number of columns — Number of columns is explicitly specified by the spinner value (1-100, default 2). Extra columns are dropped, empty TUs are created for missing columns.
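The three modes differ only in where the expected column count comes from. The Python sketch below illustrates the drop-and-pad behaviour; the function name is an assumption, and empty strings stand in for the empty TUs mentioned above.

```python
def normalize_row(values, expected_columns=None):
    """Fit one row to the expected number of columns.

    expected_columns is None        -> "Defined by values": keep the row as is.
    expected_columns = len(names)   -> "Defined by column names".
    expected_columns = spinner value -> "Fixed number of columns".
    """
    if expected_columns is None:
        return list(values)
    kept = list(values[:expected_columns])         # extra columns are dropped
    kept += [""] * (expected_columns - len(kept))  # missing columns become empty
    return kept
```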

Column Definitions

You can add or modify definitions for columns of your table.

Every column has a 1-based index and a type:

  • Source — the column contains text in a source language.
  • Source ID — the column provides a unique ID for a source column. This ID becomes the name of the created text unit resource.
  • Target — the column contains text in target language for a given source column.
  • Comment — the column contains a comment for a specified source column.
  • Record ID — the column provides an ID for the current record (row).

Every row in the table is considered a record; a row can span several lines when text qualifiers are used. Every record can have a record ID (e.g. a database table primary key). A table can have not only several target columns for one source, but also several source columns. To tell source columns apart, you can specify an ID suffix for each of them. If a given source column has no source ID attached (in a source ID column), the filter appends the ID suffix of that source column to the record ID, thus creating a name for the text unit.
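The naming rule can be illustrated with a small Python sketch. The function name, the sample values, and the choice to return None when neither ID is available are assumptions; this page does not say what the filter does in that last case.

```python
from typing import Optional


def text_unit_name(
    source_id: Optional[str], record_id: Optional[str], id_suffix: str
) -> Optional[str]:
    """Name a text unit from a source ID, or from record ID plus suffix."""
    if source_id:
        return source_id              # a Source ID column takes precedence
    if record_id:
        return record_id + id_suffix  # otherwise: record ID + ID suffix
    return None                       # assumption: no name can be built


# Hypothetical record with ID "42" and two source columns whose
# ID suffixes are "_title" and "_body".
names = [text_unit_name(None, "42", s) for s in ("_title", "_body")]
```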

Options Tab

Text Unit Processing

Allow trimming — For the CSV table type, this option works together with the CSV action Exclude leading/trailing white spaces from extracted text.

  • Trim leading spaces and tabs - if selected, spaces and tabs are removed from the start of the extracted text.
  • Trim trailing spaces and tabs - if selected, spaces and tabs are removed from the end of the extracted text.

Convert \t \n \\ \uXXXX into characters — If selected, escape sequences are converted to regular characters.
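A conversion of this kind can be sketched in Python with a single left-to-right pass, so that an escaped backslash followed by the letter t is not misread as a tab. The function name is hypothetical, and the sketch does not combine surrogate pairs written as two \uXXXX sequences.

```python
import re

# One escape: a backslash followed by t, n, another backslash,
# or u plus exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|[tn\\])")


def convert_escapes(text: str) -> str:
    """Replace \\t, \\n, \\\\ and \\uXXXX sequences with real characters."""

    def replace(match):
        token = match.group(1)
        if token == "t":
            return "\t"
        if token == "n":
            return "\n"
        if token == "\\":
            return "\\"
        return chr(int(token[1:], 16))  # uXXXX -> code point

    return _ESCAPE.sub(replace, text)
```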

Multi-Line Text Units

When the extracted text spans several lines, this group controls how the lines are combined into a single text unit:

Separate lines with line feeds — multiple lines are extracted as a single text run, with \n separating the original lines.

Unwrap lines — multiple lines are merged into a single text run, with a space inserted between the original lines.

Create inline codes for line breaks — multiple lines are extracted as a single text run, with an inline code that contains the original line break separating the lines.
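The three options can be contrasted with a short Python sketch. The function names are invented, and the list of ("text", ...) and ("code", ...) pairs is only a stand-in for a text unit with inline codes, not the Okapi data model.

```python
import re

# Any of the usual line-break styles, captured so split() keeps them.
_BREAK = re.compile(r"(\r\n|\r|\n)")


def with_line_feeds(text: str) -> str:
    # "Separate lines with line feeds": every break becomes \n.
    return _BREAK.sub("\n", text)


def unwrapped(text: str) -> str:
    # "Unwrap lines": every break becomes a single space.
    return _BREAK.sub(" ", text)


def with_inline_codes(text: str) -> list:
    # "Create inline codes for line breaks": the original break is
    # preserved as a separate code item between the text parts.
    return [
        ("code" if _BREAK.fullmatch(part) else "text", part)
        for part in _BREAK.split(text)
        if part
    ]
```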

Inline Codes

Has inline codes as defined below — Set this option to use the specified regular expressions on the text of the extracted items. Any match will be converted to an inline code.

Add — Click this button to add a new rule.

Remove — Click this button to remove the current rule.

Move Up — Click this button to move the current rule upward.

Move down — Click this button to move the current rule downward.

[Top-right text box] — Enter the regular expression for the current rule. Use the Modify button to enter the edit mode. The expression must be a valid regular expression. You can check its syntax and its effect as you type: the expression is automatically tested against the test data in the text box below, and the result is shown in the bottom-right text box.

Modify — Click this button to edit the expression of the current rule. This button is labeled Accept when you are in edit mode.

Accept — Click this button to save any changes you have made to the expression and leave the edit mode. This button is labeled Modify when you are not in edit mode.

Discard — Click this button to leave the edit mode and revert the current rule to the expression it had before you started the edit mode.

Patterns — Click this button to display some help on regular expression patterns.

Test using all rules — Set this option to test all the rules at the same time. If the option is set, the test takes all rules of the set into account; if it is not set, only the current rule is tested. The syntax of the current rule is checked automatically. The results of the test are displayed in the bottom-right text box, where the parts of the text that match the expressions are shown in <> brackets.

[Middle-right text box] — Optional test data to test the regular expression for the current rule or all rules depending on the Test using all rules option.

[Bottom-right text box] — Shows the result of the regular expression applied to the test data.
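The test display described in this section can be approximated with Python's re module. This is a sketch only: the sample rules are hypothetical, Python's regular expression syntax may differ in details from the one the filter uses, and combining all rules into one alternation is an assumption about how "all rules" are applied.

```python
import re


def mark_matches(sample: str, rules: list, current: int = 0, use_all_rules: bool = False) -> str:
    """Show the sample text with every match enclosed in <> brackets."""
    active = rules if use_all_rules else [rules[current]]
    # Each rule is wrapped in a non-capturing group before being joined.
    pattern = re.compile("|".join("(?:" + rule + ")" for rule in active))
    return pattern.sub(lambda m: "<" + m.group(0) + ">", sample)


# Hypothetical rules: printf-style placeholders and numbered variables.
rules = [r"%[sd]", r"\{\d+\}"]
current_only = mark_matches("Copy %s to {0}", rules)
all_rules = mark_matches("Copy %s to {0}", rules, use_all_rules=True)
```

An invalid expression makes re.compile raise re.error, which is one simple way to perform the kind of syntax check mentioned for the current rule.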

Limitations

None known.