Regex Filter

Overview

The Regex Filter is an Okapi component that implements the IFilter interface for any text-based format where the text can be captured using regular expressions. The filter is implemented in the class net.sf.okapi.filters.regex.RegexFilter.

The filter can work with any text-based document. You define rules with regular expressions that indicate what part of the document to process. Each rule is associated with an action telling the filter what to do with the different capturing groups of its regular expression.

For example, if you have the following input document:

[ID1]=Text for ID1
[ID2]:Text for ID2

...and a rule with the following regular expression:

^\[(.*?)](=|:)(.*?)$

...and that rule is set to the action Extract the content of the source group, with capturing group 3 assigned to the source group and capturing group 1 assigned to the identifier group...

...then:

  • Each line in the input document will match the rule.
  • A new text unit will be created for each match, with its name set to the content of the capturing group 1, and its source text set to the content of the capturing group 3.
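The matching step of this example can be tried outside the filter with plain java.util.regex. The following is a minimal sketch, not the filter's own code: the class name OverviewRuleDemo and the method extract are hypothetical, and the sketch assumes the Multi-line option is set so that ^ and $ match at line boundaries.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverviewRuleDemo {
    // The rule from the example. MULTILINE is an assumption of this sketch:
    // it lets ^ and $ match at the start and end of each line.
    static final Pattern RULE =
        Pattern.compile("^\\[(.*?)](=|:)(.*?)$", Pattern.MULTILINE);

    // Maps each identifier (group 1) to its source text (group 3), in document order.
    static Map<String, String> extract(String document) {
        Map<String, String> units = new LinkedHashMap<>();
        Matcher m = RULE.matcher(document);
        while (m.find()) {
            units.put(m.group(1), m.group(3));
        }
        return units;
    }

    public static void main(String[] args) {
        String doc = "[ID1]=Text for ID1\n[ID2]:Text for ID2\n";
        // Prints one "name -> source" line per match.
        extract(doc).forEach((name, source) -> System.out.println(name + " -> " + source));
    }
}
```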

And if you were to represent the parsed information in XLIFF, it would look something like this:

...
<body>
 <trans-unit id="1" resname="ID1" xml:space="preserve">
  <source xml:lang="en">Text for ID1</source>
 </trans-unit>
 <trans-unit id="2" resname="ID2" xml:space="preserve">
  <source xml:lang="en">Text for ID2</source>
 </trans-unit>
</body>
...

Processing Details

Input Encoding

The filter decides which encoding to use for the input document using the following logic:

  • If the file has a Unicode Byte-Order-Mark:
    • Then, the corresponding encoding (e.g. UTF-8, UTF-16, etc.) is used.
  • Otherwise, the input encoding used is the default encoding that was specified when opening the document.
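The decision above can be sketched as follows. This is an illustration only, not the filter's implementation: the class BomEncodingSketch and its method are hypothetical, and only the UTF-8 and UTF-16 Byte-Order-Marks are checked.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomEncodingSketch {
    // Returns the encoding implied by a leading Byte-Order-Mark,
    // or the caller's default encoding when no mark is found.
    // UTF-32 marks are deliberately not handled in this sketch.
    static Charset chooseEncoding(byte[] head, Charset defaultEncoding) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return defaultEncoding;
    }

    // True if the first bytes of data equal the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```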

Output Encoding

The filter does not recognize any encoding declarations in the document, and therefore cannot update them.

If the output encoding is UTF-8:

  • If the input encoding was also UTF-8, a Byte-Order-Mark is used for the output document only if one was detected in the input document.
  • If the input encoding was not UTF-8, no Byte-Order-Mark is used in the output document.
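Expressed as code, the two cases reduce to a single condition. The sketch below is hypothetical (the class and method names are not part of Okapi) and covers only the case where the output encoding is UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OutputBomSketch {
    // Applies only when the output encoding is UTF-8.
    // A Byte-Order-Mark is written only if the input was UTF-8 and had one.
    static boolean utf8OutputUsesBom(Charset inputEncoding, boolean inputHadBom) {
        boolean inputWasUtf8 = StandardCharsets.UTF_8.equals(inputEncoding);
        return inputWasUtf8 && inputHadBom;
    }
}
```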

Line-Breaks

The output uses the same type of line-break as the original input.

Parsing

Here is how an input document is parsed:

  1. The filter sets the current search position at the top of the document.
  2. It searches for the first possible rule that has a match from the current position.
  3. It takes the match and applies whatever action is associated with the rule.
  4. It moves the current search position to the end of the match.
  5. Steps 2, 3, and 4 are repeated until no more matches are found or the search position reaches the end of the document.
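The loop above can be sketched with java.util.regex as follows. This is not the filter's source code. It assumes that "the first possible rule" means the rule whose match starts earliest, with the order of the rule list breaking ties. The class ParseLoopSketch is hypothetical, and the action of step 3 is replaced by simply recording the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParseLoopSketch {
    // Returns "ruleIndex:matchedText" for each match, in the order the loop finds them.
    static List<String> scan(String text, List<Pattern> rules) {
        List<String> hits = new ArrayList<>();
        int pos = 0;                                   // step 1: start at the top
        while (pos <= text.length()) {
            int bestRule = -1;
            int bestStart = -1;
            int bestEnd = -1;
            for (int i = 0; i < rules.size(); i++) {   // step 2: look for the earliest match
                Matcher m = rules.get(i).matcher(text);
                // Strict "<" keeps the earlier rule in the list when two matches start together.
                if (m.find(pos) && (bestRule < 0 || m.start() < bestStart)) {
                    bestRule = i;
                    bestStart = m.start();
                    bestEnd = m.end();
                }
            }
            if (bestRule < 0) {
                break;                                 // no rule matches any more
            }
            // step 3: placeholder for the rule's action
            hits.add(bestRule + ":" + text.substring(bestStart, bestEnd));
            // step 4: continue after the match; skip one character after an
            // empty match so the loop always makes progress
            pos = (bestEnd > bestStart) ? bestEnd : bestEnd + 1;
        }
        return hits;
    }
}
```

For example, with the rules `\d+` and `[a-z]` (in that order) and the input `a1b22`, the sketch records the matches `a`, `1`, `b`, and `22`, each tagged with the index of the rule that produced it.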

Actions

Each rule is associated with one of several possible actions. Depending on the action, you can associate different parts of the text that matches the rule with a specific role. This is done by assigning capturing groups to four roles: the source group, the target group, the identifier group, and the note group.

A capturing group is a part of the regular expression between parentheses. The capturing group 0 is the whole match, then other capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression (A)(B(C)) there are three groups:

  1. (A)
  2. (B(C))
  3. (C)
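The numbering can be checked with a few lines of Java. The class GroupNumberingDemo below is a hypothetical helper written for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns group 0 (the whole match) followed by each numbered group,
    // or an empty array if the expression does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("(A)(B(C))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // Prints:
        // 0: ABC
        // 1: A
        // 2: BC
        // 3: C
    }
}
```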

The following summary describes what each action does and how it uses each of the groups:

Extract the strings in the source group
  • Effect: Sends a TEXT_UNIT event for each string found in the source capturing group.
  • Source: Must be defined. It is where the string or strings to extract are taken from.
  • Target: Not used.
  • Identifier: If defined, it is the name for the first text unit. If there is more than one string to extract, a sequential number (starting at 2) is appended to it and used as the name of the other text units.
  • Note: If defined, it is the note property associated with each text unit corresponding to each extracted string.

Extract the content of the source group
  • Effect: Sends a single TEXT_UNIT event based on the different capturing groups.
  • Source: Must be defined. It is the source text of the text unit.
  • Target: If defined, it is the target text of the text unit.
  • Identifier: If defined, it is the name of the text unit.
  • Note: If defined, it is the note property associated with the text unit.

Treat the source group as comment
  • Effect: Processes the source capturing group for localization directives (if requested) and leaves the content of the whole expression's match untouched.
  • Source: Must be defined. It is processed for localization directives if that option is set.
  • Target: Not used.
  • Identifier: Not used.
  • Note: Not used.

Do not extract
  • Effect: Leaves the content of the whole expression's match untouched.
  • Source: Not used.
  • Target: Not used.
  • Identifier: Not used.
  • Note: Not used.

Start a section
  • Effect: Sends a START_GROUP event. If the option Auto-close previous section when a new one starts is set, you must not define a corresponding end section. If that option is not set, you must define a rule to close this section.
  • Source: Not used.
  • Target: Not used.
  • Identifier: If defined, it is the name of the section being opened. A section corresponds to a <group> in XLIFF.
  • Note: If defined, it is the note property associated with the section being opened.

End a section
  • Effect: Sends an END_GROUP event.
  • Source: Not used.
  • Target: Not used.
  • Identifier: Not used.
  • Note: Not used.
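
The naming rule for the action Extract the strings in the source group (the identifier names the first text unit, and a sequential number starting at 2 is appended for the others) can be illustrated as follows. The sketch is hypothetical and assumes the number is appended directly, with no separator, which the description does not specify.

```java
import java.util.ArrayList;
import java.util.List;

class TextUnitNamingSketch {
    // Builds the names of the text units created from one match:
    // the first keeps the identifier, the others get 2, 3, ... appended.
    // Appending without a separator is an assumption of this sketch.
    static List<String> names(String identifier, int stringCount) {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= stringCount; i++) {
            names.add(i == 1 ? identifier : identifier + i);
        }
        return names;
    }
}
```

Under that assumption, an identifier `title` with three extracted strings gives the names `title`, `title2`, and `title3`.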

Parameters

Rules Tab

Add — Click this button to add a new rule to the list. This opens the Edit Rule dialog box with the new rule.

Rename — Click this button to rename the rule currently selected. Note that two rules can have the same name, but this is obviously not recommended.

Remove — Click this button to delete the rule currently selected from the list. No confirmation is asked.

Edit — Click this button to edit the rule currently selected. This opens the Edit Rule dialog box.

Move Up — Click this button to move the rule currently selected up in the list. Rules are evaluated in the order of the list.

Move Down — Click this button to move the rule currently selected down in the list. Rules are evaluated in the order of the list.

Rule properties

Preserve white spaces — Set this option to preserve all white spaces of the extracted text. If this option is not set, the extracted content is unwrapped: any sequence of consecutive white spaces is replaced by a single space character, and any white space at the start or the end of the content is trimmed. White spaces here are spaces, tabs, carriage returns, and line-feeds.
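
The unwrapping described above can be approximated with two regular-expression replacements. This is a sketch of the described behavior, not the filter's code, and the class UnwrapSketch is hypothetical.

```java
class UnwrapSketch {
    // White spaces as listed above: space, tab, carriage return, line-feed.
    private static final String WS = "[ \\t\\r\\n]+";

    static String unwrap(String text) {
        // First trim white spaces at the very start and very end of the content,
        // then collapse every remaining run into a single space.
        return text
            .replaceAll("\\A" + WS + "|" + WS + "\\z", "")
            .replaceAll(WS, " ");
    }
}
```

For example, the content `"  Hello \t\r\n  world\n"` becomes `"Hello world"`.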

Has inline codes — Set this option to enable the conversion of some part of the extracted text into inline codes.

Edit Inline Codes Patterns — Click this button to open the Inline Codes Patterns dialog box where you can define rules for converting parts of text into inline codes.

Auto-close previous section when a new one starts — Set this option to automatically close any open section when a new one starts. Sections are defined with the Start a section action. This option allows you to define only the start of sections. If this option is not set, each Start a section action must have a corresponding End a section action.

Regular expressions options

This set of options is used for all rules defined in the list. If you need to override an option for a given rule, use the (?idmsux-idmsux) construct in the pattern for that rule.
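
The effect of such an override can be seen with java.util.regex directly. In the sketch below, GLOBAL_FLAGS stands in for the options set for all rules (here, ignoring case), and the class InlineFlagOverrideDemo is hypothetical.

```java
import java.util.regex.Pattern;

class InlineFlagOverrideDemo {
    // Stand-in for the options that apply to every rule in the list.
    static final int GLOBAL_FLAGS = Pattern.CASE_INSENSITIVE;

    // Compiles the rule's pattern with the global flags and searches the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, GLOBAL_FLAGS).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("abc", "ABC"));       // true: case is ignored globally
        System.out.println(found("(?-i)abc", "ABC"));  // false: (?-i) restores case sensitivity
    }
}
```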

Dot also matches line-feed — Set this option to enable the dot operator to match line-feeds.

Multi-line — Set this option so the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. If this option is not set these expressions only match at the beginning and the end of the entire input sequence.

Ignore case differences — Set this option to ignore differences between letter cases. If this option is set, "abc" is treated as identical to "Abc". If this option is not set, the two strings are treated as different.
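
The three options behave like the DOTALL, MULTILINE, and CASE_INSENSITIVE flags of java.util.regex.Pattern. That mapping is an assumption of the sketch below, based on the descriptions above, and the class RegexOptionsDemo is hypothetical.

```java
import java.util.regex.Pattern;

class RegexOptionsDemo {
    // Combines the three options into java.util.regex flags (assumed mapping).
    static int flags(boolean dotAlsoMatchesLineFeed, boolean multiLine, boolean ignoreCase) {
        int f = 0;
        if (dotAlsoMatchesLineFeed) f |= Pattern.DOTALL;
        if (multiLine) f |= Pattern.MULTILINE;
        if (ignoreCase) f |= Pattern.CASE_INSENSITIVE;
        return f;
    }

    static boolean found(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Dot also matches line-feed
        System.out.println(found("a.b", "a\nb", flags(false, false, false)));       // false
        System.out.println(found("a.b", "a\nb", flags(true, false, false)));        // true
        // Multi-line
        System.out.println(found("^b$", "a\nb\nc", flags(false, false, false)));    // false
        System.out.println(found("^b$", "a\nb\nc", flags(false, true, false)));     // true
        // Ignore case differences
        System.out.println(found("abc", "Abc", flags(false, false, false)));        // false
        System.out.println(found("abc", "Abc", flags(false, false, true)));         // true
    }
}
```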

Options Tab

Localization directives

Use localization directives when they are present — Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored.

Extract items outside the scope of localization directives — Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether items outside localization directives are extracted lets you mark up fewer parts of the source document. This option is enabled only when the Use localization directives when they are present option is set.

Strings

Beginning of string — Enter the character specifying the start of a string. Entering several characters defines several ways to start a string.

End of string — Enter the character specifying the end of a string. If you have defined several beginning characters, you must define an equal number of end characters, and the position of each end character must correspond to the position of its corresponding beginning character.

Escaped characters use back-slash prefix — Set this option if the way to escape a character is to have a back-slash prefix (e.g. \").

Escaped characters are doubled — Set this option if the way to escape a character is to double it (e.g. "").
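
To illustrate how these four settings interact, the following sketch scans a piece of text for strings. It is not the filter's implementation: the class StringScannerSketch and its parameters are hypothetical, escape sequences are kept as written in the result, and unterminated strings are dropped, both of which are choices of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class StringScannerSketch {
    // begins and ends are paired by position: begins.charAt(k) opens a string
    // that ends.charAt(k) closes.
    static List<String> extract(String text, String begins, String ends,
                                boolean backslashEscape, boolean doubledEscape) {
        if (begins.length() != ends.length()) {
            throw new IllegalArgumentException("begins and ends must have the same length");
        }
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int kind = begins.indexOf(text.charAt(i));
            if (kind < 0) {          // not a string start: skip this character
                i++;
                continue;
            }
            char end = ends.charAt(kind);
            StringBuilder sb = new StringBuilder();
            int j = i + 1;
            boolean closed = false;
            while (j < text.length()) {
                char c = text.charAt(j);
                if (backslashEscape && c == '\\' && j + 1 < text.length()) {
                    sb.append(c).append(text.charAt(j + 1));   // keep the escape as written
                    j += 2;
                } else if (c == end) {
                    if (doubledEscape && j + 1 < text.length() && text.charAt(j + 1) == end) {
                        sb.append(c).append(c);                // doubled end character: escaped
                        j += 2;
                    } else {
                        closed = true;                         // real end of the string
                        j++;
                        break;
                    }
                } else {
                    sb.append(c);
                    j++;
                }
            }
            if (closed) {
                found.add(sb.toString());
            }
            i = j;                   // resume scanning after the string
        }
        return found;
    }
}
```

With `"` and `'` as both beginning and end characters and the back-slash option set, the text `x = "a \" b" + 'c'` yields the two strings `a \" b` and `c`.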

Content type

MIME type of the document — Enter the MIME type value to use when extracting content with these parameters. The value is used to identify the type of document. It may also change the way the text is written back into the original format. Most of the time text/plain should be fine.

Limitations

  • The whole document is loaded in memory to apply the regular expressions. This may cause issues with very large documents.
  • The option Extract strings outside the rules is not yet implemented.