Okapi Framework - Filters - Regex Filter
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://okapiframework.org/wiki/index.php?title=Regex_Filter
The Regex Filter is an Okapi component that implements the IFilter
interface for any text-based format where the
text can be captured using regular expressions. The filter is implemented in the class
net.sf.okapi.filters.regex.RegexFilter of the Okapi library.
The filter can work with any text-based document. You define rules with regular expressions that indicate which parts of the document to process. Each rule is associated with an action that tells the filter what to do with the parts matching its regular expression. Capturing groups in the regular expression allow you to have the action do different things with different sections of the matched text.
For example, if you have the following input document:
[ID1]=Text for ID1
[ID2]:Text for ID2
...and a rule with the following regular expression:
^\[(.*?)](=|:)(.*?)$
...and that rule is set to the action "Extract the content", with capturing group 3 assigned to the source group and capturing group 1 assigned to the identifier group.
Then the filter extracts "Text for ID1" and "Text for ID2" as source text, with ID1 and ID2 as their respective identifiers.
And if you were to represent the parsed information in XLIFF, it would look something like this:
...
<body>
 <trans-unit id="1" resname="ID1" xml:space="preserve">
  <source xml:lang="EN-US">Text for ID1</source>
 </trans-unit>
 <trans-unit id="2" resname="ID2" xml:space="preserve">
  <source xml:lang="EN-US">Text for ID2</source>
 </trans-unit>
</body>
...
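The example above can be approximated with plain java.util.regex. This is only a sketch: it does not use the Okapi RegexFilter API, the class and method names are hypothetical, and it assumes the expression is applied with the multi-line option so that ^ and $ match at the start and end of each line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, not the Okapi RegexFilter API.
final class ExtractRuleSketch {

    // The expression of the example rule. MULTILINE is an assumption made here
    // so that ^ and $ match at the start and end of each line.
    static final Pattern RULE =
        Pattern.compile("^\\[(.*?)](=|:)(.*?)$", Pattern.MULTILINE);

    // Returns identifier -> source text, in document order.
    // Group 1 plays the role of the identifier group, group 3 the source group.
    static Map<String, String> extract(String document) {
        Map<String, String> units = new LinkedHashMap<>();
        Matcher m = RULE.matcher(document);
        while (m.find()) {
            units.put(m.group(1), m.group(3));
        }
        return units;
    }

    public static void main(String[] args) {
        String document = "[ID1]=Text for ID1\n[ID2]:Text for ID2";
        extract(document).forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

Group 2 (the = or : separator) is matched but not assigned to any role, so it is simply skipped here.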
The filter decides which encoding to use for the input document using the following logic:
The filter does not recognize any encoding declarations in the document, and therefore cannot update them.
If the output encoding is UTF-8:
The output uses the same type of line-break as the original input.
Here is how an input document is parsed:
Each rule is associated with one of several possible actions. Depending on the action, you can associate different parts of the text that matches the rule with a specific role. This is done with capturing groups: the source group, the target group, the identifier group, and the note group.
A capturing group is a part of the regular expression between
parentheses. The capturing group 0 is the whole match, then other capturing
groups are numbered by counting their opening parentheses from left to right.
For example, in the expression (A)(B(C)) there are three groups:
Group 1: (A)
Group 2: (B(C))
Group 3: (C)
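The numbering described above can be checked with java.util.regex, which follows the same left-to-right counting of opening parentheses. The sketch below uses a hypothetical helper, groups, that returns group 0 followed by each numbered group of the first match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: shows how capturing groups are numbered.
final class GroupNumberingSketch {

    // Returns groups 0..n of the first match, or an empty array if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] found = groups("(A)(B(C))", "ABC");
        for (int i = 0; i < found.length; i++) {
            System.out.println("group " + i + " = " + found[i]);
        }
        // group 0 = ABC, group 1 = A, group 2 = BC, group 3 = C
    }
}
```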
The following table summarizes what each action does and which groups it may use:
Add -- Click this button to add a new rule to the list. This opens the Edit Rule dialog box with the new rule.
Rename -- Click this button to rename the rule currently selected. Note that two rules can have the same name, but this is obviously not recommended.
Remove -- Click this button to delete the rule currently selected from the list. No confirmation is asked.
Edit -- Click this button to edit the rule currently selected. This opens the Edit Rule dialog box.
Move Up -- Click this button to move the rule currently selected up in the list. Rules are evaluated in the order of the list.
Move Down -- Click this button to move the rule currently selected down in the list. Rules are evaluated in the order of the list.
Preserve white spaces -- Set this option to preserve all white spaces of the extracted text. If this option is not set, the extracted content is unwrapped: any sequence of consecutive white spaces is replaced by a single space character, and any white space at the start or the end of the content is trimmed. White spaces here are spaces, tabs, carriage returns, and line-feeds.
Has inline codes -- Set this option to enable the conversion of some part of the extracted text into inline codes.
Edit Inline Codes Patterns -- Click this button to open the Inline Codes Patterns dialog box where you can define rules for converting parts of text into inline codes.
Auto-close previous section when a new one starts -- Set this option to automatically close any open section when a new one starts. Sections are defined with the "Start a section" action. This option allows you to define only the starts of sections. If this option is not set, each start of section must have a corresponding end of section.
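The unwrapping performed when the Preserve white spaces option is not set can be sketched as follows. This is an approximation written from the description above, not the filter's actual code; the class and method names are hypothetical.

```java
// Illustration only: approximates the unwrapping described for the
// "Preserve white spaces" option. Not taken from the Okapi source code.
final class UnwrapSketch {

    // Collapses each run of spaces, tabs, carriage returns, and line-feeds
    // into one space, then removes a leading or trailing space if present.
    static String unwrap(String text) {
        String collapsed = text.replaceAll("[ \\t\\r\\n]+", " ");
        return collapsed.replaceAll("^ | $", "");
    }

    // Returns the text unchanged when white spaces are preserved.
    static String process(String text, boolean preserveWhiteSpaces) {
        return preserveWhiteSpaces ? text : unwrap(text);
    }

    public static void main(String[] args) {
        String extracted = "  Line one\r\n\tline   two \n";
        System.out.println("[" + process(extracted, false) + "]");
    }
}
```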
These options apply to all rules defined in the list. If you need
to override an option for a given rule, use the (?idmsux-idmsux)
construct in the pattern for that rule.
Dot also matches line-feed -- Set this option to enable the dot operator to match line-feeds.
Multi-line -- Set this option so the expressions ^
and $ match just after or just before, respectively, a line
terminator or the end of the input sequence. If this option is not set these
expressions only match at the beginning and the end of the entire input
sequence.
Ignore case differences -- Set this option to ignore differences between letter cases. If this option is set, "abc" is treated as identical to "Abc". If this option is not set, the two strings are treated as different.
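The effect of the three options above, and of overriding them inside a single pattern with the (?idmsux-idmsux) construct, can be tried out with java.util.regex. The sketch assumes that the options correspond to the DOTALL, MULTILINE, and CASE_INSENSITIVE flags of java.util.regex.Pattern; the helper names are hypothetical and the code does not use the Okapi API.

```java
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, not the Okapi RegexFilter API.
final class RegexOptionsSketch {

    // Hypothetical helper: turns the three options into Pattern flags.
    static int toFlags(boolean dotAll, boolean multiLine, boolean ignoreCase) {
        int flags = 0;
        if (dotAll) flags |= Pattern.DOTALL;
        if (multiLine) flags |= Pattern.MULTILINE;
        if (ignoreCase) flags |= Pattern.CASE_INSENSITIVE;
        return flags;
    }

    // True if the expression matches anywhere in the input under the given flags.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        int none = toFlags(false, false, false);

        // Dot also matches line-feed
        System.out.println(found("a.b", none, "a\nb"));                        // false
        System.out.println(found("a.b", toFlags(true, false, false), "a\nb")); // true

        // Multi-line
        System.out.println(found("^b$", none, "a\nb"));                        // false
        System.out.println(found("^b$", toFlags(false, true, false), "a\nb")); // true

        // Ignore case differences
        System.out.println(found("abc", none, "Abc"));                         // false
        System.out.println(found("abc", toFlags(false, false, true), "Abc"));  // true

        // Per-rule override: (?-i) switches case-insensitivity off for this
        // pattern even though the option is set for all rules.
        System.out.println(found("(?-i)abc", toFlags(false, false, true), "Abc")); // false
    }
}
```

In the same way, a pattern starting with (?s) turns on dot-matches-line-feed for that rule only.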
Use localization directives when they are present -- Set this option to enable the filter to recognize localization directives. If this option is not set, any localization directive in the input file will be ignored.
Extract items outside the scope of localization directives -- Set this option to extract any translatable item that is not within the scope of a localization directive. Choosing whether or not to extract outside localization directives lets you mark up fewer parts of the source document. This option is enabled only when the "Use localization directives when they are present" option is set.
Extract strings outside the rules -- Set this option to extract all strings that are outside the scope of all the defined rules. NOT IMPLEMENTED YET.
Beginning of string -- Enter the character specifying the start of a string. Entering several characters defines several ways to start a string.
End of string -- Enter the character specifying the end of a string. If you have defined several beginning characters, you must define an equal number of end characters, and the position of each end character must correspond to the position of its corresponding beginning character.
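The positional pairing between beginning and end characters can be pictured with a small lookup. This is a hypothetical sketch of the rule stated above, not code from the filter; the class, method, and sample delimiter values are assumptions made for illustration.

```java
// Illustration only: pairs each beginning character with the end character
// at the same position. Names and sample values are hypothetical.
final class StringDelimitersSketch {

    // Returns the end character paired with the given beginning character,
    // or -1 if that character does not start a string.
    static int endFor(String beginnings, String endings, char start) {
        if (beginnings.length() != endings.length()) {
            throw new IllegalArgumentException(
                "Beginning and end characters must be defined in equal numbers");
        }
        int index = beginnings.indexOf(start);
        return index < 0 ? -1 : endings.charAt(index);
    }

    public static void main(String[] args) {
        String beginnings = "\"<"; // strings may start with " or <
        String endings = "\">";    // and end with " or > respectively
        System.out.println((char) endFor(beginnings, endings, '<'));
    }
}
```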
MIME type of the document -- Enter the MIME type value to use
when extracting content with these parameters. The value is used to identify the
type of document. It may also change the way the text is written back into the
original format. Most of the time, text/plain should be fine.