Ratel: SRX and Java
If you are using an Okapi Tool after the M9 release, you should be using the wiki online help:
http://www.opentag.com/okapi/wiki/index.php?title=SRX_and_Java
The SRX 2.0 standard is based on the ICU regular expression notation.
Ratel uses Java's regular expressions to implement SRX. One reason is that ICU4J (ICU for Java) does not provide support for ICU regular expressions.
As of version 1.6, Java does not support some of the Unicode-enabled features described in ICU. For example, in Java "\w" means "[a-zA-Z_0-9]", not "[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]" as in ICU. Some ICU features can be replaced by an equivalent expression in Java, but others simply cannot be implemented in Java.
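
The "\w" difference described above can be checked directly. The following is a minimal sketch using only `java.util.regex.Pattern`; the class name `WordClassDemo` and its helper methods are hypothetical names chosen for illustration and are not part of Ratel or Okapi.

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // ICU's definition of a word character, as quoted in the text above.
    static final String ICU_WORD = "[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";

    // True if the whole string matches Java's default (ASCII-based) word class.
    static boolean matchesJavaWord(String s) {
        return Pattern.matches("\\w", s);
    }

    // True if the whole string matches the ICU-style word class.
    static boolean matchesIcuWord(String s) {
        return Pattern.matches(ICU_WORD, s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(matchesJavaWord(eAcute)); // false: not in [a-zA-Z_0-9]
        System.out.println(matchesIcuWord(eAcute));  // true: general category Ll
        System.out.println(matchesJavaWord("_"));    // true: underscore is in Java's class
        System.out.println(matchesIcuWord("_"));     // false: underscore is not a letter or digit
    }
}
```

The two definitions differ in both directions: the ICU-style class accepts non-ASCII letters and digits that Java's default class rejects, while Java's class accepts the underscore and the ICU-style class does not.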
The following table shows the ICU and Java differences. The yellow entries denote cases where the ICU expression needs to be mapped to a Java equivalent (sometimes a complex one), and the red entries indicate cases where the ICU expression cannot be mapped in Java.
| ICU Meta Character | Java Equivalent | ICU Description |
|---|---|---|
| `\a` | same | Match a BELL, `\u0007`. |
| `\A` | same | Match at the beginning of the input. Differs from `^` in that `\A` will not match after a new line within the input. |
| `\b` (outside of a set) | `\b` exists but does not have exactly the same behavior. | Match if the current position is a word boundary. Boundaries occur at the transitions between word (`\w`) and non-word (`\W`) characters, with combining marks ignored. |
| `\b` (within a set) | `\b` is invalid within a set. Use `\u0008` instead. | Match a BACKSPACE, `\u0008`. |
| `\B` | `\B` exists but does not have exactly the same behavior. | Match if the current position is not a word boundary. The option UREGEX_UWORD is assumed not to be set (the default). |
| `\cX` | same | Match a control-X character. |
| `\d` | `\d` exists but is ASCII based. Use `[\p{Nd}]` instead. | Match any character with the Unicode General Category of Nd (Number, Decimal Digit). |
| `\D` | `\D` exists but is ASCII based. Use `[^\p{Nd}]` instead. | Match any character that is not a decimal digit. |
| `\e` | same | Match an ESCAPE, `\u001B`. |
| `\E` | same | Terminates a `\Q` ... `\E` quoted sequence. |
| `\f` | same | Match a FORM FEED, `\u000C`. |
| `\G` | same | Match if the current position is at the end of the previous match. |
| `\n` | same | Match a LINE FEED, `\u000A`. |
| `\N{UNICODE CHARACTER NAME}` | Does not exist. | Match the named character. |
| `\p{UNICODE PROPERTY NAME}` | same | Match any character with the specified Unicode Property. |
| `\P{UNICODE PROPERTY NAME}` | same | Match any character not having the specified Unicode Property. |
| `\Q` | same | Quotes all following characters until `\E`. |
| `\r` | same | Match a CARRIAGE RETURN, `\u000D`. |
| `\s` | `\s` exists but is ASCII based (it matches `[ \t\n\x0B\f\r]`). Use `[\t\n\f\r\p{Z}]` instead. | Match a white space character. White space is defined as `[\t\n\f\r\p{Z}]`. |
| `\S` | `\S` exists but is ASCII based. Use `[^\t\n\f\r\p{Z}]` instead. | Match a non-white space character. |
| `\t` | same | Match a HORIZONTAL TABULATION, `\u0009`. |
| `\uhhhh` | same | Match the character with the hex value `hhhh`. |
| `\Uhhhhhhhh` | Does not exist. | Match the character with the hex value `hhhhhhhh`. |
| `\w` | `\w` exists but is ASCII based. Use `[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]` instead. | Match a word character. Word characters are `[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]`. |
| `\W` | `\W` exists but is ASCII based. Use `[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]` instead. | Match a non-word character. |
| `\x{hhhh}` | Does not exist. Use `\uhhhh` instead. | Match the character with hex value `hhhh`. |
| `\xhh` | same | Match the character with two digit hex value `hh`. |
| `\X` | Does not exist. | Match a Grapheme Cluster. |
| `\Z` | same | Match if the current position is at the end of input, but before the final line terminator, if one exists. |
| `\z` | same | Match if the current position is at the end of input. |
| `\0nnn` | same | Match the character with the given octal value. |
| `\n` (where n is a number) | same | Back Reference. Match whatever the nth capturing group matched. n must be ≥ 1 and ≤ the total number of capture groups in the pattern. |
| `[pattern]` | same | Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern. |
| `.` | same | Match any character. |
| `^` | same | Match at the beginning of a line. |
| `$` | same | Match at the end of a line. |
| `\` | same | Quotes the following character. Characters that must be quoted to be treated as literals include `*`, `?`, `+`, `[`, `(`, `)`, `{`, `}`, `^`, `$`, `\|`, `\`, and `.`. |
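
The substitutions given in the Java Equivalent column can be applied mechanically before a rule is compiled. The following is a minimal sketch under stated assumptions, not Ratel's actual implementation: the class `IcuToJavaSketch` and its method `toJava` are hypothetical names. It rewrites only the ASCII-based shorthand classes and `\x{hhhh}` with up to four hex digits, copies `\Q` ... `\E` runs untouched, and does not handle `\b` inside a set or the constructs that have no Java equivalent.

```java
import java.util.regex.Pattern;

public class IcuToJavaSketch {

    // Replacement for an ICU shorthand class, or null if the escape is left as-is.
    private static String replacement(char shorthand) {
        switch (shorthand) {
            case 'd': return "[\\p{Nd}]";
            case 'D': return "[^\\p{Nd}]";
            case 's': return "[\\t\\n\\f\\r\\p{Z}]";
            case 'S': return "[^\\t\\n\\f\\r\\p{Z}]";
            case 'w': return "[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";
            case 'W': return "[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]";
            default:  return null;
        }
    }

    // Rewrites an ICU-style pattern into a form closer to ICU semantics in Java.
    static String toJava(String icu) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < icu.length()) {
            char c = icu.charAt(i);
            if (c != '\\' || i + 1 >= icu.length()) {
                out.append(c); // ordinary character, or a trailing backslash
                i++;
                continue;
            }
            char next = icu.charAt(i + 1);
            if (next == 'Q') {
                // Copy a quoted run verbatim, including its terminator if present.
                int end = icu.indexOf("\\E", i + 2);
                int stop = (end < 0) ? icu.length() : end + 2;
                out.append(icu, i, stop);
                i = stop;
                continue;
            }
            if (next == 'x' && i + 2 < icu.length() && icu.charAt(i + 2) == '{') {
                int close = icu.indexOf('}', i + 3);
                if (close > 0) {
                    String hex = icu.substring(i + 3, close);
                    if (hex.matches("[0-9A-Fa-f]{1,4}")) {
                        // Zero-pad to the four digits Java's backslash-u form requires.
                        out.append("\\u").append("0000".substring(hex.length())).append(hex);
                        i = close + 1;
                        continue;
                    }
                }
                // Longer or malformed values are left unchanged below.
            }
            String mapped = replacement(next);
            out.append(mapped != null ? mapped : "\\" + next);
            i += 2; // consume the backslash and the escaped character together
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String arabicIndicDigits = "\u0663\u0664"; // two characters of category Nd
        System.out.println(Pattern.matches("\\d+", arabicIndicDigits));         // false
        System.out.println(Pattern.matches(toJava("\\d+"), arabicIndicDigits)); // true
    }
}
```

Consuming each backslash together with the character after it keeps an escaped backslash followed by a letter (a literal backslash, then `d`) from being mistaken for a shorthand class. Inside a set, the bracketed replacements become nested character classes, which Java treats as a union.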