Package net.sf.okapi.lib.segmentation
Class SRXSegmenter
- java.lang.Object
-
- net.sf.okapi.lib.segmentation.SRXSegmenter
-
- All Implemented Interfaces:
ISegmenter
public class SRXSegmenter extends Object implements ISegmenter
Implements the ISegmenter interface for SRX rules.
-
-
Constructor Summary
Constructor, Description
- SRXSegmenter()
  Creates a new SRXSegmenter object.
-
Method Summary
Modifier and Type, Method, Description
- protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
  Adds a compiled rule to this segmenter.
- boolean cascade()
  Indicates if cascading must be applied when selecting the rules for a given language pattern.
- int computeSegments(String text)
  Calculates the segmentation of a given plain text string.
- int computeSegments(TextContainer container)
  Calculates the segmentation of a given TextContainer object.
- LocaleId getLanguage()
  Gets the language used to apply the rules.
- Range getNextSegmentRange(TextContainer container)
  Computes the range of the next segment for a given TextContainer object.
- List<Range> getRanges()
  Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
- List<Integer> getSplitPositions()
  Gets the list of all the split positions in the text that was last segmented.
- boolean includeEndCodes()
  Indicates if end codes should be included (See SRX implementation notes).
- boolean includeIsolatedCodes()
  Indicates if isolated codes should be included (See SRX implementation notes).
- boolean includeStartCodes()
  Indicates if start codes should be included (See SRX implementation notes).
- boolean oneSegmentIncludesAll()
  Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- void reset()
  Resets the options to their defaults, and the compiled rules to nothing.
- boolean segmentSubFlows()
  Indicates if sub-flows must be segmented.
- protected void setCascade(boolean value)
  Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
- void setIncludeEndCodes(boolean includeEndCodes)
- void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
- void setIncludeStartCodes(boolean includeStartCodes)
- void setLanguage(LocaleId languageCode)
  Sets the locale used to apply the rules.
- protected void setMaskRule(String pattern)
  Sets the pattern for the mask rule.
- void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
- void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
  Sets the options for this segmenter.
- void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace)
  Sets the options for this segmenter.
- void setSegmentSubFlows(boolean segmentSubFlows)
- void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
- void setTrimCodes(boolean trimCodes)
- void setTrimLeadingWS(boolean trimLeadingWS)
- void setTrimTrailingWS(boolean trimTrailingWS)
- void setUseJavaRegex(boolean useJavaRegex)
  Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
- boolean treatIsolatedCodesAsWhitespace()
  Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
- boolean trimLeadingWhitespaces()
  Indicates if leading white-spaces should be left outside the segments.
- boolean trimTrailingWhitespaces()
  Indicates if trailing white-spaces should be left outside the segments.
- boolean useJavaRegex()
  Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
-
-
-
Method Detail
-
reset
public void reset()
Description copied from interface: ISegmenter
Resets the options to their defaults, and the compiled rules to nothing.
- Specified by:
reset in interface ISegmenter
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace)
Sets the options for this segmenter.
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to have a text with a single segment include the whole text (no trimming of spaces or codes on the left or right).
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
useJavaRegex - true if the rules are for the Java regular expression engine, false if they are for ICU.
treatIsolatedCodesAsWhitespace - if true, the isolated code markers in the coded text are converted to spaces so that they do not get in the way of the rules. If false, the codes are simply removed.
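The effect of treatIsolatedCodesAsWhitespace can be pictured with a minimal sketch. It does not use the Okapi classes: the class IsolatedCodeSketch, the one-character CODE placeholder, and the break rule are all hypothetical stand-ins, and the actual coded-text representation is not described on this page.

```java
import java.util.regex.Pattern;

class IsolatedCodeSketch {
    // Hypothetical one-character placeholder standing in for an isolated inline code.
    static final char CODE = '\uE103';

    // Prepares text for rule matching: each placeholder becomes a space,
    // or is dropped entirely, depending on the flag.
    static String prepare(String codedText, boolean treatAsWhitespace) {
        return codedText.replace(String.valueOf(CODE), treatAsWhitespace ? " " : "");
    }

    // Hypothetical break rule: a period, then white-space, then an uppercase letter.
    static boolean hasBreak(String text) {
        return Pattern.compile("\\.\\s+\\p{Lu}").matcher(text).find();
    }

    public static void main(String[] args) {
        String coded = "First sentence." + CODE + "Second sentence.";
        System.out.println(hasBreak(prepare(coded, true)));  // prints true
        System.out.println(hasBreak(prepare(coded, false))); // prints false
    }
}
```

With the flag set, the code between the two sentences behaves like the space a white-space-based rule expects. With the flag cleared, the sentences are joined directly and the same rule finds no break.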
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
Description copied from interface: ISegmenter
Sets the options for this segmenter.
- Specified by:
setOptions in interface ISegmenter
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to have a text with a single segment include the whole text (no trimming of spaces or codes on the left or right).
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
-
oneSegmentIncludesAll
public boolean oneSegmentIncludesAll()
Description copied from interface: ISegmenter
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Specified by:
oneSegmentIncludesAll in interface ISegmenter
- Returns:
- true if a text with a single segment should include the whole text.
-
segmentSubFlows
public boolean segmentSubFlows()
Description copied from interface: ISegmenter
Indicates if sub-flows must be segmented.
- Specified by:
segmentSubFlows in interface ISegmenter
- Returns:
- true if sub-flows must be segmented, false otherwise.
-
cascade
public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.
- Returns:
- true if cascading must be applied, false otherwise.
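As a rough illustration of what cascading means when selecting rules, the sketch below assumes the usual SRX reading: with cascading, every language map whose pattern matches the language code contributes its rules, in order; without it, only the first match is used. CascadeSketch, LanguageMap, and selectRules are hypothetical names and are not part of SRXSegmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CascadeSketch {
    // Hypothetical stand-in for a language map entry: a pattern and the rule set it selects.
    static final class LanguageMap {
        final String pattern;
        final String ruleName;

        LanguageMap(String pattern, String ruleName) {
            this.pattern = pattern;
            this.ruleName = ruleName;
        }
    }

    // Returns the names of the rule sets selected for the given language code.
    static List<String> selectRules(List<LanguageMap> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LanguageMap map : maps) {
            if (Pattern.matches(map.pattern, languageCode)) {
                selected.add(map.ruleName);
                if (!cascade) {
                    break; // without cascading, the first matching map wins
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LanguageMap> maps = Arrays.asList(
            new LanguageMap("fr.*", "French"),
            new LanguageMap("en.*", "English"),
            new LanguageMap(".*", "Default"));
        System.out.println(selectRules(maps, "fr-CA", true));  // prints [French, Default]
        System.out.println(selectRules(maps, "fr-CA", false)); // prints [French]
    }
}
```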
-
trimLeadingWhitespaces
public boolean trimLeadingWhitespaces()
Description copied from interface: ISegmenter
Indicates if leading white-spaces should be left outside the segments.
- Specified by:
trimLeadingWhitespaces in interface ISegmenter
- Returns:
- true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
public boolean trimTrailingWhitespaces()
Description copied from interface: ISegmenter
Indicates if trailing white-spaces should be left outside the segments.
- Specified by:
trimTrailingWhitespaces in interface ISegmenter
- Returns:
- true if the trailing white-spaces should be trimmed.
-
useJavaRegex
public boolean useJavaRegex()
Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Returns:
- true if the rules are for the Java regular expression engine, false if they are for ICU.
-
treatIsolatedCodesAsWhitespace
public boolean treatIsolatedCodesAsWhitespace()
Description copied from interface: ISegmenter
Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
- Specified by:
treatIsolatedCodesAsWhitespace in interface ISegmenter
- Returns:
- true if the segmenter should treat isolated codes as whitespace.
-
setUseJavaRegex
public void setUseJavaRegex(boolean useJavaRegex)
Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Parameters:
useJavaRegex - true if the rules should be treated as Java regular expressions, false for ICU.
-
includeStartCodes
public boolean includeStartCodes()
Description copied from interface: ISegmenter
Indicates if start codes should be included (See SRX implementation notes).
- Specified by:
includeStartCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
-
includeEndCodes
public boolean includeEndCodes()
Description copied from interface: ISegmenter
Indicates if end codes should be included (See SRX implementation notes).
- Specified by:
includeEndCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
-
includeIsolatedCodes
public boolean includeIsolatedCodes()
Description copied from interface: ISegmenter
Indicates if isolated codes should be included (See SRX implementation notes).
- Specified by:
includeIsolatedCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
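The three include options decide on which side of a break an inline code ends up. The following is only a sketch of that idea, not the library's implementation: CodePlacementSketch, the one-character CODE placeholder, and adjustBreak are hypothetical, and the sketch assumes the break is first located after any codes that directly precede it.

```java
class CodePlacementSketch {
    // Hypothetical one-character placeholder standing in for an inline code.
    static final char CODE = '\uE102';

    // Given a break position that falls just after a run of codes, returns the
    // position to split at: unchanged if the codes stay in the left segment,
    // or moved before the codes if they belong to the next segment.
    static int adjustBreak(String coded, int breakPos, boolean includeCodes) {
        if (includeCodes) {
            return breakPos;
        }
        int pos = breakPos;
        while (pos > 0 && coded.charAt(pos - 1) == CODE) {
            pos--;
        }
        return pos;
    }

    public static void main(String[] args) {
        String coded = "End." + CODE + " Next";
        System.out.println(adjustBreak(coded, 5, true));  // prints 5: the code stays on the left
        System.out.println(adjustBreak(coded, 5, false)); // prints 4: the code moves to the next segment
    }
}
```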
-
computeSegments
public int computeSegments(String text)
Description copied from interface: ISegmenter
Calculates the segmentation of a given plain text string.
- Specified by:
computeSegments in interface ISegmenter
- Parameters:
text - plain text to segment.
- Returns:
- the number of segments calculated.
-
computeSegments
public int computeSegments(TextContainer container)
Description copied from interface: ISegmenter
Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
- Specified by:
computeSegments in interface ISegmenter
- Parameters:
container - the object to segment.
- Returns:
- the number of segments calculated.
-
getNextSegmentRange
public Range getNextSegmentRange(TextContainer container)
Description copied from interface: ISegmenter
Computes the range of the next segment for a given TextContainer object. The search for the next segment starts at the first character after the last segment marker found in the container.
- Specified by:
getNextSegmentRange in interface ISegmenter
- Parameters:
container - the text container in which to look for the next segment.
- Returns:
- a range corresponding to the start and end position of the found segment, or null if no more segments are found.
-
getSplitPositions
public List<Integer> getSplitPositions()
Description copied from interface: ISegmenter
Gets the list of all the split positions in the text that was last segmented. You must call ISegmenter.computeSegments(TextContainer) or ISegmenter.computeSegments(String) before calling this method. A split position is the first character position of a new segment.
IMPORTANT: The positions returned here do NOT take into account the options for trimming leading and trailing white-spaces.
- Specified by:
getSplitPositions in interface ISegmenter
- Returns:
- a list of integers where each value is a split position in the coded text that was segmented.
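To make the note about trimming concrete, here is a minimal sketch of how raw split positions could be turned into segment ranges, with or without white-space trimming. It does not call the Okapi API: SplitPositionSketch and toRanges are hypothetical, ranges are modeled as int pairs with an exclusive end (an assumption), and the split position used is an illustrative value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitPositionSketch {
    // Turns split positions into {start, end} pairs (end exclusive),
    // optionally trimming white-space at both ends of each segment.
    static List<int[]> toRanges(String text, List<Integer> splits, boolean trim) {
        List<int[]> ranges = new ArrayList<>();
        List<Integer> bounds = new ArrayList<>(splits);
        bounds.add(text.length()); // the last segment runs to the end of the text
        int start = 0;
        for (int end : bounds) {
            int s = start;
            int e = end;
            if (trim) {
                while (s < e && Character.isWhitespace(text.charAt(s))) s++;
                while (e > s && Character.isWhitespace(text.charAt(e - 1))) e--;
            }
            if (e > s) {
                ranges.add(new int[] {s, e}); // skip segments that end up empty
            }
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        String text = "One. Two.";
        List<Integer> splits = Arrays.asList(4); // hypothetical split right after "One."
        for (int[] r : toRanges(text, splits, true)) {
            System.out.println(r[0] + "-" + r[1] + ": \"" + text.substring(r[0], r[1]) + "\"");
        }
        // prints:
        // 0-4: "One."
        // 5-9: "Two."
    }
}
```

The split position stays at 4 whether or not trimming is requested. Only the derived range for the second segment moves from 4 to 5 when leading white-space is trimmed.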
-
getRanges
public List<Range> getRanges()
Description copied from interface: ISegmenter
Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
- Specified by:
getRanges in interface ISegmenter
- Returns:
- the list of all segment ranges. Each range is stored in a Range object where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.
-
getLanguage
public LocaleId getLanguage()
Description copied from interface: ISegmenter
Gets the language used to apply the rules.
- Specified by:
getLanguage in interface ISegmenter
- Returns:
- the language code used to apply the rules, or null, if none has been specified.
-
setLanguage
public void setLanguage(LocaleId languageCode)
Description copied from interface: ISegmenter
Sets the locale used to apply the rules.
- Specified by:
setLanguage in interface ISegmenter
- Parameters:
languageCode - code of the language to use to apply the rules.
-
setCascade
protected void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
- Parameters:
value - true if cascading must be applied, false otherwise.
-
addRule
protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
Adds a compiled rule to this segmenter.
- Parameters:
compiledRule - the compiled rule to add.
-
setMaskRule
protected void setMaskRule(String pattern)
Sets the pattern for the mask rule.
- Parameters:
pattern - the new pattern to use for the mask rule.
-
setSegmentSubFlows
public void setSegmentSubFlows(boolean segmentSubFlows)
- Specified by:
setSegmentSubFlows in interface ISegmenter
-
setIncludeStartCodes
public void setIncludeStartCodes(boolean includeStartCodes)
- Specified by:
setIncludeStartCodes in interface ISegmenter
-
setIncludeEndCodes
public void setIncludeEndCodes(boolean includeEndCodes)
- Specified by:
setIncludeEndCodes in interface ISegmenter
-
setIncludeIsolatedCodes
public void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
- Specified by:
setIncludeIsolatedCodes in interface ISegmenter
-
setOneSegmentIncludesAll
public void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
- Specified by:
setOneSegmentIncludesAll in interface ISegmenter
-
setTrimLeadingWS
public void setTrimLeadingWS(boolean trimLeadingWS)
- Specified by:
setTrimLeadingWS in interface ISegmenter
-
setTrimTrailingWS
public void setTrimTrailingWS(boolean trimTrailingWS)
- Specified by:
setTrimTrailingWS in interface ISegmenter
-
setTrimCodes
public void setTrimCodes(boolean trimCodes)
- Specified by:
setTrimCodes in interface ISegmenter
-
setTreatIsolatedCodesAsWhitespace
public void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
- Specified by:
setTreatIsolatedCodesAsWhitespace in interface ISegmenter
-
-