Package net.sf.okapi.lib.segmentation
Class SRXSegmenter
- java.lang.Object
-
- net.sf.okapi.lib.segmentation.SRXSegmenter
-
- All Implemented Interfaces:
ISegmenter
public class SRXSegmenter extends Object implements ISegmenter
Implements the ISegmenter interface for SRX rules.
-
-
Constructor Summary
Constructor - Description
SRXSegmenter() - Creates a new SRXSegmenter object.
-
Method Summary
Modifier and Type Method - Description
protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule) - Adds a compiled rule to this segmenter.
boolean cascade() - Indicates if cascading must be applied when selecting the rules for a given language pattern.
int computeSegments(String text) - Calculates the segmentation of a given plain text string.
int computeSegments(TextContainer container) - Calculates the segmentation of a given TextContainer object.
LocaleId getLanguage() - Gets the language used to apply the rules.
Range getNextSegmentRange(TextContainer container) - Computes the range of the next segment for a given TextContainer object.
List<Range> getRanges() - Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
List<Integer> getSplitPositions() - Gets the list of all the split positions in the text that was last segmented.
boolean includeEndCodes() - Indicates if end codes should be included (see SRX implementation notes).
boolean includeIsolatedCodes() - Indicates if isolated codes should be included (see SRX implementation notes).
boolean includeStartCodes() - Indicates if start codes should be included (see SRX implementation notes).
boolean oneSegmentIncludesAll() - Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void reset() - Resets the options to their defaults, and the compiled rules to nothing.
boolean segmentSubFlows() - Indicates if sub-flows must be segmented.
protected void setCascade(boolean value) - Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
void setIncludeEndCodes(boolean includeEndCodes)
void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
void setIncludeStartCodes(boolean includeStartCodes)
void setLanguage(LocaleId languageCode) - Sets the locale used to apply the rules.
protected void setMaskRule(String pattern) - Sets the pattern for the mask rule.
void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS) - Sets the options for this segmenter.
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace) - Sets the options for this segmenter.
void setSegmentSubFlows(boolean segmentSubFlows)
void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
void setTrimCodes(boolean trimCodes)
void setTrimLeadingWS(boolean trimLeadingWS)
void setTrimTrailingWS(boolean trimTrailingWS)
void setUseJavaRegex(boolean useJavaRegex) - Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
boolean treatIsolatedCodesAsWhitespace() - Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
boolean trimLeadingWhitespaces() - Indicates if leading white-spaces should be left outside the segments.
boolean trimTrailingWhitespaces() - Indicates if trailing white-spaces should be left outside the segments.
boolean useJavaRegex() - Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
-
-
-
Method Detail
-
reset
public void reset()
Description copied from interface: ISegmenter
Resets the options to their defaults, and the compiled rules to nothing.
- Specified by:
reset in interface ISegmenter
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS, boolean useJavaRegex, boolean useIcu4JBreakRules, boolean treatIsolatedCodesAsWhitespace)
Sets the options for this segmenter.
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
useJavaRegex - true if the rules are for the Java regular expression engine, false if they are for ICU.
treatIsolatedCodesAsWhitespace - if true, the isolated code markers in the coded text are converted to spaces so that they do not get in the way of the rules. If false, the codes are simply removed.
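The difference between the two values of treatIsolatedCodesAsWhitespace can be pictured with a small stand-alone sketch. It does not call Okapi. It assumes a simplified coded text in which each isolated code is a two-character sequence: a marker character (U+E103 here, an assumption) followed by one index character. The class name IsolatedCodeSketch and the method prepare are hypothetical, and the real implementation may differ, for example in how it keeps positions aligned with the original text.

```java
public class IsolatedCodeSketch {
    // Assumed marker for an isolated code; a stand-in for the real coded-text format.
    static final char ISOLATED_MARKER = '\uE103';

    // Returns the text the rules would see in this simplified model:
    // each two-character isolated code (marker + index character)
    // becomes one space, or is dropped entirely.
    static String prepare(String codedText, boolean treatAsWhitespace) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char c = codedText.charAt(i);
            if (c == ISOLATED_MARKER && i + 1 < codedText.length()) {
                if (treatAsWhitespace) {
                    sb.append(' ');
                }
                i++; // skip the index character that follows the marker
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String coded = "End." + ISOLATED_MARKER + "\uE110" + "Next.";
        System.out.println(prepare(coded, true));  // End. Next.
        System.out.println(prepare(coded, false)); // End.Next.
    }
}
```

With the code converted to a space, a rule that expects white-space after a full stop can match between the two sentences; with the code removed, the same rule would see "End.Next." and might not break.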
-
setOptions
public void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
Description copied from interface: ISegmenter
Sets the options for this segmenter.
- Specified by:
setOptions in interface ISegmenter
- Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
-
oneSegmentIncludesAll
public boolean oneSegmentIncludesAll()
Description copied from interface: ISegmenter
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Specified by:
oneSegmentIncludesAll in interface ISegmenter
- Returns:
- true if a text with a single segment should include the whole text.
-
segmentSubFlows
public boolean segmentSubFlows()
Description copied from interface: ISegmenter
Indicates if sub-flows must be segmented.
- Specified by:
segmentSubFlows in interface ISegmenter
- Returns:
- true if sub-flows must be segmented, false otherwise.
-
cascade
public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.
- Returns:
- true if cascading must be applied, false otherwise.
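As a rough illustration of what cascading means for rule selection, the sketch below follows the general SRX idea: with cascading, every language map whose pattern matches the language contributes its rules, in order; without it, only the first matching map is used. This page does not show how SRXSegmenter does this internally, so the class CascadeSketch, the method selectRuleSets, and the sample patterns and rule-set names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CascadeSketch {
    // Each entry is {languagePattern, ruleSetName}; the values are made up for illustration.
    static final String[][] LANGUAGE_MAPS = {
        {"en.*", "English"},
        {"fr.*", "French"},
        {".*", "Default"}
    };

    // Returns the names of the rule sets that apply to the given language code.
    static List<String> selectRuleSets(String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (String[] map : LANGUAGE_MAPS) {
            if (Pattern.matches(map[0], languageCode)) {
                selected.add(map[1]);
                if (!cascade) {
                    break; // without cascading, only the first matching map is used
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println(selectRuleSets("en-US", true));  // [English, Default]
        System.out.println(selectRuleSets("en-US", false)); // [English]
    }
}
```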
-
trimLeadingWhitespaces
public boolean trimLeadingWhitespaces()
Description copied from interface: ISegmenter
Indicates if leading white-spaces should be left outside the segments.
- Specified by:
trimLeadingWhitespaces in interface ISegmenter
- Returns:
- true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
public boolean trimTrailingWhitespaces()
Description copied from interface: ISegmenter
Indicates if trailing white-spaces should be left outside the segments.
- Specified by:
trimTrailingWhitespaces in interface ISegmenter
- Returns:
- true if the trailing white-spaces should be trimmed.
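To make "left outside the segments" concrete, here is a minimal stand-alone sketch that narrows a segment range so that leading or trailing white-space falls outside it. It is not the library's code: the class TrimSketch and the method trimRange are hypothetical, and using Character.isWhitespace as the definition of white-space is an assumption.

```java
public class TrimSketch {
    // Narrows the range [start, end) so that leading and/or trailing
    // white-space falls outside it. Returns {newStart, newEnd}.
    static int[] trimRange(String text, int start, int end,
                           boolean trimLeading, boolean trimTrailing) {
        if (trimLeading) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailing) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "One.  Two. ";
        // Untrimmed second segment: from position 4 to the end of the text.
        int[] r = trimRange(text, 4, text.length(), true, true);
        System.out.println("[" + text.substring(r[0], r[1]) + "]"); // [Two.]
    }
}
```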
-
useJavaRegex
public boolean useJavaRegex()
Indicates if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Returns:
- true if the rules are for the Java regular expression engine, false if they are for ICU.
-
treatIsolatedCodesAsWhitespace
public boolean treatIsolatedCodesAsWhitespace()
Description copied from interface: ISegmenter
Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
- Specified by:
treatIsolatedCodesAsWhitespace in interface ISegmenter
- Returns:
- true if the segmenter should treat isolated codes as whitespace.
-
setUseJavaRegex
public void setUseJavaRegex(boolean useJavaRegex)
Sets the indicator that tells if this document has rules that are defined for the Java regular expression engine (vs ICU).
- Parameters:
useJavaRegex - true if the rules should be treated as Java regular expressions, false for ICU.
-
includeStartCodes
public boolean includeStartCodes()
Description copied from interface: ISegmenter
Indicates if start codes should be included (see SRX implementation notes).
- Specified by:
includeStartCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
-
includeEndCodes
public boolean includeEndCodes()
Description copied from interface: ISegmenter
Indicates if end codes should be included (see SRX implementation notes).
- Specified by:
includeEndCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
-
includeIsolatedCodes
public boolean includeIsolatedCodes()
Description copied from interface: ISegmenter
Indicates if isolated codes should be included (see SRX implementation notes).
- Specified by:
includeIsolatedCodes in interface ISegmenter
- Returns:
- true if they should be included, false otherwise.
-
computeSegments
public int computeSegments(String text)
Description copied from interface: ISegmenter
Calculates the segmentation of a given plain text string.
- Specified by:
computeSegments in interface ISegmenter
- Parameters:
text - plain text to segment.
- Returns:
- the number of segments calculated.
-
computeSegments
public int computeSegments(TextContainer container)
Description copied from interface: ISegmenter
Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
- Specified by:
computeSegments in interface ISegmenter
- Parameters:
container - the object to segment.
- Returns:
- the number of segments calculated.
-
getNextSegmentRange
public Range getNextSegmentRange(TextContainer container)
Description copied from interface: ISegmenter
Computes the range of the next segment for a given TextContainer object. The search for the next segment starts at the first character after the last segment marker found in the container.
- Specified by:
getNextSegmentRange in interface ISegmenter
- Parameters:
container - the text container in which to look for the next segment.
- Returns:
- a range corresponding to the start and end position of the found segment, or null if no more segments are found.
-
getSplitPositions
public List<Integer> getSplitPositions()
Description copied from interface: ISegmenter
Gets the list of all the split positions in the text that was last segmented. You must call ISegmenter.computeSegments(TextContainer) or ISegmenter.computeSegments(String) before calling this method. A split position is the first character position of a new segment.
IMPORTANT: The positions returned here do NOT take into account the options for trimming leading and trailing white-spaces.
- Specified by:
getSplitPositions in interface ISegmenter
- Returns:
- A list of integers where each value is a split position in the coded text that was segmented.
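Since a split position is the first character position of a new segment, a list of split positions can be turned into untrimmed pieces of text as sketched below. The sketch is stand-alone and does not call Okapi: the class SplitPositionsSketch, the method cutAt, and the sample positions are hypothetical, and it assumes the list is sorted and contains neither position 0 nor the end of the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitPositionsSketch {
    // Cuts the text at each split position. Each position is taken as the
    // index of the first character of a new piece. No trimming is applied.
    static List<String> cutAt(String text, List<Integer> splitPositions) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int pos : splitPositions) {
            pieces.add(text.substring(start, pos));
            start = pos;
        }
        pieces.add(text.substring(start)); // remainder after the last split
        return pieces;
    }

    public static void main(String[] args) {
        // Positions 5 and 10 are sample values, not output from a real segmenter.
        List<String> pieces = cutAt("One. Two. Three.", Arrays.asList(5, 10));
        for (String piece : pieces) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The pieces keep their surrounding white-space ("One. " rather than "One."), which is consistent with split positions ignoring the trimming options.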
-
getRanges
public List<Range> getRanges()
Description copied from interface: ISegmenter
Gets the list of all segment ranges calculated when calling ISegmenter.computeSegments(String) or ISegmenter.computeSegments(TextContainer).
- Specified by:
getRanges in interface ISegmenter
- Returns:
- the list of all segment ranges. Each range is stored in a Range object where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.
-
getLanguage
public LocaleId getLanguage()
Description copied from interface: ISegmenter
Gets the language used to apply the rules.
- Specified by:
getLanguage in interface ISegmenter
- Returns:
- the language code used to apply the rules, or null, if none has been specified.
-
setLanguage
public void setLanguage(LocaleId languageCode)
Description copied from interface: ISegmenter
Sets the locale used to apply the rules.
- Specified by:
setLanguage in interface ISegmenter
- Parameters:
languageCode - code of the language to use when applying the rules.
-
setCascade
protected void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
- Parameters:
value - true if cascading must be applied, false otherwise.
-
addRule
protected void addRule(net.sf.okapi.lib.segmentation.CompiledRule compiledRule)
Adds a compiled rule to this segmenter.
- Parameters:
compiledRule - the compiled rule to add.
-
setMaskRule
protected void setMaskRule(String pattern)
Sets the pattern for the mask rule.
- Parameters:
pattern - the new pattern to use for the mask rule.
-
setSegmentSubFlows
public void setSegmentSubFlows(boolean segmentSubFlows)
- Specified by:
setSegmentSubFlows in interface ISegmenter
-
setIncludeStartCodes
public void setIncludeStartCodes(boolean includeStartCodes)
- Specified by:
setIncludeStartCodes in interface ISegmenter
-
setIncludeEndCodes
public void setIncludeEndCodes(boolean includeEndCodes)
- Specified by:
setIncludeEndCodes in interface ISegmenter
-
setIncludeIsolatedCodes
public void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
- Specified by:
setIncludeIsolatedCodes in interface ISegmenter
-
setOneSegmentIncludesAll
public void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
- Specified by:
setOneSegmentIncludesAll in interface ISegmenter
-
setTrimLeadingWS
public void setTrimLeadingWS(boolean trimLeadingWS)
- Specified by:
setTrimLeadingWS in interface ISegmenter
-
setTrimTrailingWS
public void setTrimTrailingWS(boolean trimTrailingWS)
- Specified by:
setTrimTrailingWS in interface ISegmenter
-
setTrimCodes
public void setTrimCodes(boolean trimCodes)
- Specified by:
setTrimCodes in interface ISegmenter
-
setTreatIsolatedCodesAsWhitespace
public void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
- Specified by:
setTreatIsolatedCodesAsWhitespace in interface ISegmenter
-
-