Package net.sf.okapi.common
Interface ISegmenter
-
- All Known Implementing Classes:
SRXSegmenter
public interface ISegmenter
Common methods to provide segmentation facility to extracted content.
-
Method Summary
int computeSegments(String text) - Calculates the segmentation of a given plain text string.
int computeSegments(TextContainer container) - Calculates the segmentation of a given TextContainer object.
LocaleId getLanguage() - Gets the language used to apply the rules.
Range getNextSegmentRange(TextContainer container) - Computes the range of the next segment for a given TextContainer object.
List<Range> getRanges() - Gets the list of all segment ranges calculated when calling computeSegments(String) or computeSegments(TextContainer).
List<Integer> getSplitPositions() - Gets the list of all the split positions in the text that was last segmented.
boolean includeEndCodes() - Indicates if end codes should be included (see SRX implementation notes).
boolean includeIsolatedCodes() - Indicates if isolated codes should be included (see SRX implementation notes).
boolean includeStartCodes() - Indicates if start codes should be included (see SRX implementation notes).
boolean oneSegmentIncludesAll() - Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void reset() - Resets the options to their defaults, and the compiled rules to nothing.
boolean segmentSubFlows() - Indicates if sub-flows must be segmented.
void setIncludeEndCodes(boolean includeEndCodes)
void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
void setIncludeStartCodes(boolean includeStartCodes)
void setLanguage(LocaleId locale) - Sets the locale used to apply the rules.
void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS) - Sets the options for this segmenter.
void setSegmentSubFlows(boolean segmentSubFlows)
void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
void setTrimCodes(boolean trimCodes)
void setTrimLeadingWS(boolean trimLeadingWS)
void setTrimTrailingWS(boolean trimTrailingWS)
boolean treatIsolatedCodesAsWhitespace() - Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
boolean trimLeadingWhitespaces() - Indicates if leading white-spaces should be left outside the segments.
boolean trimTrailingWhitespaces() - Indicates if trailing white-spaces should be left outside the segments.
-
Method Detail
-
computeSegments
int computeSegments(String text)
Calculates the segmentation of a given plain text string.
Parameters:
text - plain text to segment.
Returns:
the number of segments calculated.
-
computeSegments
int computeSegments(TextContainer container)
Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
Parameters:
container - the object to segment.
Returns:
the number of segments calculated.
-
getNextSegmentRange
Range getNextSegmentRange(TextContainer container)
Computes the range of the next segment for a given TextContainer object. The search for the next segment starts at the first character after the last segment marker found in the container.
Parameters:
container - the text container in which to look for the next segment.
Returns:
a range corresponding to the start and end positions of the segment found, or null if no more segments are found.
-
getSplitPositions
List<Integer> getSplitPositions()
Gets the list of all the split positions in the text that was last segmented. You must call computeSegments(TextContainer) or computeSegments(String) before calling this method. A split position is the first character position of a new segment.
IMPORTANT: The positions returned here do NOT take into account the options for trimming leading and trailing white-spaces.
Returns:
a list of integers where each value is a split position in the coded text that was segmented.
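To illustrate what a split position means, here is a minimal sketch that cuts a plain string at a list of split positions. It does not call the Okapi API: the class `SplitPositionsSketch`, the method `cut`, and the sample positions are hypothetical, and the sketch assumes the list holds only the positions of breaks inside the text, in ascending order. No white-space trimming is applied, which matches the note above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Okapi: shows how split positions map to pieces of text.
final class SplitPositionsSketch {

    // Cuts the text at each split position. A split position is the index of the
    // first character of a new segment. Assumes ascending positions inside the text.
    static List<String> cut(String text, List<Integer> splitPositions) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int pos : splitPositions) {
            parts.add(text.substring(start, pos));
            start = pos;
        }
        // Whatever follows the last split position forms the final piece.
        parts.add(text.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String text = "One. Two. Three.";
        // Sample positions chosen by hand for this illustration.
        List<Integer> splits = List.of(5, 10);
        for (String part : cut(text, splits)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Each piece keeps its trailing space here because the positions are untrimmed; a caller that wants trimmed segments would work from the ranges instead.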
-
getRanges
List<Range> getRanges()
Gets the list of all segment ranges calculated when calling computeSegments(String) or computeSegments(TextContainer).
Returns:
the list of all segment ranges. Each range is stored in a Range object, where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.
-
getLanguage
LocaleId getLanguage()
Gets the language used to apply the rules.
Returns:
the language code used to apply the rules, or null if none has been specified.
-
includeEndCodes
boolean includeEndCodes()
Indicates if end codes should be included (see SRX implementation notes).
Returns:
true if they should be included, false otherwise.
-
includeIsolatedCodes
boolean includeIsolatedCodes()
Indicates if isolated codes should be included (see SRX implementation notes).
Returns:
true if they should be included, false otherwise.
-
includeStartCodes
boolean includeStartCodes()
Indicates if start codes should be included (see SRX implementation notes).
Returns:
true if they should be included, false otherwise.
-
reset
void reset()
Resets the options to their defaults, and the compiled rules to nothing.
-
segmentSubFlows
boolean segmentSubFlows()
Indicates if sub-flows must be segmented.
Returns:
true if sub-flows must be segmented, false otherwise.
-
trimLeadingWhitespaces
boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.
Returns:
true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.
Returns:
true if the trailing white-spaces should be trimmed.
-
oneSegmentIncludesAll
boolean oneSegmentIncludesAll()
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
Returns:
true if a text with a single segment should include the whole text.
-
treatIsolatedCodesAsWhitespace
boolean treatIsolatedCodesAsWhitespace()
Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
Returns:
true if the segmenter should treat isolated codes as whitespace.
-
setLanguage
void setLanguage(LocaleId locale)
Sets the locale used to apply the rules.
Parameters:
locale - code of the language to use to apply the rules.
-
setIncludeEndCodes
void setIncludeEndCodes(boolean includeEndCodes)
-
setIncludeIsolatedCodes
void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
-
setIncludeStartCodes
void setIncludeStartCodes(boolean includeStartCodes)
-
setOneSegmentIncludesAll
void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
-
setOptions
void setOptions(boolean segmentSubFlows, boolean includeStartCodes, boolean includeEndCodes, boolean includeIsolatedCodes, boolean oneSegmentIncludesAll, boolean trimLeadingWS, boolean trimTrailingWS)
Sets the options for this segmenter.
Parameters:
segmentSubFlows - true to segment sub-flows, false to not segment them.
includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
oneSegmentIncludesAll - true to include everything in segments that are alone.
trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
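As a rough illustration of what the two trimming options mean for a segment's boundaries, the sketch below shrinks a range so that leading or trailing white-space falls outside it. It does not call the Okapi API and is not the segmenter's actual implementation: `TrimRangeSketch`, `Span`, and `trim` are hypothetical names, `Span` is only a stand-in for a start/end pair (end exclusive), and white-space is detected with `Character.isWhitespace`, which is an assumption.

```java
// Hypothetical illustration, not part of Okapi: effect of the trimming options on a range.
final class TrimRangeSketch {

    // Stand-in for a segment range: start inclusive, end exclusive.
    record Span(int start, int end) {}

    // Moves the boundaries inward so trimmed white-space is left outside the segment.
    static Span trim(String text, Span span, boolean trimLeadingWS, boolean trimTrailingWS) {
        int start = span.start();
        int end = span.end();
        if (trimLeadingWS) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailingWS) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new Span(start, end);
    }

    public static void main(String[] args) {
        String text = "  ab  ";
        Span whole = new Span(0, text.length());
        System.out.println(trim(text, whole, true, true));   // both ends trimmed
        System.out.println(trim(text, whole, false, true));  // only the trailing end trimmed
    }
}
```

With both options set to false the range is returned unchanged, so the white-space stays inside the segment.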
-
setSegmentSubFlows
void setSegmentSubFlows(boolean segmentSubFlows)
-
setTrimCodes
void setTrimCodes(boolean trimCodes)
-
setTrimLeadingWS
void setTrimLeadingWS(boolean trimLeadingWS)
-
setTrimTrailingWS
void setTrimTrailingWS(boolean trimTrailingWS)
-
setTreatIsolatedCodesAsWhitespace
void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)
-
-