Class SRXSegmenter

    • Constructor Detail

      • SRXSegmenter

        public SRXSegmenter()
        Creates a new SRXSegmenter object.
    • Method Detail

      • reset

        public void reset()
        Description copied from interface: ISegmenter
        Resets the options to their defaults, and the compiled rules to nothing.
        Specified by:
        reset in interface ISegmenter
      • setOptions

        public void setOptions(boolean segmentSubFlows,
                               boolean includeStartCodes,
                               boolean includeEndCodes,
                               boolean includeIsolatedCodes,
                               boolean oneSegmentIncludesAll,
                               boolean trimLeadingWS,
                               boolean trimTrailingWS,
                               boolean useJavaRegex,
                               boolean useIcu4JBreakRules,
                               boolean treatIsolatedCodesAsWhitespace)
        Sets the options for this segmenter.
        Parameters:
        segmentSubFlows - true to segment sub-flows, false to not segment them.
        includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
        includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
        includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
        oneSegmentIncludesAll - true if a text with a single segment should include the whole text (no trimming of spaces or codes), false otherwise.
        trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
        trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
        useJavaRegex - true if the rules are for the Java regular expression engine, false if they are for ICU.
        useIcu4JBreakRules - true to use the ICU4J break rules, false otherwise.
        treatIsolatedCodesAsWhitespace - true to convert the isolated code markers in the coded text to spaces, so that they do not get in the way of the rules; false to simply remove the codes.
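        The effect of treatIsolatedCodesAsWhitespace can be pictured with a minimal sketch. It is not Okapi's implementation: the placeholder character ISOLATED and the method prepareForRules are hypothetical stand-ins, and the real coded-text encoding of inline codes is not shown in this section.

```java
final class IsolatedCodeSketch {

    // Hypothetical placeholder for one isolated code; not Okapi's actual marker encoding.
    static final char ISOLATED = '\uFFFC';

    /**
     * Returns the text the rules would be applied to:
     * each isolated code becomes a single space (U+0020) when the flag is true,
     * and is dropped entirely when the flag is false.
     */
    static String prepareForRules(String codedText, boolean treatIsolatedCodesAsWhitespace) {
        String marker = String.valueOf(ISOLATED);
        String replacement = treatIsolatedCodesAsWhitespace ? " " : "";
        return codedText.replace(marker, replacement);
    }
}
```

        With an isolated code sitting between "End." and "Next", a rule that requires white-space after the period can match only when the code is converted to a space.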
      • setOptions

        public void setOptions(boolean segmentSubFlows,
                               boolean includeStartCodes,
                               boolean includeEndCodes,
                               boolean includeIsolatedCodes,
                               boolean oneSegmentIncludesAll,
                               boolean trimLeadingWS,
                               boolean trimTrailingWS)
        Description copied from interface: ISegmenter
        Sets the options for this segmenter.
        Specified by:
        setOptions in interface ISegmenter
        Parameters:
        segmentSubFlows - true to segment sub-flows, false to not segment them.
        includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
        includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
        includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
        oneSegmentIncludesAll - true if a text with a single segment should include the whole text (no trimming of spaces or codes), false otherwise.
        trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
        trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
      • oneSegmentIncludesAll

        public boolean oneSegmentIncludesAll()
        Description copied from interface: ISegmenter
        Indicates if, when there is a single segment in a text, it should include the whole text (no spaces or codes trimmed on the left or right).
        Specified by:
        oneSegmentIncludesAll in interface ISegmenter
        Returns:
        true if a text with a single segment should include the whole text.
      • segmentSubFlows

        public boolean segmentSubFlows()
        Description copied from interface: ISegmenter
        Indicates if sub-flows must be segmented.
        Specified by:
        segmentSubFlows in interface ISegmenter
        Returns:
        true if sub-flows must be segmented, false otherwise.
      • cascade

        public boolean cascade()
        Indicates if cascading must be applied when selecting the rules for a given language pattern.
        Returns:
        true if cascading must be applied, false otherwise.
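        As a rough illustration of cascading, the sketch below follows the usual SRX reading: language maps are tried in order against the language code, and with cascading on every matching map contributes its rules, while with cascading off only the first match is used. The LanguageMap class and selectRuleSets method are hypothetical names introduced for this sketch, not part of this class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CascadeSketch {

    /** Hypothetical stand-in for a language map: a language-code pattern and a rule-set name. */
    static final class LanguageMap {
        final String languagePattern;
        final String ruleSetName;

        LanguageMap(String languagePattern, String ruleSetName) {
            this.languagePattern = languagePattern;
            this.ruleSetName = ruleSetName;
        }
    }

    /**
     * Returns the names of the rule sets selected for a language code.
     * Maps are examined in list order; the pattern must match the whole code.
     */
    static List<String> selectRuleSets(List<LanguageMap> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LanguageMap map : maps) {
            if (Pattern.matches(map.languagePattern, languageCode)) {
                selected.add(map.ruleSetName);
                if (!cascade) {
                    break; // without cascading, the first matching map wins
                }
            }
        }
        return selected;
    }
}
```

        For example, with a map for "[Ff][Rr].*" followed by a catch-all ".*" map, the code "fr-CA" selects both rule sets when cascading is on and only the first one when it is off.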
      • trimLeadingWhitespaces

        public boolean trimLeadingWhitespaces()
        Description copied from interface: ISegmenter
        Indicates if leading white-spaces should be left outside the segments.
        Specified by:
        trimLeadingWhitespaces in interface ISegmenter
        Returns:
        true if the leading white-spaces should be trimmed.
      • trimTrailingWhitespaces

        public boolean trimTrailingWhitespaces()
        Description copied from interface: ISegmenter
        Indicates if trailing white-spaces should be left outside the segments.
        Specified by:
        trimTrailingWhitespaces in interface ISegmenter
        Returns:
        true if the trailing white-spaces should be trimmed.
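        The two trimming options can be read as narrowing a segment's range so that surrounding white-space falls outside it. The sketch below shows that idea on a plain string; trimRange is a hypothetical helper written for illustration, and it uses Character.isWhitespace as an assumed definition of white-space.

```java
final class TrimSketch {

    /**
     * Narrows the range [start, end) of a segment within text.
     * Returns a two-element array {newStart, newEnd}.
     */
    static int[] trimRange(String text, int start, int end,
                           boolean trimLeadingWS, boolean trimTrailingWS) {
        if (trimLeadingWS) {
            // Move the start forward past leading white-space.
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailingWS) {
            // Move the end backward past trailing white-space.
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] { start, end };
    }
}
```

        A range made only of white-space collapses to an empty range when trimming is on, so a caller would need to decide how to handle empty segments.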
      • useJavaRegex

        public boolean useJavaRegex()
        Indicates if the rules used by this segmenter are defined for the Java regular expression engine (vs ICU).
        Returns:
        true if the rules are for the Java regular expression engine, false if they are for ICU.
      • treatIsolatedCodesAsWhitespace

        public boolean treatIsolatedCodesAsWhitespace()
        Description copied from interface: ISegmenter
        Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
        Specified by:
        treatIsolatedCodesAsWhitespace in interface ISegmenter
        Returns:
        true if the segmenter should treat isolated codes as whitespace, false otherwise.
      • setUseJavaRegex

        public void setUseJavaRegex(boolean useJavaRegex)
        Sets the indicator that tells if the rules used by this segmenter are defined for the Java regular expression engine (vs ICU).
        Parameters:
        useJavaRegex - true if the rules should be treated as Java regular expression, false for ICU.
      • includeStartCodes

        public boolean includeStartCodes()
        Description copied from interface: ISegmenter
        Indicates if start codes should be included (See SRX implementation notes).
        Specified by:
        includeStartCodes in interface ISegmenter
        Returns:
        true if they should be included, false otherwise.
      • includeEndCodes

        public boolean includeEndCodes()
        Description copied from interface: ISegmenter
        Indicates if end codes should be included (See SRX implementation notes).
        Specified by:
        includeEndCodes in interface ISegmenter
        Returns:
        true if they should be included, false otherwise.
      • includeIsolatedCodes

        public boolean includeIsolatedCodes()
        Description copied from interface: ISegmenter
        Indicates if isolated codes should be included (See SRX implementation notes).
        Specified by:
        includeIsolatedCodes in interface ISegmenter
        Returns:
        true if they should be included, false otherwise.
      • computeSegments

        public int computeSegments(String text)
        Description copied from interface: ISegmenter
        Calculates the segmentation of a given plain text string.
        Specified by:
        computeSegments in interface ISegmenter
        Parameters:
        text - plain text to segment.
        Returns:
        the number of segments calculated.
      • computeSegments

        public int computeSegments(TextContainer container)
        Description copied from interface: ISegmenter
        Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
        Specified by:
        computeSegments in interface ISegmenter
        Parameters:
        container - the object to segment.
        Returns:
        the number of segments calculated.
      • getNextSegmentRange

        public Range getNextSegmentRange(TextContainer container)
        Description copied from interface: ISegmenter
        Computes the range of the next segment for a given TextContainer object. The next segment is searched from the first character after the last segment marker found in the container.
        Specified by:
        getNextSegmentRange in interface ISegmenter
        Parameters:
        container - the text container in which to look for the next segment.
        Returns:
        a range corresponding to the start and end position of the found segment, or null if no more segments are found.
      • getSplitPositions

        public List<Integer> getSplitPositions()
        Description copied from interface: ISegmenter
        Gets the list of all the split positions in the text that was last segmented. You must call ISegmenter.computeSegments(TextContainer) or ISegmenter.computeSegments(String) before calling this method. A split position is the first character position of a new segment.

        IMPORTANT: The positions returned here do NOT take into account the options for trimming leading and trailing white-spaces.

        Specified by:
        getSplitPositions in interface ISegmenter
        Returns:
        A list of integers where each value is a split position in the coded text that was segmented.
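        To make the notion of a split position concrete, the sketch below computes positions on a plain string with java.util.regex, using one SRX-style break rule given as a "before break" and an "after break" pattern. It is only an illustration under those assumptions: splitPositions and segments are hypothetical helpers, and the real segmenter works on coded text with compiled rule sets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitPositionsSketch {

    /**
     * Returns every position where beforeBreak matches text ending at that position
     * and afterBreak matches text starting at it. Each position is the index of the
     * first character of a new segment, before any white-space trimming.
     */
    static List<Integer> splitPositions(String text, String beforeBreak, String afterBreak) {
        Pattern rule = Pattern.compile("(?:" + beforeBreak + ")(?=" + afterBreak + ")");
        Matcher matcher = rule.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            int position = matcher.end();
            // A split at the very start or end of the text would create an empty segment.
            if (position > 0 && position < text.length()) {
                positions.add(position);
            }
        }
        return positions;
    }

    /** Cuts text at the given ascending split positions, keeping all white-space. */
    static List<String> segments(String text, List<Integer> positions) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int position : positions) {
            result.add(text.substring(start, position));
            start = position;
        }
        result.add(text.substring(start));
        return result;
    }
}
```

        Note that in this sketch the white-space following a sentence-ending mark stays at the start of the next piece, which mirrors the remark that split positions ignore the trimming options.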
      • getLanguage

        public LocaleId getLanguage()
        Description copied from interface: ISegmenter
        Gets the language used to apply the rules.
        Specified by:
        getLanguage in interface ISegmenter
        Returns:
        the language code used to apply the rules, or null if none has been specified.
      • setLanguage

        public void setLanguage(LocaleId languageCode)
        Description copied from interface: ISegmenter
        Sets the locale used to apply the rules.
        Specified by:
        setLanguage in interface ISegmenter
        Parameters:
        languageCode - Code of the language to use to apply the rules.
      • setCascade

        protected void setCascade(boolean value)
        Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
        Parameters:
        value - true if cascading must be applied, false otherwise.
      • setMaskRule

        protected void setMaskRule(String pattern)
        Sets the pattern for the mask rule.
        Parameters:
        pattern - the new pattern to use for the mask rule.
      • setSegmentSubFlows

        public void setSegmentSubFlows(boolean segmentSubFlows)
        Specified by:
        setSegmentSubFlows in interface ISegmenter
      • setIncludeEndCodes

        public void setIncludeEndCodes(boolean includeEndCodes)
        Specified by:
        setIncludeEndCodes in interface ISegmenter
      • setTrimLeadingWS

        public void setTrimLeadingWS(boolean trimLeadingWS)
        Specified by:
        setTrimLeadingWS in interface ISegmenter
      • setTrimTrailingWS

        public void setTrimTrailingWS(boolean trimTrailingWS)
        Specified by:
        setTrimTrailingWS in interface ISegmenter
      • setTrimCodes

        public void setTrimCodes(boolean trimCodes)
        Specified by:
        setTrimCodes in interface ISegmenter