Interface ISegmenter

  • All Known Implementing Classes:
    SRXSegmenter

    public interface ISegmenter
    Common methods to provide segmentation facility to extracted content.
    • Method Detail

      • computeSegments

        int computeSegments(String text)
        Calculates the segmentation of a given plain text string.
        Parameters:
        text - plain text to segment.
        Returns:
        the number of segments calculated.
      • computeSegments

        int computeSegments(TextContainer container)
        Calculates the segmentation of a given TextContainer object. If the content is already segmented, it is un-segmented automatically before being processed.
        Parameters:
        container - the object to segment.
        Returns:
        the number of segments calculated.
      • getNextSegmentRange

        Range getNextSegmentRange(TextContainer container)
        Computes the range of the next segment for a given TextContainer object. The search for the next segment starts at the first character after the last segment marker found in the container.
        Parameters:
        container - the text container where to look for the next segment.
        Returns:
        a range corresponding to the start and end position of the found segment, or null if no more segments are found.
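The null-terminated contract described above suggests a simple polling loop. The following is a minimal sketch only: ToyContainer, nextRange and the "period followed by a space" break rule are hypothetical stand-ins invented for illustration, not the Okapi TextContainer, Range, or SRX rule engine, and the end offset is assumed to be exclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types; none of this is the Okapi implementation.
final class NextRangeSketch {

    // Toy container: the text plus the offset just after the last segment marker.
    static final class ToyContainer {
        final String text;
        int lastSegmentEnd = 0;

        ToyContainer(String text) {
            this.text = text;
        }
    }

    // Returns {start, end} of the next segment (end exclusive), or null when
    // nothing is left after the last segment marker.
    static int[] nextRange(ToyContainer c) {
        int from = c.lastSegmentEnd;
        if (from >= c.text.length()) {
            return null;
        }
        int end = c.text.length();
        // Toy rule (an assumption): break after a period followed by a space.
        for (int i = from; i + 1 < c.text.length(); i++) {
            if (c.text.charAt(i) == '.' && c.text.charAt(i + 1) == ' ') {
                end = i + 1;
                break;
            }
        }
        return new int[] {from, end};
    }

    // Polls nextRange until it returns null, recording each new segment end.
    static List<String> drain(ToyContainer c) {
        List<String> segments = new ArrayList<>();
        int[] range;
        while ((range = nextRange(c)) != null) {
            segments.add(c.text.substring(range[0], range[1]));
            c.lastSegmentEnd = range[1]; // the caller moves the marker forward
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ToyContainer("Hi. Yo. End")));
    }
}
```

In this sketch the caller, not the segmenter, advances the marker after each range, which is why the loop terminates once the marker reaches the end of the text.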
      • getSplitPositions

        List<Integer> getSplitPositions()
        Gets the list of all the split positions in the text that was last segmented. You must call computeSegments(TextContainer) or computeSegments(String) before calling this method. A split position is the first character position of a new segment.

        IMPORTANT: The positions returned here do NOT take into account any options for trimming leading and trailing white-spaces.

        Returns:
        A list of integers where each value is a split position in the coded text that was segmented.
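To make "first character position of a new segment" concrete, here is a minimal sketch. SplitPositionsSketch and its break rule (sentence-ending punctuation followed by a space) are assumptions made for illustration; a real segmenter derives its breaks from its configured rules. Note that each position points at the untrimmed start of the segment, which in this toy rule is the space itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the documented interface.
final class SplitPositionsSketch {

    // Toy rule (an assumption): break after '.', '!' or '?' when the next
    // character is a space. The split position is the index of that space,
    // that is, the first character of the new, untrimmed segment.
    static List<Integer> splitPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            char c = text.charAt(i);
            boolean endsSentence = c == '.' || c == '!' || c == '?';
            if (endsSentence && text.charAt(i + 1) == ' ') {
                positions.add(i + 1);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(splitPositions("One. Two! Three?")); // prints [4, 9]
    }
}
```

A caller that wants trimmed segments would still need to skip the white-space at positions 4 and 9 itself, since the split positions ignore the trimming options.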
      • getRanges

        List<Range> getRanges()
        Gets the list of all segment ranges calculated when calling computeSegments(String) or computeSegments(TextContainer).
        Returns:
        the list of all segment ranges. Each range is stored in a Range object where start is the start and end is the end of the range. Returns null if no ranges have been defined yet.
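The call-order contract (null before any computation, a list afterwards) can be sketched as follows. RangesSketch and ToyRange are hypothetical stand-ins, the break rule is a toy one, and the end offset is assumed to be exclusive; none of this is taken from the actual implementing class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a segmenter; illustrates only the call order.
final class RangesSketch {

    // Toy range with an exclusive end (an assumption of this sketch).
    static final class ToyRange {
        final int start;
        final int end;

        ToyRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    private List<ToyRange> ranges; // stays null until computeSegments is called

    // Toy rule (an assumption): break after a period followed by a space.
    int computeSegments(String text) {
        ranges = new ArrayList<>();
        int start = 0;
        for (int i = 0; i + 1 < text.length(); i++) {
            if (text.charAt(i) == '.' && text.charAt(i + 1) == ' ') {
                ranges.add(new ToyRange(start, i + 1));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            ranges.add(new ToyRange(start, text.length()));
        }
        return ranges.size();
    }

    List<ToyRange> getRanges() {
        return ranges;
    }

    // Maps each range back to the text it covers.
    static List<String> slices(String text, List<ToyRange> ranges) {
        List<String> out = new ArrayList<>();
        for (ToyRange r : ranges) {
            out.add(text.substring(r.start, r.end));
        }
        return out;
    }

    public static void main(String[] args) {
        RangesSketch segmenter = new RangesSketch();
        System.out.println(segmenter.getRanges() == null); // no ranges defined yet
        String text = "A b. C d.";
        int count = segmenter.computeSegments(text);
        System.out.println(count + " " + slices(text, segmenter.getRanges()));
    }
}
```

Callers should therefore check for null before iterating when they cannot guarantee that a computation has already run.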
      • getLanguage

        LocaleId getLanguage()
        Gets the language used to apply the rules.
        Returns:
        the language code used to apply the rules, or null if none has been specified.
      • includeEndCodes

        boolean includeEndCodes()
        Indicates if end codes should be included (See SRX implementation notes).
        Returns:
        true if they should be included, false otherwise.
      • includeIsolatedCodes

        boolean includeIsolatedCodes()
        Indicates if isolated codes should be included (See SRX implementation notes).
        Returns:
        true if they should be included, false otherwise.
      • includeStartCodes

        boolean includeStartCodes()
        Indicates if start codes should be included (See SRX implementation notes).
        Returns:
        true if they should be included, false otherwise.
      • reset

        void reset()
        Resets the options to their defaults, and the compiled rules to nothing.
      • segmentSubFlows

        boolean segmentSubFlows()
        Indicates if sub-flows must be segmented.
        Returns:
        true if sub-flows must be segmented, false otherwise.
      • trimLeadingWhitespaces

        boolean trimLeadingWhitespaces()
        Indicates if leading white-spaces should be left outside the segments.
        Returns:
        true if the leading white-spaces should be trimmed.
      • trimTrailingWhitespaces

        boolean trimTrailingWhitespaces()
        Indicates if trailing white-spaces should be left outside the segments.
        Returns:
        true if the trailing white-spaces should be trimmed.
      • oneSegmentIncludesAll

        boolean oneSegmentIncludesAll()
        Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
        Returns:
        true if a text with a single segment should include the whole text.
      • treatIsolatedCodesAsWhitespace

        boolean treatIsolatedCodesAsWhitespace()
        Indicates if the segmenter should treat each isolated code as a single whitespace character (U+0020) when applying segmentation.
        Returns:
        true if the segmenter should treat isolated codes as whitespace.
      • setLanguage

        void setLanguage(LocaleId locale)
        Sets the locale used to apply the rules.
        Parameters:
        locale - Code of the language to use to apply the rules.
      • setIncludeEndCodes

        void setIncludeEndCodes(boolean includeEndCodes)
      • setIncludeIsolatedCodes

        void setIncludeIsolatedCodes(boolean includeIsolatedCodes)
      • setIncludeStartCodes

        void setIncludeStartCodes(boolean includeStartCodes)
      • setOneSegmentIncludesAll

        void setOneSegmentIncludesAll(boolean oneSegmentIncludesAll)
      • setOptions

        void setOptions(boolean segmentSubFlows,
                        boolean includeStartCodes,
                        boolean includeEndCodes,
                        boolean includeIsolatedCodes,
                        boolean oneSegmentIncludesAll,
                        boolean trimLeadingWS,
                        boolean trimTrailingWS)
        Sets the options for this segmenter.
        Parameters:
        segmentSubFlows - true to segment sub-flows, false to not segment them.
        includeStartCodes - true to include start codes just before a break in the 'left' segment, false to put them in the next segment.
        includeEndCodes - true to include end codes just before a break in the 'left' segment, false to put them in the next segment.
        includeIsolatedCodes - true to include isolated codes just before a break in the 'left' segment, false to put them in the next segment.
        oneSegmentIncludesAll - true if a text with a single segment should include the whole text, with no trimming of spaces or codes.
        trimLeadingWS - true to trim leading white-spaces from the segments, false to keep them.
        trimTrailingWS - true to trim trailing white-spaces from the segments, false to keep them.
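How the trimming flags and oneSegmentIncludesAll might interact can be sketched on plain strings. This is an illustration under stated assumptions, not the behavior of any real implementation: TrimOptionsSketch is hypothetical, the split positions are supplied by the caller, white-space is detected with Character.isWhitespace, and inline codes and sub-flows are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing one plausible reading of three of the options.
final class TrimOptionsSketch {

    // Shrinks [start, end) so that leading and/or trailing white-space is
    // left outside the segment, depending on the two flags.
    static String cut(String text, int start, int end,
                      boolean trimLeadingWS, boolean trimTrailingWS) {
        if (trimLeadingWS) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailingWS) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return text.substring(start, end);
    }

    // Builds segments from untrimmed split positions, then applies the options.
    static List<String> segments(String text, List<Integer> splits,
                                 boolean oneSegmentIncludesAll,
                                 boolean trimLeadingWS, boolean trimTrailingWS) {
        List<String> out = new ArrayList<>();
        if (splits.isEmpty() && oneSegmentIncludesAll) {
            out.add(text); // a lone segment keeps the whole text untouched
            return out;
        }
        int start = 0;
        for (int split : splits) {
            out.add(cut(text, start, split, trimLeadingWS, trimTrailingWS));
            start = split;
        }
        out.add(cut(text, start, text.length(), trimLeadingWS, trimTrailingWS));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> splits = new ArrayList<>();
        splits.add(5); // untrimmed split position in " One. Two. "
        System.out.println(segments(" One. Two. ", splits, false, true, true));
        System.out.println(segments(" One. Two. ", splits, false, false, false));
    }
}
```

With both trimming flags set, the white-space around each sentence falls outside the segments; with both cleared, every character of the text belongs to some segment.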
      • setSegmentSubFlows

        void setSegmentSubFlows(boolean segmentSubFlows)
      • setTrimCodes

        void setTrimCodes(boolean trimCodes)
      • setTrimLeadingWS

        void setTrimLeadingWS(boolean trimLeadingWS)
      • setTrimTrailingWS

        void setTrimTrailingWS(boolean trimTrailingWS)
      • setTreatIsolatedCodesAsWhitespace

        void setTreatIsolatedCodesAsWhitespace(boolean treatIsolatedCodesAsWhitespace)