Class SRXDocument


  • public class SRXDocument
    extends Object
    Provides facilities to load, save, and manage segmentation rules in SRX format. This class also implements several extensions to the standard SRX behavior.
    • Field Detail

      • INLINECODE_PATTERN

        public static final String INLINECODE_PATTERN
        The regular-expression pattern that matches an inline code (both of its special characters).
      • ANYCODE

        public static final String ANYCODE
        Marker for INLINECODE_PATTERN in the given pattern. \Y+ = one or more codes, \Y* = zero or more codes, etc.
        See Also:
        Constant Field Values
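        The marker substitution described above can be sketched with plain java.util.regex. This is an illustration only, not the library's code: the class name, the INLINE_CODE value (one marker character followed by one index character) and the expand helper are assumptions made for the example.

```java
import java.util.regex.Pattern;

class AnyCodeSketch {
    // Assumption: an inline code is one marker character followed by one index character.
    // The marker values are illustrative and are not taken from this page.
    static final String INLINE_CODE = "[\uE101\uE102\uE103].";

    // The marker as it would be written in a rule: a backslash followed by Y.
    static final String ANYCODE = "\\Y";

    // Replaces each marker with a non-capturing group that matches one inline code,
    // so a quantifier written after the marker (+, *) applies to the whole code.
    static String expand(String rulePattern) {
        return rulePattern.replace(ANYCODE, "(?:" + INLINE_CODE + ")");
    }

    // Expands the rule pattern and reports whether it matches somewhere in the text.
    static boolean found(String rulePattern, String text) {
        return Pattern.compile(expand(rulePattern)).matcher(text).find();
    }

    public static void main(String[] args) {
        String withCode = "end.\uE103\uE110 Next";
        System.out.println(found("\\.\\Y+\\s", withCode));    // true: one code after the period
        System.out.println(found("\\.\\Y+\\s", "end. Next")); // false: at least one code is required
        System.out.println(found("\\.\\Y*\\s", "end. Next")); // true: zero codes is allowed
    }
}
```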
      • NOAUTO

        public static final String NOAUTO
        Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SRXDocument

        public SRXDocument()
        Creates an empty SRX document.
    • Method Detail

      • getVersion

        public String getVersion()
        Gets the version of this SRX document.
        Returns:
        the version of this SRX document.
      • hasWarning

        public boolean hasWarning()
        Indicates if a warning was issued last time a document was read.
        Returns:
        true if a warning was issued, false otherwise.
      • getWarning

        public String getWarning()
        Gets the last warning that was issued while loading a document.
        Returns:
        the text of the last warning issued, or an empty string.
      • getHeaderComments

        public String getHeaderComments()
        Gets the comments associated with the header of this document.
        Returns:
        the comments for the header of this document, or null if there are none.
      • setHeaderComments

        public void setHeaderComments​(String text)
        Sets the comments for the header of this document.
        Parameters:
        text - the new comments. Use null or an empty string to remove the comments.
      • getComments

        public String getComments()
        Gets the comments associated with this document.
        Returns:
        the comments for this document, or null if there are none.
      • setComments

        public void setComments​(String text)
        Sets the comments for this document.
        Parameters:
        text - the new comments. Use null or an empty string to remove the comments.
      • resetAll

        public void resetAll()
        Resets the document to its default empty initial state.
      • getAllLanguageRules

        public LinkedHashMap<String,​ArrayList<Rule>> getAllLanguageRules()
        Gets a map of all the language rules in this document.
        Returns:
        a map of all the language rules.
      • getLanguageRules

        public ArrayList<Rule> getLanguageRules​(String ruleName)
        Gets the list of rules for a given <languagerule> element.
        Parameters:
        ruleName - the name of the <languagerule> element to query.
        Returns:
        the list of rules for the given <languagerule> element.
      • getAllLanguagesMaps

        public ArrayList<LanguageMap> getAllLanguagesMaps()
        Gets the list of all the language maps in this document.
        Returns:
        the list of all the language maps.
      • segmentSubFlows

        public boolean segmentSubFlows()
        Indicates if sub-flows must be segmented.
        Returns:
        true if sub-flows must be segmented, false otherwise.
      • setSegmentSubFlows

        public void setSegmentSubFlows​(boolean value)
        Sets the flag indicating if sub-flows must be segmented.
        Parameters:
        value - true if sub-flows must be segmented, false otherwise.
      • cascade

        public boolean cascade()
        Indicates if cascading must be applied when selecting the rules for a given language pattern.
        Returns:
        true if cascading must be applied, false otherwise.
      • setCascade

        public void setCascade​(boolean value)
        Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
        Parameters:
        value - true if cascading must be applied, false otherwise.
      • oneSegmentIncludesAll

        public boolean oneSegmentIncludesAll()
        Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
        Returns:
        true if a text with a single segment should include the whole text.
      • setOneSegmentIncludesAll

        public void setOneSegmentIncludesAll​(boolean value)
        Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
        Parameters:
        value - true if a text with a single segment should include the whole text.
      • useIcu4JBreakRules

        public boolean useIcu4JBreakRules()
        Indicates if this document uses ICU4J break rules.
        Returns:
        true if ICU4J break rules are used, false otherwise.
      • setUseICU4JBreakRules

        public void setUseICU4JBreakRules​(boolean value)
        Sets the indicator that tells if this document uses ICU4J BreakIterator rules. BreakIterator break positions are converted to SRX-like rules and used as default rules for all languages.
        Parameters:
        value - true if ICU4J rules should be used as the default rules, false if no ICU4J rules should be used.
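        To give an idea of what "break positions" are, the sketch below lists sentence boundaries with java.text.BreakIterator, the JDK counterpart of the ICU4J class. It is a stand-in for illustration: it does not use ICU4J, and it does not show how the positions are converted to SRX-like rules. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakPositionsSketch {
    // Collects every sentence boundary offset reported for the text,
    // from the first boundary (0) to the last one (the text length).
    static List<Integer> sentenceBreaks(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> positions = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(sentenceBreaks("Hello world. How are you?", Locale.ENGLISH));
    }
}
```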
      • treatIsolatedCodesAsWhitespace

        public boolean treatIsolatedCodesAsWhitespace()
        Indicates if this document should treat isolated codes as whitespace when matching SRX rules.
        Returns:
        true if isolated codes should be treated as whitespace, false otherwise.
      • setTreatIsolatedCodesAsWhitespace

        public void setTreatIsolatedCodesAsWhitespace​(boolean value)
        Sets the indicator if this document should treat isolated codes as whitespace when matching SRX rules.
        Parameters:
        value - true if isolated codes should be treated as whitespace, false otherwise.
      • trimLeadingWhitespaces

        public boolean trimLeadingWhitespaces()
        Indicates if leading white-spaces should be left outside the segments.
        Returns:
        true if the leading white-spaces should be trimmed.
      • setTrimLeadingWhitespaces

        public void setTrimLeadingWhitespaces​(boolean value)
        Sets the indicator that tells if leading white-spaces should be left outside the segments.
        Parameters:
        value - true if the leading white-spaces should be trimmed.
      • trimTrailingWhitespaces

        public boolean trimTrailingWhitespaces()
        Indicates if trailing white-spaces should be left outside the segments.
        Returns:
        true if the trailing white-spaces should be trimmed.
      • setTrimTrailingWhitespaces

        public void setTrimTrailingWhitespaces​(boolean value)
        Sets the indicator that tells if trailing white-spaces should be left outside the segments.
        Parameters:
        value - true if the trailing white-spaces should be trimmed.
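        The combined effect of the trimming options and oneSegmentIncludesAll() on a segment's boundaries can be modeled as below. This is a simplified sketch on plain strings, not the segmenter's implementation: it ignores inline codes, and the class, method and parameter names are hypothetical.

```java
class SegmentTrimSketch {
    // Returns {start, end} for the candidate segment [start, end) of text.
    // If the text has a single segment and oneSegmentIncludesAll is set, the whole
    // text is kept; otherwise leading/trailing whitespace is left outside on request.
    static int[] adjust(String text, int start, int end, boolean onlySegment,
                        boolean oneSegmentIncludesAll,
                        boolean trimLeading, boolean trimTrailing) {
        if (onlySegment && oneSegmentIncludesAll) {
            return new int[] {0, text.length()};
        }
        if (trimLeading) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailing) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "  Hello.  ";
        int[] r = adjust(text, 0, text.length(), true, false, true, true);
        System.out.println("[" + text.substring(r[0], r[1]) + "]"); // [Hello.]
    }
}
```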
      • includeStartCodes

        public boolean includeStartCodes()
        Indicates if start codes should be included (See SRX implementation notes).
        Returns:
        true if start codes should be included, false otherwise.
      • setIncludeStartCodes

        public void setIncludeStartCodes​(boolean value)
        Sets the indicator that tells if start codes should be included or not. (See SRX implementation notes).
        Parameters:
        value - true if start codes should be included, false otherwise.
      • includeEndCodes

        public boolean includeEndCodes()
        Indicates if end codes should be included (See SRX implementation notes).
        Returns:
        true if end codes should be included, false otherwise.
      • setIncludeEndCodes

        public void setIncludeEndCodes​(boolean value)
        Sets the indicator that tells if end codes should be included or not. (See SRX implementation notes).
        Parameters:
        value - true if end codes should be included, false otherwise.
      • includeIsolatedCodes

        public boolean includeIsolatedCodes()
        Indicates if isolated codes should be included (See SRX implementation notes).
        Returns:
        true if isolated codes should be included, false otherwise.
      • setIncludeIsolatedCodes

        public void setIncludeIsolatedCodes​(boolean value)
        Sets the indicator that tells if isolated codes should be included or not. (See SRX implementation notes).
        Parameters:
        value - true if isolated codes should be included, false otherwise.
      • getMaskRule

        public String getMaskRule()
        Gets the current pattern of the mask rule.
        Returns:
        the current pattern of the mask rule.
      • setMaskRule

        public void setMaskRule​(String pattern)
        Sets the pattern for the mask rule.
        Parameters:
        pattern - the new pattern to use for the mask rule.
      • getSampleText

        public String getSampleText()
        Gets the current sample text. This text is an example string that can be used to test the various rules. It can be handy to be able to save it along with the SRX document.
        Returns:
        the sample text, or an empty string.
      • setSampleText

        public void setSampleText​(String value)
        Sets the sample text.
        Parameters:
        value - the new sample text.
      • getSampleLanguage

        public String getSampleLanguage()
        Gets the current sample language code.
        Returns:
        the current sample language code.
      • setSampleLanguage

        public void setSampleLanguage​(String value)
        Sets the sample language code. Null or empty strings are changed to the default language.
        Parameters:
        value - the new sample language code.
      • testOnSelectedGroup

        public boolean testOnSelectedGroup()
        Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.
        Returns:
        true to test using only a selected group of rules, false to test using all the rules matching a given language.
      • setTestOnSelectedGroup

        public void setTestOnSelectedGroup​(boolean value)
        Sets the indicator on how to apply rules for samples.
        Parameters:
        value - true to test using only a selected group of rules, false to test using all the rules matching a given language.
      • isModified

        public boolean isModified()
        Indicates if the document has been modified since the last load or save.
        Returns:
        true if the document has been modified, false otherwise.
      • setModified

        public void setModified​(boolean value)
        Sets the flag indicating if the document has been modified since the last load or save. If you change the rules or language maps directly through the lists, make sure to set this flag to true.
        Parameters:
        value - true if the document has been changed, false otherwise.
      • addLanguageRule

        public void addLanguageRule​(String name,
                                    ArrayList<Rule> langRule)
        Adds a language rule to this SRX document. If another language rule with the same name exists already it will be replaced by the new one, without warning.
        Parameters:
        name - name of the language rule to add.
        langRule - language rule object to add.
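        The silent replacement described above is what a map's put gives you. The sketch below models it, assuming the rules are kept in the kind of LinkedHashMap that getAllLanguageRules() returns. Plain strings stand in for Rule objects, and the class is a hypothetical stand-in, not the library's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;

class LanguageRuleMapSketch {
    // Rule groups keyed by name; String stands in for the Rule type.
    private final LinkedHashMap<String, ArrayList<String>> rules = new LinkedHashMap<>();

    // put() overwrites any group already stored under the same name, without warning.
    void addLanguageRule(String name, ArrayList<String> langRule) {
        rules.put(name, langRule);
    }

    LinkedHashMap<String, ArrayList<String>> getAllLanguageRules() {
        return rules;
    }

    public static void main(String[] args) {
        LanguageRuleMapSketch doc = new LanguageRuleMapSketch();
        doc.addLanguageRule("default", new ArrayList<>(Arrays.asList("old rule")));
        doc.addLanguageRule("french", new ArrayList<>(Arrays.asList("french rule")));
        doc.addLanguageRule("default", new ArrayList<>(Arrays.asList("new rule")));
        System.out.println(doc.getAllLanguageRules()); // {default=[new rule], french=[french rule]}
    }
}
```

        If the earlier group must be kept, check getAllLanguageRules().containsKey(name) before adding.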
      • addLanguageMap

        public void addLanguageMap​(LanguageMap langMap)
        Adds a language map to this document. The new map is added after the ones already there.
        Parameters:
        langMap - the language map object to add.
      • compileLanguageRules

        public ISegmenter compileLanguageRules​(LocaleId languageCode,
                                               ISegmenter existingSegmenter)
        Compiles all the language rules applicable to a given language code and assigns them to a segmenter. This method applies the language code you specify to the language maps currently available in the document and compiles the rules when one or more matching language maps are found. The matching is done in the order of the list of language maps, and more than one map can be selected if cascade() is true.
        Parameters:
        languageCode - the language code. The value should be a BCP-47 value (e.g. "de" or "fr-ca").
        existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
        Returns:
        the instance of the segmenter with the new compiled rules.
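        The selection order described for this method can be sketched as follows: language maps are tried in list order, the first match is taken, and later matches are added only when cascading is on. This is a model of that description, not the library's code. MapEntry is a hypothetical stand-in for LanguageMap, and treating the language pattern as a regular expression on the language code is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CascadeSketch {
    // Hypothetical stand-in for a language map: a pattern on the language code
    // plus the name of the rule group it selects.
    static final class MapEntry {
        final String languagePattern;
        final String ruleName;

        MapEntry(String languagePattern, String ruleName) {
            this.languagePattern = languagePattern;
            this.ruleName = ruleName;
        }
    }

    // Walks the maps in order and returns the names of the selected rule groups.
    // Without cascading, only the first matching map is used.
    static List<String> select(List<MapEntry> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (MapEntry map : maps) {
            if (Pattern.matches(map.languagePattern, languageCode)) {
                selected.add(map.ruleName);
                if (!cascade) {
                    break;
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<MapEntry> maps = Arrays.asList(
            new MapEntry("[Ff][Rr].*", "french"),
            new MapEntry(".*", "default"));
        System.out.println(select(maps, "fr-ca", true));  // [french, default]
        System.out.println(select(maps, "fr-ca", false)); // [french]
    }
}
```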
      • compileSingleLanguageRule

        public ISegmenter compileSingleLanguageRule​(String ruleName,
                                                    ISegmenter existingSegmenter)
        Compiles a single language rule group and assigns it to a segmenter.
        Parameters:
        ruleName - the name of the rule group to apply.
        existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
        Returns:
        the instance of the segmenter with the new compiled rules.
      • generateRuleRegex

        public String generateRuleRegex​(Rule rule)
      • loadRules

        public void loadRules​(CharSequence data)
        Loads an SRX document from a CharSequence object. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
        Parameters:
        data - the string containing the SRX document to load.
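        For orientation, the sketch below shows the kind of string such a call would receive: a minimal SRX 2.0 document with one <languagerule> and one <languagemap>. It reads the rule names with the JDK's DOM parser purely to show the structure. It does not call this class, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SrxStructureSketch {
    // A minimal SRX 2.0 document: one rule group named "default" mapped to all languages.
    static final String SAMPLE =
        "<srx xmlns=\"http://www.lisa.org/srx20\" version=\"2.0\">"
        + "<header segmentsubflows=\"yes\" cascade=\"no\"/>"
        + "<body>"
        + "<languagerules>"
        + "<languagerule languagerulename=\"default\">"
        + "<rule break=\"yes\"><beforebreak>[.?!]</beforebreak><afterbreak>\\s</afterbreak></rule>"
        + "</languagerule>"
        + "</languagerules>"
        + "<maprules><languagemap languagepattern=\".*\" languagerulename=\"default\"/></maprules>"
        + "</body>"
        + "</srx>";

    // Returns the languagerulename attribute of every <languagerule> element in the data.
    static List<String> languageRuleNames(CharSequence data) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(data.toString())));
            NodeList nodes = doc.getElementsByTagName("languagerule");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(((Element) nodes.item(i)).getAttribute("languagerulename"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the SRX data", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageRuleNames(SAMPLE)); // [default]
    }
}
```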
      • loadRules

        public void loadRules​(String pathOrURL)
        Loads an SRX document from a file. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.

        For SRXDocument.DEFAULT_SRX_RULES (the string "DEFAULT_SRX_RULES" in serialized parameters), this loads the Okapi-recommended .srx file embedded in the library JAR.

        Parameters:
        pathOrURL - The full path or URL of the document to load.
      • loadRules

        public void loadRules​(InputStream inputStream)
        Loads an SRX document from an input stream. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
        Parameters:
        inputStream - the input stream to read from.
      • saveRulesToString

        public String saveRulesToString​(boolean saveExtensions,
                                        boolean saveNonValidInfo)
        Saves the current rules to an SRX string.
        Parameters:
        saveExtensions - true to save Okapi SRX extensions, false otherwise.
        saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
        Returns:
        the string containing the saved SRX rules.
      • saveRules

        public void saveRules​(String rulesPath,
                              boolean saveExtensions,
                              boolean saveNonValidInfo)
        Saves the current rules to an SRX rules document.
        Parameters:
        rulesPath - the full path of the file where to save the rules.
        saveExtensions - true to save Okapi SRX extensions, false otherwise.
        saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.