Package net.sf.okapi.lib.segmentation
Class SRXDocument
- java.lang.Object
-
- net.sf.okapi.lib.segmentation.SRXDocument
-
public class SRXDocument extends Object
Provides facilities to load, save, and manage segmentation rules in SRX format. This class also implements several extensions to the standard SRX behavior.
-
-
Field Summary
Modifier and Type | Field | Description
static String | ANYCODE | Marker for INLINECODE_PATTERN in the given pattern.
static String | DEFAULT_SRX_RULES |
static String | INLINECODE_PATTERN | Represents the pattern for an inline code (both special characters).
static String | NOAUTO | Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES.
-
Constructor Summary
Constructor | Description
SRXDocument() | Creates an empty SRX document.
-
Method Summary
Modifier and Type | Method | Description
void | addLanguageMap(LanguageMap langMap) | Adds a language map to this document.
void | addLanguageRule(String name, ArrayList<Rule> langRule) | Adds a language rule to this SRX document.
boolean | cascade() | Indicates if cascading must be applied when selecting the rules for a given language pattern.
ISegmenter | compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter) | Compiles all the language rules applicable for a given language code, and assigns them to a segmenter.
ISegmenter | compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter) | Compiles a single language rule group and assigns it to a segmenter.
String | generateRuleRegex(Rule rule) |
LinkedHashMap<String,ArrayList<Rule>> | getAllLanguageRules() | Gets a map of all the language rules in this document.
ArrayList<LanguageMap> | getAllLanguagesMaps() | Gets the list of all the language maps in this document.
String | getComments() | Gets the comments associated with this document.
String | getHeaderComments() | Gets the comments associated with the header of this document.
ArrayList<Rule> | getLanguageRules(String ruleName) | Gets the list of rules for a given <languagerule> element.
String | getMaskRule() | Gets the current pattern of the mask rule.
String | getSampleLanguage() | Gets the current sample language code.
String | getSampleText() | Gets the current sample text.
String | getVersion() | Gets the version of this SRX document.
String | getWarning() | Gets the last warning that was issued while loading a document.
boolean | hasWarning() | Indicates if a warning was issued the last time a document was read.
boolean | includeEndCodes() | Indicates if end codes should be included (see SRX implementation notes).
boolean | includeIsolatedCodes() | Indicates if isolated codes should be included (see SRX implementation notes).
boolean | includeStartCodes() | Indicates if start codes should be included (see SRX implementation notes).
boolean | isModified() | Indicates if the document has been modified since the last load or save.
void | loadRules(InputStream inputStream) | Loads an SRX document from an input stream.
void | loadRules(CharSequence data) | Loads an SRX document from a CharSequence object.
void | loadRules(String pathOrURL) | Loads an SRX document from a file.
boolean | oneSegmentIncludesAll() | Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void | resetAll() | Resets the document to its default empty initial state.
void | saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo) | Saves the current rules to an SRX rules document.
String | saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo) | Saves the current rules to an SRX string.
boolean | segmentSubFlows() | Indicates if sub-flows must be segmented.
void | setCascade(boolean value) | Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
void | setComments(String text) | Sets the comments for this document.
void | setHeaderComments(String text) | Sets the comments for the header of this document.
void | setIncludeEndCodes(boolean value) | Sets the indicator that tells if end codes should be included or not.
void | setIncludeIsolatedCodes(boolean value) | Sets the indicator that tells if isolated codes should be included or not.
void | setIncludeStartCodes(boolean value) | Sets the indicator that tells if start codes should be included or not.
void | setMaskRule(String pattern) | Sets the pattern for the mask rule.
void | setModified(boolean value) | Sets the flag indicating if the document has been modified since the last load or save.
void | setOneSegmentIncludesAll(boolean value) | Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void | setSampleLanguage(String value) | Sets the sample language code.
void | setSampleText(String value) | Sets the sample text.
void | setSegmentSubFlows(boolean value) | Sets the flag indicating if sub-flows must be segmented.
void | setTestOnSelectedGroup(boolean value) | Sets the indicator on how to apply rules for samples.
void | setTreatIsolatedCodesAsWhitespace(boolean value) | Sets the indicator that tells if this document should treat isolated codes as whitespace when matching SRX rules.
void | setTrimLeadingWhitespaces(boolean value) | Sets the indicator that tells if leading white-spaces should be left outside the segments.
void | setTrimTrailingWhitespaces(boolean value) | Sets the indicator that tells if trailing white-spaces should be left outside the segments.
void | setUseICU4JBreakRules(boolean value) | Sets the indicator that tells if this document uses ICU4J BreakIterator rules.
boolean | testOnSelectedGroup() | Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.
boolean | treatIsolatedCodesAsWhitespace() | Indicates if this document should treat isolated codes as whitespace when matching SRX rules.
boolean | trimLeadingWhitespaces() | Indicates if leading white-spaces should be left outside the segments.
boolean | trimTrailingWhitespaces() | Indicates if trailing white-spaces should be left outside the segments.
boolean | useIcu4JBreakRules() | Indicates if this document uses ICU4J break rules.
-
-
-
Field Detail
-
DEFAULT_SRX_RULES
public static final String DEFAULT_SRX_RULES
- See Also:
- Constant Field Values
-
INLINECODE_PATTERN
public static final String INLINECODE_PATTERN
Represents the pattern for an inline code (both special characters).
-
ANYCODE
public static final String ANYCODE
Marker for INLINECODE_PATTERN in the given pattern. \Y+ = one or more codes, \Y* = zero or more codes, etc.
- See Also:
- Constant Field Values
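The marker expansion described for ANYCODE can be pictured as a plain text substitution applied to a rule pattern before it is compiled. The following is a minimal sketch, not the library's implementation: it assumes the marker is the literal text \Y (as the description suggests), and it uses a hypothetical stand-in value for INLINECODE_PATTERN (one private-use marker character followed by one index character). The class name AnyCodeSketch and the helper expand() are illustrative only; the real constant values are the ones listed under Constant Field Values.

```java
import java.util.regex.Pattern;

final class AnyCodeSketch {
    // Assumption: the marker is the two-character text \Y, as the field description suggests.
    static final String ANYCODE = "\\Y";
    // Hypothetical stand-in for INLINECODE_PATTERN: one marker character plus one index character.
    static final String INLINECODE_PATTERN = "[\uE101\uE102\uE103].";

    // Replaces each marker with a non-capturing group, so that a quantifier
    // written after the marker applies to the whole two-character code.
    static String expand(String rulePattern) {
        return rulePattern.replace(ANYCODE, "(?:" + INLINECODE_PATTERN + ")");
    }

    public static void main(String[] args) {
        // A period followed by zero or more inline codes.
        String expanded = expand("\\.\\Y*");
        System.out.println(Pattern.matches(expanded, ".\uE102\uE110")); // prints true
        System.out.println(Pattern.matches(expanded, "."));             // prints true
    }
}
```

Wrapping the substituted pattern in a group is a design choice of this sketch: without it, a trailing + or * would bind only to the last element of the inline-code pattern.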
-
NOAUTO
public static final String NOAUTO
Placed at the end of the 'after' expression, this marker indicates the given pattern should not have auto-insertion of AUTO_INLINECODES.
- See Also:
- Constant Field Values
-
-
Method Detail
-
getVersion
public String getVersion()
Gets the version of this SRX document.
- Returns:
- the version of this SRX document.
-
hasWarning
public boolean hasWarning()
Indicates if a warning was issued the last time a document was read.
- Returns:
- true if a warning was issued, false otherwise.
-
getWarning
public String getWarning()
Gets the last warning that was issued while loading a document.
- Returns:
- the text of the last warning issued, or an empty string.
-
getHeaderComments
public String getHeaderComments()
Gets the comments associated with the header of this document.
- Returns:
- the comments for the header of this document, or null if there are none.
-
setHeaderComments
public void setHeaderComments(String text)
Sets the comments for the header of this document.
- Parameters:
text - the new comments; use null or an empty string to remove the comments.
-
getComments
public String getComments()
Gets the comments associated with this document.
- Returns:
- the comments for this document, or null if there are none.
-
setComments
public void setComments(String text)
Sets the comments for this document.
- Parameters:
text - the new comments; use null or an empty string to remove the comments.
-
resetAll
public void resetAll()
Resets the document to its default empty initial state.
-
getAllLanguageRules
public LinkedHashMap<String,ArrayList<Rule>> getAllLanguageRules()
Gets a map of all the language rules in this document.
- Returns:
- a map of all the language rules.
-
getLanguageRules
public ArrayList<Rule> getLanguageRules(String ruleName)
Gets the list of rules for a given <languagerule> element.
- Parameters:
ruleName - the name of the <languagerule> element to query.
- Returns:
- the list of rules for the given <languagerule> element.
-
getAllLanguagesMaps
public ArrayList<LanguageMap> getAllLanguagesMaps()
Gets the list of all the language maps in this document.
- Returns:
- the list of all the language maps.
-
segmentSubFlows
public boolean segmentSubFlows()
Indicates if sub-flows must be segmented.
- Returns:
- true if sub-flows must be segmented, false otherwise.
-
setSegmentSubFlows
public void setSegmentSubFlows(boolean value)
Sets the flag indicating if sub-flows must be segmented.
- Parameters:
value - true if sub-flows must be segmented, false otherwise.
-
cascade
public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.
- Returns:
- true if cascading must be applied, false otherwise.
-
setCascade
public void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
- Parameters:
value - true if cascading must be applied, false otherwise.
-
oneSegmentIncludesAll
public boolean oneSegmentIncludesAll()
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Returns:
- true if a text with a single segment should include the whole text, false otherwise.
-
setOneSegmentIncludesAll
public void setOneSegmentIncludesAll(boolean value)
Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Parameters:
value - true if a text with a single segment should include the whole text, false otherwise.
-
useIcu4JBreakRules
public boolean useIcu4JBreakRules()
Indicates if this document uses ICU4J break rules.
- Returns:
- true if ICU4J break rules are used, false otherwise.
-
setUseICU4JBreakRules
public void setUseICU4JBreakRules(boolean value)
Sets the indicator that tells if this document uses ICU4J BreakIterator rules. BreakIterator break positions are converted to SRX-like rules and used as default rules for all languages.
- Parameters:
value - true if ICU4J rules should be used as the default rules, false if no ICU4J rules should be used.
-
treatIsolatedCodesAsWhitespace
public boolean treatIsolatedCodesAsWhitespace()
Indicates if this document should treat isolated codes as whitespace when matching SRX rules.
- Returns:
- true if isolated codes should be treated as whitespace, false otherwise.
-
setTreatIsolatedCodesAsWhitespace
public void setTreatIsolatedCodesAsWhitespace(boolean value)
Sets the indicator that tells if this document should treat isolated codes as whitespace when matching SRX rules.
- Parameters:
value - true if isolated codes should be treated as whitespace, false otherwise.
-
trimLeadingWhitespaces
public boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.
- Returns:
- true if the leading white-spaces should be trimmed.
-
setTrimLeadingWhitespaces
public void setTrimLeadingWhitespaces(boolean value)
Sets the indicator that tells if leading white-spaces should be left outside the segments.
- Parameters:
value - true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
public boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.
- Returns:
- true if the trailing white-spaces should be trimmed.
-
setTrimTrailingWhitespaces
public void setTrimTrailingWhitespaces(boolean value)
Sets the indicator that tells if trailing white-spaces should be left outside the segments.
- Parameters:
value - true if the trailing white-spaces should be trimmed.
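To make "left outside the segments" concrete: with trimming enabled, the whitespace stays in the text, but the segment boundaries move inward so that it falls between segments. The sketch below only illustrates that idea on a plain string. The class TrimSketch and the helper trimRange() are hypothetical names, and the sketch ignores inline codes, which the real segmenter also has to account for.

```java
final class TrimSketch {
    // Hypothetical helper: returns {start, end} of a candidate segment after optional trimming.
    static int[] trimRange(String text, int start, int end,
                           boolean trimLeading, boolean trimTrailing) {
        if (trimLeading) {
            // Move the start forward past leading whitespace.
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailing) {
            // Move the end backward past trailing whitespace.
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String text = "First sentence.  Second sentence. ";
        // Candidate segment starting right after the first period (offset 15).
        int[] r = trimRange(text, 15, text.length(), true, true);
        System.out.println("[" + text.substring(r[0], r[1]) + "]"); // prints [Second sentence.]
    }
}
```

With both flags set to false, the same call returns the untouched range, so the segment keeps its surrounding spaces.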
-
includeStartCodes
public boolean includeStartCodes()
Indicates if start codes should be included (see SRX implementation notes).
- Returns:
- true if start codes should be included, false otherwise.
-
setIncludeStartCodes
public void setIncludeStartCodes(boolean value)
Sets the indicator that tells if start codes should be included or not (see SRX implementation notes).
- Parameters:
value - true if start codes should be included, false otherwise.
-
includeEndCodes
public boolean includeEndCodes()
Indicates if end codes should be included (see SRX implementation notes).
- Returns:
- true if end codes should be included, false otherwise.
-
setIncludeEndCodes
public void setIncludeEndCodes(boolean value)
Sets the indicator that tells if end codes should be included or not (see SRX implementation notes).
- Parameters:
value - true if end codes should be included, false otherwise.
-
includeIsolatedCodes
public boolean includeIsolatedCodes()
Indicates if isolated codes should be included (see SRX implementation notes).
- Returns:
- true if isolated codes should be included, false otherwise.
-
setIncludeIsolatedCodes
public void setIncludeIsolatedCodes(boolean value)
Sets the indicator that tells if isolated codes should be included or not (see SRX implementation notes).
- Parameters:
value - true if isolated codes should be included, false otherwise.
-
getMaskRule
public String getMaskRule()
Gets the current pattern of the mask rule.
- Returns:
- the current pattern of the mask rule.
-
setMaskRule
public void setMaskRule(String pattern)
Sets the pattern for the mask rule.
- Parameters:
pattern - the new pattern to use for the mask rule.
-
getSampleText
public String getSampleText()
Gets the current sample text. This text is an example string that can be used to test the various rules. It can be handy to save it along with the SRX document.
- Returns:
- the sample text, or an empty string.
-
setSampleText
public void setSampleText(String value)
Sets the sample text.
- Parameters:
value - the new sample text.
-
getSampleLanguage
public String getSampleLanguage()
Gets the current sample language code.
- Returns:
- the current sample language code.
-
setSampleLanguage
public void setSampleLanguage(String value)
Sets the sample language code. A null or empty string is changed to the default language.
- Parameters:
value - the new sample language code.
-
testOnSelectedGroup
public boolean testOnSelectedGroup()
Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.
- Returns:
- true to test using only a selected group of rules, false to test using all the rules matching a given language.
-
setTestOnSelectedGroup
public void setTestOnSelectedGroup(boolean value)
Sets the indicator on how to apply rules for samples.
- Parameters:
value - true to test using only a selected group of rules, false to test using all the rules matching a given language.
-
isModified
public boolean isModified()
Indicates if the document has been modified since the last load or save.
- Returns:
- true if the document has been modified, false otherwise.
-
setModified
public void setModified(boolean value)
Sets the flag indicating if the document has been modified since the last load or save. If you change the rules or language maps directly in the lists, make sure to set this flag to true.
- Parameters:
value - true if the document has been changed, false otherwise.
-
addLanguageRule
public void addLanguageRule(String name, ArrayList<Rule> langRule)
Adds a language rule to this SRX document. If another language rule with the same name already exists, it is replaced by the new one without warning.
- Parameters:
name - name of the language rule to add.
langRule - language rule object to add.
-
addLanguageMap
public void addLanguageMap(LanguageMap langMap)
Adds a language map to this document. The new map is added at the end of the ones already there.
- Parameters:
langMap - the language map object to add.
-
compileLanguageRules
public ISegmenter compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter)
Compiles all the language rules applicable for a given language code, and assigns them to a segmenter. This method applies the language code you specify to the language mappings currently available in the document and compiles the rules when one or more language maps are found. The matching is done in the order of the list of language maps, and more than one can be selected if cascade() is true.
- Parameters:
languageCode - the language code. The value should be a BCP-47 value (e.g. "de", "fr-ca", etc.)
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
- Returns:
- the instance of the segmenter with the new compiled rules.
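The selection step described for compileLanguageRules (match the language code against the language maps in list order, taking every match when cascading and only the first one otherwise) can be sketched independently of the library. In the sketch below, LangMap is a hypothetical stand-in for a language map (a regular expression plus a rule-group name), selectRuleNames() is an illustrative helper, and matching the pattern against the whole language code is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class CascadeSketch {
    // Hypothetical stand-in for a language map entry.
    static final class LangMap {
        final String pattern;   // regular expression applied to the language code
        final String ruleName;  // name of the rule group to use on a match

        LangMap(String pattern, String ruleName) {
            this.pattern = pattern;
            this.ruleName = ruleName;
        }
    }

    // Walks the maps in list order and collects the names of the matching rule groups.
    static List<String> selectRuleNames(List<LangMap> maps, String languageCode, boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (LangMap map : maps) {
            if (Pattern.matches(map.pattern, languageCode)) {
                selected.add(map.ruleName);
                if (!cascade) {
                    break; // without cascading, only the first matching map is used
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<LangMap> maps = new ArrayList<>();
        maps.add(new LangMap("[Ff][Rr].*", "French"));
        maps.add(new LangMap(".*", "Default"));

        System.out.println(selectRuleNames(maps, "fr-ca", true));  // prints [French, Default]
        System.out.println(selectRuleNames(maps, "fr-ca", false)); // prints [French]
        System.out.println(selectRuleNames(maps, "de", true));     // prints [Default]
    }
}
```

Because order matters in both modes, specific maps are usually placed before catch-all ones such as ".*".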
-
compileSingleLanguageRule
public ISegmenter compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter)
Compiles a single language rule group and assigns it to a segmenter.
- Parameters:
ruleName - the name of the rule group to apply.
existingSegmenter - optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
- Returns:
- the instance of the segmenter with the new compiled rules.
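For readers unfamiliar with how a rule group behaves once compiled, the following sketch shows the general SRX idea: each rule has a pattern for the text before a candidate position, a pattern for the text after it, and a break or no-break flag, and the first rule that matches at a position decides the outcome. This is a simplified illustration using java.util.regex only. SimpleRule and breakPositions() are hypothetical names, and the sketch does not reproduce the extensions, inline-code handling, or performance characteristics of the real segmenter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RuleGroupSketch {
    // Hypothetical stand-in for one rule of a rule group.
    static final class SimpleRule {
        final Pattern before;  // must match at the end of the text preceding the position
        final Pattern after;   // must match at the start of the text following the position
        final boolean isBreak; // true = break rule, false = no-break (exception) rule

        SimpleRule(String before, String after, boolean isBreak) {
            this.before = Pattern.compile("(?:" + before + ")\\z");
            this.after = Pattern.compile(after);
            this.isBreak = isBreak;
        }
    }

    // Returns the offsets at which the text is split, testing the rules in order.
    static List<Integer> breakPositions(String text, List<SimpleRule> rules) {
        List<Integer> breaks = new ArrayList<>();
        for (int pos = 1; pos < text.length(); pos++) {
            String left = text.substring(0, pos);
            String right = text.substring(pos);
            for (SimpleRule rule : rules) {
                if (rule.before.matcher(left).find() && rule.after.matcher(right).lookingAt()) {
                    if (rule.isBreak) {
                        breaks.add(pos);
                    }
                    break; // the first matching rule decides; later rules are not consulted
                }
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        List<SimpleRule> rules = new ArrayList<>();
        rules.add(new SimpleRule("\\bMr\\.", "\\s", false)); // exception: no break after "Mr."
        rules.add(new SimpleRule("[.?!]", "\\s", true));     // break after sentence punctuation
        System.out.println(breakPositions("Mr. Smith left. He was late.", rules)); // prints [15]
    }
}
```

Placing the no-break exception before the general break rule is what prevents the split after "Mr."; reversing the order would produce a break at offset 3 as well.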
-
loadRules
public void loadRules(CharSequence data)
Loads an SRX document from a CharSequence object. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
- Parameters:
data - the string containing the SRX document to load.
-
loadRules
public void loadRules(String pathOrURL)
Loads an SRX document from a file. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
For SRXDocument.DEFAULT_SRX_RULES (the string "DEFAULT_SRX_RULES" in serialized parameters), this loads the (Okapi-recommended) .srx file embedded in the library jar.
- Parameters:
pathOrURL - the full path or URL of the document to load.
-
loadRules
public void loadRules(InputStream inputStream)
Loads an SRX document from an input stream. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
- Parameters:
inputStream - the input stream to read from.
-
saveRulesToString
public String saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX string.
- Parameters:
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
- Returns:
- the string containing the saved SRX rules.
-
saveRules
public void saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX rules document.
- Parameters:
rulesPath - the full path of the file in which to save the rules.
saveExtensions - true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo - true to save non-SRX-valid attributes, false otherwise.
-
-