Package net.sf.okapi.lib.segmentation
Class SRXDocument
- java.lang.Object
-
- net.sf.okapi.lib.segmentation.SRXDocument
-
public class SRXDocument extends Object
Provides facilities to load, save, and manage segmentation rules in SRX format. This class also implements several extensions to the standard SRX behavior.
-
-
Field Summary
Fields (modifier and type, field, description):
static String ANYCODE
Marker for INLINECODE_PATTERN in the given pattern.
static String DEFAULT_SRX_RULES
static String INLINECODE_PATTERN
Represents the pattern for an inline code (both special characters).
static String NOAUTO
Placed at the end of the 'after' expression, this marker indicates that the given pattern should not have auto-insertion of AUTO_INLINECODES.
-
Constructor Summary
Constructors (constructor, description):
SRXDocument()
Creates an empty SRX document.
-
Method Summary
Methods (modifier and type, method, description):
void addLanguageMap(LanguageMap langMap)
Adds a language map to this document.
void addLanguageRule(String name, ArrayList<Rule> langRule)
Adds a language rule to this SRX document.
boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.
ISegmenter compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter)
Compiles all the language rules applicable for a given language code, and assigns them to a segmenter.
ISegmenter compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter)
Compiles a single language rule group and assigns it to a segmenter.
String generateRuleRegex(Rule rule)
LinkedHashMap<String,ArrayList<Rule>> getAllLanguageRules()
Gets a map of all the language rules in this document.
ArrayList<LanguageMap> getAllLanguagesMaps()
Gets the list of all the language maps in this document.
String getComments()
Gets the comments associated with this document.
String getHeaderComments()
Gets the comments associated with the header of this document.
ArrayList<Rule> getLanguageRules(String ruleName)
Gets the list of rules for a given <languagerule> element.
String getMaskRule()
Gets the current pattern of the mask rule.
String getSampleLanguage()
Gets the current sample language code.
String getSampleText()
Gets the current sample text.
String getVersion()
Gets the version of this SRX document.
String getWarning()
Gets the last warning that was issued while loading a document.
boolean hasWarning()
Indicates if a warning was issued last time a document was read.
boolean includeEndCodes()
Indicates if end codes should be included (See SRX implementation notes).
boolean includeIsolatedCodes()
Indicates if isolated codes should be included (See SRX implementation notes).
boolean includeStartCodes()
Indicates if start codes should be included (See SRX implementation notes).
boolean isModified()
Indicates if the document has been modified since the last load or save.
void loadRules(InputStream inputStream)
Loads an SRX document from an input stream.
void loadRules(CharSequence data)
Loads an SRX document from a CharSequence object.
void loadRules(String pathOrURL)
Loads an SRX document from a file.
boolean oneSegmentIncludesAll()
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void resetAll()
Resets the document to its default empty initial state.
void saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX rules document.
String saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX string.
boolean segmentSubFlows()
Indicates if sub-flows must be segmented.
void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.
void setComments(String text)
Sets the comments for this document.
void setHeaderComments(String text)
Sets the comments for the header of this document.
void setIncludeEndCodes(boolean value)
Sets the indicator that tells if end codes should be included or not.
void setIncludeIsolatedCodes(boolean value)
Sets the indicator that tells if isolated codes should be included or not.
void setIncludeStartCodes(boolean value)
Sets the indicator that tells if start codes should be included or not.
void setMaskRule(String pattern)
Sets the pattern for the mask rule.
void setModified(boolean value)
Sets the flag indicating if the document has been modified since the last load or save.
void setOneSegmentIncludesAll(boolean value)
Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
void setSampleLanguage(String value)
Sets the sample language code.
void setSampleText(String value)
Sets the sample text.
void setSegmentSubFlows(boolean value)
Sets the flag indicating if sub-flows must be segmented.
void setTestOnSelectedGroup(boolean value)
Sets the indicator on how to apply rules for samples.
void setTreatIsolatedCodesAsWhitespace(boolean value)
Sets the indicator that tells if this document should treat isolated codes as whitespace when matching SRX rules.
void setTrimLeadingWhitespaces(boolean value)
Sets the indicator that tells if leading white-spaces should be left outside the segments.
void setTrimTrailingWhitespaces(boolean value)
Sets the indicator that tells if trailing white-spaces should be left outside the segments.
void setUseICU4JBreakRules(boolean value)
Sets the indicator that tells if this document uses ICU4J BreakIterator rules.
boolean testOnSelectedGroup()
Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.
boolean treatIsolatedCodesAsWhitespace()
Indicates if this document should treat isolated codes as whitespace when matching SRX rules.
boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.
boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.
boolean useIcu4JBreakRules()
Indicates if this document uses ICU4J break rules.
-
-
-
Field Detail
-
DEFAULT_SRX_RULES
public static final String DEFAULT_SRX_RULES
- See Also:
- Constant Field Values
-
INLINECODE_PATTERN
public static final String INLINECODE_PATTERN
Represents the pattern for an inline code (both special characters).
-
ANYCODE
public static final String ANYCODE
Marker for INLINECODE_PATTERN in the given pattern. \Y+ = one or more codes, \Y* = zero or more codes, etc.
- See Also:
- Constant Field Values
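The marker notation above can be illustrated with a minimal sketch. It assumes that the marker is simply substituted by the inline-code pattern before the rule is compiled as a regular expression. The value of `INLINE_CODE` below (one marker character from a private-use range followed by one index character) and the class name `AnyCodeExpansion` are assumptions made for illustration, not the actual constant values of this class.

```java
import java.util.regex.Pattern;

class AnyCodeExpansion {
    // Assumed stand-in for INLINECODE_PATTERN: one marker character from a
    // private-use range followed by one index character (two characters total).
    static final String INLINE_CODE = "[\\uE101-\\uE103].";

    // The marker as written in a rule: a backslash followed by 'Y'.
    static final String ANYCODE = "\\Y";

    // Replaces each marker with the inline-code pattern, grouped so that a
    // following quantifier applies to the whole two-character code.
    static String expand(String rulePattern) {
        return rulePattern.replace(ANYCODE, "(?:" + INLINE_CODE + ")");
    }

    public static void main(String[] args) {
        // A rule fragment: a period, zero or more inline codes, then whitespace.
        Pattern p = Pattern.compile(expand("\\.\\Y*\\s"));
        System.out.println(p.matcher("End.\uE101\uE110 Next").find()); // true
        System.out.println(p.matcher("End. Next").find());             // true
        System.out.println(p.matcher("End.Next").find());              // false
    }
}
```

Wrapping the substituted pattern in a non-capturing group is a design choice of this sketch: without it, a quantifier such as `+` would bind only to the last character of the code.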
-
NOAUTO
public static final String NOAUTO
Placed at the end of the 'after' expression, this marker indicates that the given pattern should not have auto-insertion of AUTO_INLINECODES.
- See Also:
- Constant Field Values
-
-
Method Detail
-
getVersion
public String getVersion()
Gets the version of this SRX document.- Returns:
- the version of this SRX document.
-
hasWarning
public boolean hasWarning()
Indicates if a warning was issued last time a document was read.- Returns:
- true if a warning was issued, false otherwise.
-
getWarning
public String getWarning()
Gets the last warning that was issued while loading a document.- Returns:
- the text of the last warning issued, or an empty string.
-
getHeaderComments
public String getHeaderComments()
Gets the comments associated with the header of this document.- Returns:
- the comments for the header of this document, or null if there are none.
-
setHeaderComments
public void setHeaderComments(String text)
Sets the comments for the header of this document.- Parameters:
text
- the new comments; use null or an empty string to remove the comments.
-
getComments
public String getComments()
Gets the comments associated with this document.- Returns:
- the comments for this document, or null if there are none.
-
setComments
public void setComments(String text)
Sets the comments for this document.- Parameters:
text
- the new comments; use null or an empty string to remove the comments.
-
resetAll
public void resetAll()
Resets the document to its default empty initial state.
-
getAllLanguageRules
public LinkedHashMap<String,ArrayList<Rule>> getAllLanguageRules()
Gets a map of all the language rules in this document.- Returns:
- a map of all the language rules.
-
getLanguageRules
public ArrayList<Rule> getLanguageRules(String ruleName)
Gets the list of rules for a given <languagerule> element.
- Parameters:
ruleName
- the name of the <languagerule> element to query.
- Returns:
- the list of rules for the given <languagerule> element.
-
getAllLanguagesMaps
public ArrayList<LanguageMap> getAllLanguagesMaps()
Gets the list of all the language maps in this document.- Returns:
- the list of all the language maps.
-
segmentSubFlows
public boolean segmentSubFlows()
Indicates if sub-flows must be segmented.- Returns:
- true if sub-flows must be segmented, false otherwise.
-
setSegmentSubFlows
public void setSegmentSubFlows(boolean value)
Sets the flag indicating if sub-flows must be segmented.- Parameters:
value
- true if sub-flows must be segmented, false otherwise.
-
cascade
public boolean cascade()
Indicates if cascading must be applied when selecting the rules for a given language pattern.- Returns:
- true if cascading must be applied, false otherwise.
-
setCascade
public void setCascade(boolean value)
Sets the flag indicating if cascading must be applied when selecting the rules for a given language pattern.- Parameters:
value
- true if cascading must be applied, false otherwise.
-
oneSegmentIncludesAll
public boolean oneSegmentIncludesAll()
Indicates if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Returns:
- true if a text with a single segment should include the whole text.
-
setOneSegmentIncludesAll
public void setOneSegmentIncludesAll(boolean value)
Sets the indicator that tells if, when there is a single segment in a text, it should include the whole text (no trimming of spaces or codes on the left or right).
- Parameters:
value
- true if a text with a single segment should include the whole text.
-
useIcu4JBreakRules
public boolean useIcu4JBreakRules()
Indicates if this document uses ICU4J break rules.- Returns:
- true if ICU4J break rules are used, false otherwise.
-
setUseICU4JBreakRules
public void setUseICU4JBreakRules(boolean value)
Sets the indicator that tells if this document uses ICU4J BreakIterator rules. BreakIterator break positions are converted to SRX-like rules and used as default rules for all languages.
- Parameters:
value
- true if ICU4J rules should be used as the default rules, false if no ICU4J rules should be used.
-
treatIsolatedCodesAsWhitespace
public boolean treatIsolatedCodesAsWhitespace()
Indicates if this document should treat isolated codes as whitespace when matching SRX rules.- Returns:
- true if isolated codes should be treated as whitespace
-
setTreatIsolatedCodesAsWhitespace
public void setTreatIsolatedCodesAsWhitespace(boolean value)
Sets the indicator if this document should treat isolated codes as whitespace when matching SRX rules.- Parameters:
value
- true if isolated codes should be treated as whitespace
-
trimLeadingWhitespaces
public boolean trimLeadingWhitespaces()
Indicates if leading white-spaces should be left outside the segments.- Returns:
- true if the leading white-spaces should be trimmed.
-
setTrimLeadingWhitespaces
public void setTrimLeadingWhitespaces(boolean value)
Sets the indicator that tells if leading white-spaces should be left outside the segments.- Parameters:
value
- true if the leading white-spaces should be trimmed.
-
trimTrailingWhitespaces
public boolean trimTrailingWhitespaces()
Indicates if trailing white-spaces should be left outside the segments.- Returns:
- true if the trailing white-spaces should be trimmed.
-
setTrimTrailingWhitespaces
public void setTrimTrailingWhitespaces(boolean value)
Sets the indicator that tells if trailing white-spaces should be left outside the segments.- Parameters:
value
- true if the trailing white-spaces should be trimmed.
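As a rough illustration of what leaving white-spaces outside a segment means, the following minimal sketch shrinks a candidate segment range so that leading and trailing white-space fall outside it. The class `SegmentTrimming`, its `trim` method, and the use of `Character.isWhitespace` are assumptions made for this example; the actual segmenter may define white-space differently and also has to deal with inline codes.

```java
class SegmentTrimming {
    // Returns {start, end} (end exclusive) of the segment after optionally
    // moving the boundaries inward past white-space characters.
    static int[] trim(String text, int start, int end,
                      boolean trimLeading, boolean trimTrailing) {
        if (trimLeading) {
            while (start < end && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
        }
        if (trimTrailing) {
            while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "  Hello.  ";
        int[] range = trim(text, 0, text.length(), true, true);
        // Prints [Hello.] since both boundaries moved inward.
        System.out.println("[" + text.substring(range[0], range[1]) + "]");
    }
}
```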
-
includeStartCodes
public boolean includeStartCodes()
Indicates if start codes should be included (See SRX implementation notes).- Returns:
- true if start codes should be included, false otherwise.
-
setIncludeStartCodes
public void setIncludeStartCodes(boolean value)
Sets the indicator that tells if start codes should be included or not. (See SRX implementation notes).- Parameters:
value
- true if start codes should be included, false otherwise.
-
includeEndCodes
public boolean includeEndCodes()
Indicates if end codes should be included (See SRX implementation notes).- Returns:
- true if end codes should be included, false otherwise.
-
setIncludeEndCodes
public void setIncludeEndCodes(boolean value)
Sets the indicator that tells if end codes should be included or not. (See SRX implementation notes).- Parameters:
value
- true if end codes should be included, false otherwise.
-
includeIsolatedCodes
public boolean includeIsolatedCodes()
Indicates if isolated codes should be included (See SRX implementation notes).- Returns:
- true if isolated codes should be included, false otherwise.
-
setIncludeIsolatedCodes
public void setIncludeIsolatedCodes(boolean value)
Sets the indicator that tells if isolated codes should be included or not. (See SRX implementation notes).- Parameters:
value
- true if isolated codes should be included, false otherwise.
-
getMaskRule
public String getMaskRule()
Gets the current pattern of the mask rule.- Returns:
- the current pattern of the mask rule.
-
setMaskRule
public void setMaskRule(String pattern)
Sets the pattern for the mask rule.- Parameters:
pattern
- the new pattern to use for the mask rule.
-
getSampleText
public String getSampleText()
Gets the current sample text. This text is an example string that can be used to test the various rules. It can be handy to be able to save it along with the SRX document.- Returns:
- the sample text, or an empty string.
-
setSampleText
public void setSampleText(String value)
Sets the sample text.- Parameters:
value
- the new sample text.
-
getSampleLanguage
public String getSampleLanguage()
Gets the current sample language code.- Returns:
- the current sample language code.
-
setSampleLanguage
public void setSampleLanguage(String value)
Sets the sample language code. Null or empty strings are changed to the default language.- Parameters:
value
- the new sample language code.
-
testOnSelectedGroup
public boolean testOnSelectedGroup()
Indicates that, when sampling the rules, the sample should be computed using only a selected group of rules.- Returns:
- true to test using only a selected group of rules. False to test using all the rules matching a given language.
-
setTestOnSelectedGroup
public void setTestOnSelectedGroup(boolean value)
Sets the indicator on how to apply rules for samples.- Parameters:
value
- true to test using only a selected group of rules. False to test using all the rules matching a given language.
-
isModified
public boolean isModified()
Indicates if the document has been modified since the last load or save.- Returns:
- true if the document has been modified, false otherwise.
-
setModified
public void setModified(boolean value)
Sets the flag indicating if the document has been modified since the last load or save. If you change the rules or language maps directly through their lists, make sure to set this flag to true.
- Parameters:
value
- true if the document has been changed, false otherwise.
-
addLanguageRule
public void addLanguageRule(String name, ArrayList<Rule> langRule)
Adds a language rule to this SRX document. If another language rule with the same name already exists, it is replaced by the new one without warning.
- Parameters:
name
- name of the language rule to add.
langRule
- language rule object to add.
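The replace-without-warning behavior matches what a plain map `put` does. The sketch below assumes the rules are kept in a `LinkedHashMap` (the type returned by getAllLanguageRules()) and uses `String` as a stand-in for the `Rule` class. The class `LanguageRuleRegistry` is hypothetical and only mirrors the method names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;

class LanguageRuleRegistry {
    // Rule groups keyed by name; String stands in for the Rule class here.
    private final LinkedHashMap<String, ArrayList<String>> langRules = new LinkedHashMap<>();

    // put() silently replaces the value of an existing key and keeps the
    // key's original position in the iteration order.
    void addLanguageRule(String name, ArrayList<String> langRule) {
        langRules.put(name, langRule);
    }

    LinkedHashMap<String, ArrayList<String>> getAllLanguageRules() {
        return langRules;
    }

    public static void main(String[] args) {
        LanguageRuleRegistry doc = new LanguageRuleRegistry();
        doc.addLanguageRule("Default", new ArrayList<>(Arrays.asList("rule A")));
        doc.addLanguageRule("French", new ArrayList<>(Arrays.asList("rule B")));
        doc.addLanguageRule("Default", new ArrayList<>(Arrays.asList("rule C")));
        // Prints {Default=[rule C], French=[rule B]}
        System.out.println(doc.getAllLanguageRules());
    }
}
```

If losing an existing group matters to the caller, checking `getAllLanguageRules().containsKey(name)` before adding is one way to detect the collision.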
-
addLanguageMap
public void addLanguageMap(LanguageMap langMap)
Adds a language map to this document. The new map is added at the end of the list of maps already there.
- Parameters:
langMap
- the language map object to add.
-
compileLanguageRules
public ISegmenter compileLanguageRules(LocaleId languageCode, ISegmenter existingSegmenter)
Compiles all the language rules applicable for a given language code, and assigns them to a segmenter. This method applies the language code you specify to the language mappings currently available in the document and compiles the rules when one or more language maps are found. The matching is done in the order of the list of language maps, and more than one can be selected if cascade() is true.
- Parameters:
languageCode
- the language code. The value should be a BCP-47 value (e.g. "de", "fr-ca", etc.)
existingSegmenter
- optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
- Returns:
- the instance of the segmenter with the new compiled rules.
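The selection step described above (match in list order, several matches only when cascading) can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: each language map is modeled as a `{languagePattern, ruleName}` pair, the pattern is assumed to be a regular expression that must match the whole language code, and the class `LanguageMapSelection` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LanguageMapSelection {
    // Scans the language maps in list order and collects the names of the
    // rule groups whose pattern matches the language code.
    static List<String> select(List<String[]> languageMaps, String languageCode,
                               boolean cascade) {
        List<String> selected = new ArrayList<>();
        for (String[] map : languageMaps) {
            String languagePattern = map[0];
            String ruleName = map[1];
            if (Pattern.matches(languagePattern, languageCode)) {
                selected.add(ruleName);
                if (!cascade) {
                    break; // without cascading, only the first match is used
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String[]> maps = Arrays.asList(
            new String[] {"[Ff][Rr].*", "French"},
            new String[] {".*", "Default"});
        System.out.println(select(maps, "fr-ca", true));  // [French, Default]
        System.out.println(select(maps, "fr-ca", false)); // [French]
        System.out.println(select(maps, "de", true));     // [Default]
    }
}
```

Because matching follows list order, placing specific patterns before a catch-all such as `.*` matters most when cascading is off.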
-
compileSingleLanguageRule
public ISegmenter compileSingleLanguageRule(String ruleName, ISegmenter existingSegmenter)
Compiles a single language rule group and assigns it to a segmenter.
- Parameters:
ruleName
- the name of the rule group to apply.
existingSegmenter
- optional existing SRXSegmenter object to re-use. Use null to not re-use anything.
- Returns:
- the instance of the segmenter with the new compiled rules.
-
loadRules
public void loadRules(CharSequence data)
Loads an SRX document from a CharSequence object. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
- Parameters:
data
- the string containing the SRX document to load.
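To give an idea of what rules embedded inside another vocabulary can look like, here is a minimal sketch that uses only the JDK's DOM parser to find SRX rule groups inside a host XML document. It assumes the embedded elements use the SRX 2.0 namespace `http://www.lisa.org/srx20` and the `languagerulename` attribute. The class `EmbeddedSrxScan` and the `<config>` host element are hypothetical, and this is not how the loader of this class is necessarily implemented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EmbeddedSrxScan {
    // Namespace of SRX 2.0 elements (assumed for the embedded rules).
    static final String SRX_NS = "http://www.lisa.org/srx20";

    // Returns the names of all SRX rule groups found anywhere in the
    // document, regardless of the host vocabulary around them.
    static List<String> languageRuleNames(CharSequence data) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace lookups
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(data.toString())));
            NodeList groups = doc.getElementsByTagNameNS(SRX_NS, "languagerule");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < groups.getLength(); i++) {
                names.add(((Element) groups.item(i)).getAttribute("languagerulename"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse the document", e);
        }
    }

    public static void main(String[] args) {
        String host = "<config xmlns:srx='http://www.lisa.org/srx20'>"
            + "<srx:srx version='2.0'><srx:body><srx:languagerules>"
            + "<srx:languagerule languagerulename='Default'/>"
            + "</srx:languagerules></srx:body></srx:srx></config>";
        System.out.println(languageRuleNames(host)); // [Default]
    }
}
```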
-
loadRules
public void loadRules(String pathOrURL)
Loads an SRX document from a file. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary. For SRXDocument.DEFAULT_SRX_RULES (the string "DEFAULT_SRX_RULES" in serialized parameters), this loads the (Okapi recommended) .srx file embedded in the library jar.
- Parameters:
pathOrURL
- The full path or URL of the document to load.
-
loadRules
public void loadRules(InputStream inputStream)
Loads an SRX document from an input stream. Calling this method resets all settings and rules to their default state and then populates them with the data stored in the document being loaded. The rules can be embedded inside another vocabulary.
- Parameters:
inputStream
- the input stream to read from.
-
saveRulesToString
public String saveRulesToString(boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX string.
- Parameters:
saveExtensions
- true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo
- true to save non-SRX-valid attributes, false otherwise.
- Returns:
- the string containing the saved SRX rules.
-
saveRules
public void saveRules(String rulesPath, boolean saveExtensions, boolean saveNonValidInfo)
Saves the current rules to an SRX rules document.
- Parameters:
rulesPath
- the full path of the file where the rules are saved.
saveExtensions
- true to save Okapi SRX extensions, false otherwise.
saveNonValidInfo
- true to save non-SRX-valid attributes, false otherwise.
-
-