Class AbstractMarkupFilter
- java.lang.Object
-
- net.sf.okapi.common.filters.AbstractFilter
-
- net.sf.okapi.filters.abstractmarkup.AbstractMarkupFilter
-
- All Implemented Interfaces:
AutoCloseable,Iterator<Event>,IFilter
- Direct Known Subclasses:
HtmlFilter,XmlStreamFilter
public abstract class AbstractMarkupFilter extends AbstractFilter
Abstract class useful for creating an IFilter around the Jericho parser. Jericho can parse non-well-formed HTML, XHTML, XML and various server-side scripting languages such as PHP, Mason and Perl (all configurable from Jericho). AbstractMarkupFilter takes care of the parser initialization and provides default handlers for each token type returned by the parser.
Handling of translatable text, inline tags, and translatable and read-only attributes is configurable through a user-defined YAML file. See the Okapi HtmlFilter with defaultConfiguration.yml and the OpenXml filters for examples.
-
-
Field Summary
-
Fields inherited from interface net.sf.okapi.common.filters.IFilter
SUB_FILTER
-
-
Constructor Summary
Constructor: AbstractMarkupFilter()
Description: Default constructor for AbstractMarkupFilter using the default AbstractMarkupEventBuilder.
-
Method Summary
Modifier and Type, Method: Description
protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag)
protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag, boolean endCodeNow)
protected void addToDocumentPart(String part)
protected void addToTextUnit(String text)
protected void addToTextUnit(Code code)
protected void addToTextUnit(Code code, boolean endCodeNow)
protected void addToTextUnit(Code code, boolean endCodeNow, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected boolean canStartNewTextUnit()
void close(): Close the filter and all used resources.
protected AbstractMarkupEventBuilder createEventBuilder(): Delayed initialization of the EventBuilder.
protected PropertyTextUnitPlaceholder createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute)
protected List<PropertyTextUnitPlaceholder> createPropertyTextUnitPlaceholders(net.htmlparser.jericho.StartTag startTag): For the given Jericho StartTag, parse out all the actionable attributes and store them as PropertyTextUnitPlaceholder objects.
protected String detectEncoding(RawDocument input)
protected TextFragment.TagType determineTagType(net.htmlparser.jericho.Tag tag): Filter-specific method for determining the TextFragment.TagType.
ExtractionRuleState.ExtractionRule disambiguateElementRuleTypes(net.htmlparser.jericho.Tag tag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
protected void endDocumentPart()
protected void endFilter()
protected void endGroup(GenericSkeleton endMarker)
protected void endTextUnit(GenericSkeleton endMarker)
protected abstract TaggedFilterConfiguration getConfig(): Get the current TaggedFilterConfiguration.
protected String getCurrentDocName()
AbstractMarkupEventBuilder getEventBuilder()
ExtractionRuleState.ExtractionRule getMainAttributeRule(net.htmlparser.jericho.StartTag tag, String attributeName, Map<String,String> attributes)
ExtractionRuleState.ExtractionRule getMainElementRule(net.htmlparser.jericho.Tag tag)
protected net.htmlparser.jericho.Source getParsedHeader(InputStream inputStream)
protected ExtractionRuleState getRuleState()
ExtractionRuleState.ExtractionRule getRuleTypeFromStartTag(net.htmlparser.jericho.EndTag endTag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
protected long getTextUnitId()
protected void handleCdataSection(net.htmlparser.jericho.Tag tag): Handle CDATA sections.
protected void handleCharacterEntity(net.htmlparser.jericho.CharacterEntityReference entity): Handle all character entities.
protected void handleComment(net.htmlparser.jericho.Tag tag): Handle comments.
protected void handleDocTypeDeclaration(net.htmlparser.jericho.Tag tag): Handle the XML doc type declaration (DTD).
protected void handleDocumentPart(net.htmlparser.jericho.Tag tag): Handle anything else not classified by Jericho.
protected void handleEndTag(net.htmlparser.jericho.EndTag endTag): Handle end tags, including empty tags.
protected void handleNumericEntity(net.htmlparser.jericho.NumericCharacterReference entity): Handle all numeric entities.
protected void handleProcessingInstruction(net.htmlparser.jericho.Tag tag): Handle processing instructions.
protected void handleServerCommon(net.htmlparser.jericho.Tag tag): Handle any recognized server tags (e.g., PHP or Mason).
protected void handleServerCommonEscaped(net.htmlparser.jericho.Tag tag): Handle any recognized escaped server tags.
protected void handleStartTag(net.htmlparser.jericho.StartTag startTag): Handle start tags.
protected void handleText(CharSequence text): Handle all text (PCDATA).
protected void handleXmlDeclaration(net.htmlparser.jericho.Tag tag): Handle an XML declaration.
boolean hasNext(): Indicates if there is an event to process.
protected boolean isBOM(): Does the input have a BOM?
protected boolean isDocumentEncoding(): Does this document have a document encoding specified?
boolean isInline(net.htmlparser.jericho.Segment segment): Based on the rule state and segment type, determine whether we are still inside a text run.
protected boolean isInsideTextRun()
boolean isMatchedTag(ExtractionRuleState.ExtractionRule currentState, net.htmlparser.jericho.EndTag endTag)
protected boolean isPreserveWhitespace()
protected boolean isUtf8Bom(): Does the input have a UTF-8 Byte Order Mark?
protected boolean isUtf8Encoding(): Is the input encoded as UTF-8?
protected boolean isWhiteSpace(CharSequence text)
Event next(): Queue up Jericho tokens until we can build an Okapi Event and return it.
protected abstract String normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag): Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language".
void open(RawDocument input): Start a new IFilter using the supplied RawDocument.
void open(RawDocument input, boolean generateSkeleton): Start a new IFilter using the supplied RawDocument.
protected Event peekTempEvent()
protected void postProcessTextUnit(ITextUnit textUnit)
protected void preProcess(net.htmlparser.jericho.Segment segment): Do any handling needed before the current Segment is processed.
protected void setCurrentDocName(String currentDocName)
protected void setDocumentPartId(long id)
void setMimeType(String mimeType): Sets the input document mime type.
protected void setPreserveWhitespace(boolean preserveWhitespace)
protected void setTextUnitMimeType(String mimeType)
protected void setTextUnitName(String name)
protected void setTextUnitType(String type)
protected void startDocumentPart(String part, String name, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void startFilter(): Initialize the filter for every input and send the StartDocument Event.
protected void startGroup(GenericSkeleton startMarker, String commonTagType)
protected void startGroup(GenericSkeleton startMarker, String commonTagType, LocaleId locale, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void startTextUnit()
protected void startTextUnit(String text)
protected void startTextUnit(GenericSkeleton startMarker)
protected void startTextUnit(GenericSkeleton startMarker, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
protected void updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag, ExtractionRuleState.ExtractionRule rule)
protected void updateStartTagRuleState(net.htmlparser.jericho.StartTag startTag, ExtractionRuleState.ExtractionRule rule)
-
Methods inherited from class net.sf.okapi.common.filters.AbstractFilter
addConfiguration, addConfiguration, addConfiguration, addConfigurations, cancel, createEndFilterEvent, createFilterWriter, createSkeletonWriter, createStartFilterEvent, findConfiguration, getConfiguration, getConfigurations, getDisplayName, getDocumentId, getDocumentName, getEncoderManager, getEncoding, getFilterConfigurationMapper, getMimeType, getName, getNewlineType, getParameters, getParameters, getParametersClassName, getParentId, getSrcLoc, getTrgLoc, isCanceled, isGenerateSkeleton, isMultilingual, removeConfiguration, setDisplayName, setDocumentName, setEncoding, setFilterConfigurationMapper, setGenerateSkeleton, setMultilingual, setName, setNewlineType, setOptions, setParameters, setParentId, setSrcLoc, setTrgLoc
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.util.Iterator
forEachRemaining, remove
-
-
-
-
Constructor Detail
-
AbstractMarkupFilter
public AbstractMarkupFilter()
Default constructor for AbstractMarkupFilter using the default AbstractMarkupEventBuilder.
-
-
Method Detail
-
getConfig
protected abstract TaggedFilterConfiguration getConfig()
Get the current TaggedFilterConfiguration. A TaggedFilterConfiguration is the result of reading in a YAML configuration file and converting it into Java objects.
- Returns:
- a TaggedFilterConfiguration
-
close
public void close()
Close the filter and all used resources.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface IFilter
- Overrides:
close in class AbstractFilter
-
getParsedHeader
protected net.htmlparser.jericho.Source getParsedHeader(InputStream inputStream)
-
detectEncoding
protected String detectEncoding(RawDocument input)
-
open
public void open(RawDocument input)
Start a new IFilter using the supplied RawDocument.
- Parameters:
input - input to the IFilter (can be a CharSequence, URI or InputStream)
-
open
public void open(RawDocument input, boolean generateSkeleton)
Start a new IFilter using the supplied RawDocument.
- Specified by:
open in interface IFilter
- Overrides:
open in class AbstractFilter
- Parameters:
input - input to the IFilter (can be a CharSequence, URI or InputStream)
generateSkeleton - true if the IFilter should store non-translatable blocks (aka skeleton), false otherwise.
- Throws:
OkapiBadFilterInputException
OkapiIOException
-
hasNext
public boolean hasNext()
Description copied from interface: IFilter
Indicates if there is an event to process.
Implementer Note: The caller must be able to call this method several times without changing state.
- Returns:
- True if there is at least one event to process, false if not.
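The requirement that hasNext() can be called repeatedly without changing state can be illustrated with a small, library-free sketch. QueuedEventIterator and its START/END strings are hypothetical stand-ins, not Okapi classes; the sketch only assumes that events are buffered in a queue and that next() is the sole method that consumes parser tokens.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a filter: hasNext() only inspects state, next() does the work. */
final class QueuedEventIterator implements Iterator<String> {
    private final Iterator<String> tokens;                   // stand-in for parser tokens
    private final Deque<String> ready = new ArrayDeque<>();  // events built but not yet returned

    QueuedEventIterator(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    @Override
    public boolean hasNext() {
        // Never consumes a token, so repeated calls return the same answer.
        return !ready.isEmpty() || tokens.hasNext();
    }

    @Override
    public String next() {
        // Each token yields two events here, so one token can feed several next() calls.
        if (ready.isEmpty() && tokens.hasNext()) {
            String token = tokens.next();
            ready.add("START(" + token + ")");
            ready.add("END(" + token + ")");
        }
        // poll() returns null when nothing is left, mirroring the documented return value of next().
        return ready.poll();
    }
}
```

In this sketch, calling hasNext() twice in a row before next() leaves the token stream untouched; only next() advances it.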
-
next
public Event next()
Queue up Jericho tokens until we can build an Okapi Event and return it.
- Returns:
- The next event available or null if there are no events.
-
createEventBuilder
protected AbstractMarkupEventBuilder createEventBuilder()
Delayed initialization of the EventBuilder. This will be called when the filter is initialized if an EventBuilder was not previously passed to the constructor.
-
startFilter
protected void startFilter()
Initialize the filter for every input and send the StartDocument Event.
-
endFilter
protected void endFilter()
-
getMainAttributeRule
public ExtractionRuleState.ExtractionRule getMainAttributeRule(net.htmlparser.jericho.StartTag tag, String attributeName, Map<String,String> attributes)
-
disambiguateElementRuleTypes
public ExtractionRuleState.ExtractionRule disambiguateElementRuleTypes(net.htmlparser.jericho.Tag tag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
-
getRuleTypeFromStartTag
public ExtractionRuleState.ExtractionRule getRuleTypeFromStartTag(net.htmlparser.jericho.EndTag endTag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
-
getMainElementRule
public ExtractionRuleState.ExtractionRule getMainElementRule(net.htmlparser.jericho.Tag tag)
-
isInline
public boolean isInline(net.htmlparser.jericho.Segment segment)
Based on the rule state and segment type, determine whether we are still inside a text run.
- Parameters:
segment - the current Jericho Segment
-
preProcess
protected void preProcess(net.htmlparser.jericho.Segment segment)
Do any handling needed before the current Segment is processed. Default is to do nothing.
- Parameters:
segment - the current Jericho Segment
-
postProcessTextUnit
protected void postProcessTextUnit(ITextUnit textUnit)
-
handleServerCommonEscaped
protected void handleServerCommonEscaped(net.htmlparser.jericho.Tag tag)
Handle any recognized escaped server tags.
- Parameters:
tag - the Jericho Tag to handle
-
handleServerCommon
protected void handleServerCommon(net.htmlparser.jericho.Tag tag)
Handle any recognized server tags (e.g., PHP or Mason).
- Parameters:
tag - the Jericho Tag to handle
-
handleXmlDeclaration
protected void handleXmlDeclaration(net.htmlparser.jericho.Tag tag)
Handle an XML declaration.
- Parameters:
tag - the Jericho Tag to handle
-
handleDocTypeDeclaration
protected void handleDocTypeDeclaration(net.htmlparser.jericho.Tag tag)
Handle the XML doc type declaration (DTD).
- Parameters:
tag - the Jericho Tag to handle
-
handleProcessingInstruction
protected void handleProcessingInstruction(net.htmlparser.jericho.Tag tag)
Handle processing instructions.
- Parameters:
tag - the Jericho Tag to handle
-
handleComment
protected void handleComment(net.htmlparser.jericho.Tag tag)
Handle comments.
- Parameters:
tag - the Jericho Tag to handle
-
handleCdataSection
protected void handleCdataSection(net.htmlparser.jericho.Tag tag)
Handle CDATA sections.
- Parameters:
tag - the Jericho Tag to handle
-
handleText
protected void handleText(CharSequence text)
Handle all text (PCDATA).
- Parameters:
text - the text to handle
-
isWhiteSpace
protected boolean isWhiteSpace(CharSequence text)
-
handleNumericEntity
protected void handleNumericEntity(net.htmlparser.jericho.NumericCharacterReference entity)
Handle all numeric entities. Default implementation converts the entity to a Unicode character.
- Parameters:
entity - the numeric entity
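The conversion of a numeric character reference to a Unicode character can be sketched without Jericho. NumericEntityDecoder below is a hypothetical helper that works on the raw reference string (for example "&#233;" or "&#xE9;"); it is not the filter's implementation, which receives a Jericho NumericCharacterReference instead.

```java
/** Hypothetical helper: decodes a raw numeric character reference into the character it denotes. */
final class NumericEntityDecoder {
    static String decode(String reference) {
        if (!reference.startsWith("&#") || !reference.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric character reference: " + reference);
        }
        // Strip the leading "&#" and the trailing ";".
        String digits = reference.substring(2, reference.length() - 1);
        boolean hex = digits.startsWith("x") || digits.startsWith("X");
        int codePoint = hex
                ? Integer.parseInt(digits.substring(1), 16)
                : Integer.parseInt(digits, 10);
        // toChars handles supplementary code points by producing a surrogate pair.
        return new String(Character.toChars(codePoint));
    }
}
```

Working with code points rather than a single char keeps references above U+FFFF intact.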
-
handleCharacterEntity
protected void handleCharacterEntity(net.htmlparser.jericho.CharacterEntityReference entity)
Handle all character entities. Default implementation converts the entity to a Unicode character.
- Parameters:
entity - the character entity
-
handleStartTag
protected void handleStartTag(net.htmlparser.jericho.StartTag startTag)
Handle start tags.
- Parameters:
startTag - the Jericho StartTag to handle
-
updateStartTagRuleState
protected void updateStartTagRuleState(net.htmlparser.jericho.StartTag startTag, ExtractionRuleState.ExtractionRule rule)
-
updateEndTagRuleState
protected void updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag, ExtractionRuleState.ExtractionRule rule)
-
isMatchedTag
public boolean isMatchedTag(ExtractionRuleState.ExtractionRule currentState, net.htmlparser.jericho.EndTag endTag)
-
handleEndTag
protected void handleEndTag(net.htmlparser.jericho.EndTag endTag)
Handle end tags, including empty tags.
- Parameters:
endTag - the Jericho EndTag to handle
-
handleDocumentPart
protected void handleDocumentPart(net.htmlparser.jericho.Tag tag)
Handle anything else not classified by Jericho.
- Parameters:
tag - the Jericho Tag to handle
-
normalizeAttributeName
protected abstract String normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag)
Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language".
- Parameters:
attrName - the attribute name
attrValue - the attribute value
tag - the Jericho Tag that contains the attribute
- Returns:
- the attribute name after it has passed through the normalization rules
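A subclass has to supply this normalization itself. The following is a minimal sketch of the two mappings named above (charset to "encoding", lang to "language"), written as a standalone helper. AttributeNameNormalizer is a hypothetical name, and the sketch ignores the attrValue and tag arguments that the real abstract method receives; case-insensitive matching is also an assumption.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper illustrating attribute-name normalization. */
final class AttributeNameNormalizer {
    // Only the two mappings mentioned in the method description are included.
    private static final Map<String, String> OKAPI_NAMES = Map.of(
            "charset", "encoding",
            "lang", "language");

    static String normalize(String attrName) {
        // Assumption: attribute names are matched case-insensitively.
        String key = attrName.toLowerCase(Locale.ROOT);
        // Names without a mapping are returned unchanged.
        return OKAPI_NAMES.getOrDefault(key, attrName);
    }
}
```

A concrete filter would typically call such a helper from its normalizeAttributeName override and add any format-specific rules that depend on the attribute value or the enclosing tag.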
-
addCodeToCurrentTextUnit
protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag)
- Parameters:
tag - the Jericho Tag that is converted to an Okapi Code
-
determineTagType
protected TextFragment.TagType determineTagType(net.htmlparser.jericho.Tag tag)
Filter-specific method for determining the TextFragment.TagType.
- Parameters:
tag - Jericho Tag (start or end tag)
- Returns:
- the TextFragment.TagType: PLACEHOLDER, OPENING or CLOSING
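Because this method is filter-specific, the rule below is only one plausible classification, shown on raw tag text rather than on a Jericho Tag. TagTypeSketch and its TagKind enum are hypothetical local stand-ins, and the rule itself (end tags close, self-closing tags are placeholders, everything else opens) is an assumption rather than the behavior of any particular Okapi filter.

```java
/** Hypothetical sketch: classify raw tag text into opening, closing or placeholder. */
final class TagTypeSketch {
    // Local stand-in enum; not the Okapi TextFragment.TagType.
    enum TagKind { OPENING, CLOSING, PLACEHOLDER }

    static TagKind classify(String rawTag) {
        String t = rawTag.trim();
        if (t.startsWith("</")) {
            return TagKind.CLOSING;      // end tag such as </b>
        }
        if (t.endsWith("/>")) {
            return TagKind.PLACEHOLDER;  // self-closing tag such as <br/>
        }
        return TagKind.OPENING;          // anything else is treated as a start tag
    }
}
```

A real filter would usually also consult its configuration, since elements such as HTML void tags are placeholders even without the trailing slash.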
-
addCodeToCurrentTextUnit
protected void addCodeToCurrentTextUnit(net.htmlparser.jericho.Tag tag, boolean endCodeNow)
- Parameters:
tag - the Jericho Tag that is converted to an Okapi Code
endCodeNow - do we end the code now or delay so we can add more content to the code?
-
createPropertyTextUnitPlaceholders
protected List<PropertyTextUnitPlaceholder> createPropertyTextUnitPlaceholders(net.htmlparser.jericho.StartTag startTag)
For the given Jericho StartTag, parse out all the actionable attributes and store them as PropertyTextUnitPlaceholder objects. The PropertyTextUnitPlaceholder.PlaceholderAccessType of each placeholder is set based on the filter configuration for that attribute.
- Parameters:
startTag - Jericho StartTag
- Returns:
- all actionable (translatable, writable or read-only) attributes found in the StartTag
-
createPropertyTextUnitPlaceholder
protected PropertyTextUnitPlaceholder createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute)
- Parameters:
type - PropertyTextUnitPlaceholder.PlaceholderAccessType is one of TRANSLATABLE, READ_ONLY_PROPERTY, WRITABLE_PROPERTY
name - attribute name
value - attribute value
tag - Jericho Tag which contains the attribute
attribute - attribute as a Jericho Attribute
- Returns:
- a PropertyTextUnitPlaceholder representing the attribute
-
isUtf8Encoding
protected boolean isUtf8Encoding()
Is the input encoded as UTF-8?
- Overrides:
isUtf8Encoding in class AbstractFilter
- Returns:
- true if the document is in UTF-8 encoding.
-
isUtf8Bom
protected boolean isUtf8Bom()
Does the input have a UTF-8 Byte Order Mark?
- Overrides:
isUtf8Bom in class AbstractFilter
- Returns:
- true if the document has a UTF-8 byte order mark.
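For reference, a UTF-8 byte order mark is the three-byte sequence EF BB BF at the start of the input. The check can be sketched on a raw byte array as follows; BomSniffer is a hypothetical helper and does not reflect how the filter itself obtains this flag.

```java
/** Hypothetical helper: tests whether a byte array starts with the UTF-8 byte order mark. */
final class BomSniffer {
    static boolean hasUtf8Bom(byte[] bytes) {
        // Mask with 0xFF because Java bytes are signed.
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }
}
```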
-
isBOM
protected boolean isBOM()
Does the input have a BOM?
- Returns:
- true if the document has a BOM.
-
isDocumentEncoding
protected boolean isDocumentEncoding()
Does this document have a document encoding specified?
- Returns:
- true if the document has a meta tag with an encoding, false otherwise
-
isPreserveWhitespace
protected boolean isPreserveWhitespace()
- Returns:
- the preserveWhitespace boolean.
-
setPreserveWhitespace
protected void setPreserveWhitespace(boolean preserveWhitespace)
-
addToDocumentPart
protected void addToDocumentPart(String part)
-
addToTextUnit
protected void addToTextUnit(String text)
-
startTextUnit
protected void startTextUnit(String text)
-
setTextUnitName
protected void setTextUnitName(String name)
-
setTextUnitType
protected void setTextUnitType(String type)
-
setCurrentDocName
protected void setCurrentDocName(String currentDocName)
-
getCurrentDocName
protected String getCurrentDocName()
-
canStartNewTextUnit
protected boolean canStartNewTextUnit()
-
isInsideTextRun
protected boolean isInsideTextRun()
-
addToTextUnit
protected void addToTextUnit(Code code, boolean endCodeNow)
-
addToTextUnit
protected void addToTextUnit(Code code)
-
addToTextUnit
protected void addToTextUnit(Code code, boolean endCodeNow, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
-
endDocumentPart
protected void endDocumentPart()
-
startDocumentPart
protected void startDocumentPart(String part, String name, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
-
startGroup
protected void startGroup(GenericSkeleton startMarker, String commonTagType)
-
startGroup
protected void startGroup(GenericSkeleton startMarker, String commonTagType, LocaleId locale, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
-
startTextUnit
protected void startTextUnit(GenericSkeleton startMarker)
-
startTextUnit
protected void startTextUnit(GenericSkeleton startMarker, List<PropertyTextUnitPlaceholder> propertyTextUnitPlaceholders)
-
endTextUnit
protected void endTextUnit(GenericSkeleton endMarker)
-
endGroup
protected void endGroup(GenericSkeleton endMarker)
-
startTextUnit
protected void startTextUnit()
-
getTextUnitId
protected long getTextUnitId()
-
setTextUnitMimeType
protected void setTextUnitMimeType(String mimeType)
-
setDocumentPartId
protected void setDocumentPartId(long id)
-
peekTempEvent
protected Event peekTempEvent()
-
getRuleState
protected ExtractionRuleState getRuleState()
-
getEventBuilder
public AbstractMarkupEventBuilder getEventBuilder()
- Returns:
- the eventBuilder
-
setMimeType
public void setMimeType(String mimeType)
Sets the input document mime type.
- Overrides:
setMimeType in class AbstractFilter
- Parameters:
mimeType - the new mime type
-
-