Package net.sf.okapi.filters.html
Class HtmlFilter
- java.lang.Object
  - net.sf.okapi.common.filters.AbstractFilter
    - net.sf.okapi.filters.abstractmarkup.AbstractMarkupFilter
      - net.sf.okapi.filters.html.HtmlFilter
- All Implemented Interfaces: AutoCloseable, Iterator<Event>, IFilter
public class HtmlFilter extends AbstractMarkupFilter
-
Field Summary
-
Fields inherited from interface net.sf.okapi.common.filters.IFilter
SUB_FILTER
-
Constructor Summary
- HtmlFilter()
-
Method Summary
Modifier and Type, Method: Description
- void close(): Close the filter and all used resources.
- protected PropertyTextUnitPlaceholder createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute)
- ISkeletonWriter createSkeletonWriter(): Default case.
- protected TextFragment.TagType determineTagType(net.htmlparser.jericho.Tag tag): Filter specific method for determining TextFragment.TagType
- protected void endFilter(): End the current filter processing and send the Ending Event
- protected TaggedFilterConfiguration getConfig(): Get the current TaggedFilterConfiguration.
- Parameters getParameters(): Gets the current parameters for this filter.
- ExtractionRuleState.ExtractionRule getRuleTypeFromStartTag(net.htmlparser.jericho.EndTag endTag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
- protected void handleEndTag(net.htmlparser.jericho.EndTag endTag): Handle end tags, including empty tags.
- protected String normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag): Some attribute names are converted to Okapi standards such as HTML charset to "encoding" and lang to "language"
- void open(RawDocument input, boolean generateSkeleton): Start a new IFilter using the supplied RawDocument.
- protected void preProcess(net.htmlparser.jericho.Segment segment): Do any handling needed before the current Segment is processed.
- void setParameters(IParameters params): Sets new parameters for this filter.
- void setParametersFromFile(File config): Initialize filter parameters from a Java File.
- void setParametersFromString(String config): Initialize filter parameters from a String.
- void setParametersFromURL(URL config): Initialize filter parameters from a URL.
- protected void startFilter(): Initialize rule state and parser.
- protected void updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag, ExtractionRuleState.ExtractionRule rule)
- protected void updateStartTagRuleState(net.htmlparser.jericho.StartTag startTag, ExtractionRuleState.ExtractionRule rule)
-
Methods inherited from class net.sf.okapi.filters.abstractmarkup.AbstractMarkupFilter
addCodeToCurrentTextUnit, addCodeToCurrentTextUnit, addToDocumentPart, addToTextUnit, addToTextUnit, addToTextUnit, addToTextUnit, canStartNewTextUnit, createEventBuilder, createPropertyTextUnitPlaceholders, detectEncoding, disambiguateElementRuleTypes, endDocumentPart, endGroup, endTextUnit, getCurrentDocName, getEventBuilder, getMainAttributeRule, getMainElementRule, getParsedHeader, getRuleState, getTextUnitId, handleCdataSection, handleCharacterEntity, handleComment, handleDocTypeDeclaration, handleDocumentPart, handleNumericEntity, handleProcessingInstruction, handleServerCommon, handleServerCommonEscaped, handleStartTag, handleText, handleXmlDeclaration, hasNext, isBOM, isDocumentEncoding, isInline, isInsideTextRun, isMatchedTag, isPreserveWhitespace, isUtf8Bom, isUtf8Encoding, isWhiteSpace, next, open, peekTempEvent, postProcessTextUnit, setCurrentDocName, setDocumentPartId, setMimeType, setPreserveWhitespace, setTextUnitMimeType, setTextUnitName, setTextUnitType, startDocumentPart, startGroup, startGroup, startTextUnit, startTextUnit, startTextUnit, startTextUnit
-
Methods inherited from class net.sf.okapi.common.filters.AbstractFilter
addConfiguration, addConfiguration, addConfiguration, addConfigurations, cancel, createEndFilterEvent, createFilterWriter, createStartFilterEvent, findConfiguration, getConfiguration, getConfigurations, getDisplayName, getDocumentId, getDocumentName, getEncoderManager, getEncoding, getFilterConfigurationMapper, getMimeType, getName, getNewlineType, getParameters, getParametersClassName, getParentId, getSrcLoc, getTrgLoc, isCanceled, isGenerateSkeleton, isMultilingual, removeConfiguration, setDisplayName, setDocumentName, setEncoding, setFilterConfigurationMapper, setGenerateSkeleton, setMultilingual, setName, setNewlineType, setOptions, setParentId, setSrcLoc, setTrgLoc
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.util.Iterator
forEachRemaining, remove
-
Method Detail
-
createSkeletonWriter
public ISkeletonWriter createSkeletonWriter()
Description copied from class: AbstractFilter
Default case. Override if needed.
- Specified by: createSkeletonWriter in interface IFilter
- Overrides: createSkeletonWriter in class AbstractFilter
- Returns: new instance of GenericSkeletonWriter
-
open
public void open(RawDocument input, boolean generateSkeleton)
Description copied from class: AbstractMarkupFilter
Start a new IFilter using the supplied RawDocument.
- Specified by: open in interface IFilter
- Overrides: open in class AbstractMarkupFilter
- Parameters:
  - input - input to the IFilter (can be a CharSequence, URI or InputStream)
  - generateSkeleton - true if the IFilter should store non-translatable blocks (aka skeleton), false otherwise.
-
close
public void close()
Description copied from class: AbstractMarkupFilter
Close the filter and all used resources.
- Specified by: close in interface AutoCloseable
- Specified by: close in interface IFilter
- Overrides: close in class AbstractMarkupFilter
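Because HtmlFilter is listed as implementing both AutoCloseable and Iterator<Event>, callers can in principle combine try-with-resources with a hasNext()/next() loop. The sketch below shows only that calling pattern. StandInFilter and FilterLifecycleSketch are hypothetical stand-ins written for this example, they are not Okapi classes, and plain strings take the place of Event objects.

```java
// Hypothetical stand-in for a filter: an Iterator that is also AutoCloseable.
// It is NOT an Okapi class; strings stand in for Okapi Event objects.
class StandInFilter implements java.util.Iterator<String>, AutoCloseable {
    private final java.util.Iterator<String> events;
    private boolean closed = false;

    StandInFilter(java.util.List<String> events) {
        this.events = events.iterator();
    }

    @Override
    public boolean hasNext() {
        // A closed filter reports no further events.
        return !closed && events.hasNext();
    }

    @Override
    public String next() {
        return events.next();
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

class FilterLifecycleSketch {
    // Reads every event, then lets try-with-resources call close().
    static java.util.List<String> drain(StandInFilter filter) {
        java.util.List<String> seen = new java.util.ArrayList<>();
        try (StandInFilter f = filter) {
            while (f.hasNext()) {
                seen.add(f.next());
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        StandInFilter filter = new StandInFilter(
                java.util.Arrays.asList("START_DOCUMENT", "TEXT_UNIT", "END_DOCUMENT"));
        System.out.println(drain(filter));
        System.out.println("closed: " + filter.isClosed());
    }
}
```

With the real class, the same shape would presumably wrap a call to open(RawDocument, boolean) before the loop. That part is left out here because it depends on the Okapi library.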
-
startFilter
protected void startFilter()
Initialize rule state and parser. Called before processing of each input.
- Overrides: startFilter in class AbstractMarkupFilter
-
endFilter
protected void endFilter()
End the current filter processing and send the Ending Event
- Overrides: endFilter in class AbstractMarkupFilter
-
preProcess
protected void preProcess(net.htmlparser.jericho.Segment segment)
Description copied from class: AbstractMarkupFilter
Do any handling needed before the current Segment is processed. Default is to do nothing.
- Overrides: preProcess in class AbstractMarkupFilter
-
updateStartTagRuleState
protected void updateStartTagRuleState(net.htmlparser.jericho.StartTag startTag, ExtractionRuleState.ExtractionRule rule)
- Overrides: updateStartTagRuleState in class AbstractMarkupFilter
-
getRuleTypeFromStartTag
public ExtractionRuleState.ExtractionRule getRuleTypeFromStartTag(net.htmlparser.jericho.EndTag endTag, EnumSet<TaggedFilterConfiguration.RULE_TYPE> ruleTypes)
- Overrides: getRuleTypeFromStartTag in class AbstractMarkupFilter
-
updateEndTagRuleState
protected void updateEndTagRuleState(net.htmlparser.jericho.EndTag endTag, ExtractionRuleState.ExtractionRule rule)
- Overrides: updateEndTagRuleState in class AbstractMarkupFilter
-
handleEndTag
protected void handleEndTag(net.htmlparser.jericho.EndTag endTag)
Description copied from class: AbstractMarkupFilter
Handle end tags, including empty tags.
- Overrides: handleEndTag in class AbstractMarkupFilter
-
createPropertyTextUnitPlaceholder
protected PropertyTextUnitPlaceholder createPropertyTextUnitPlaceholder(PropertyTextUnitPlaceholder.PlaceholderAccessType type, String name, String value, net.htmlparser.jericho.Tag tag, net.htmlparser.jericho.Attribute attribute)
Description copied from class: AbstractMarkupFilter
- Overrides: createPropertyTextUnitPlaceholder in class AbstractMarkupFilter
- Parameters:
  - type - PropertyTextUnitPlaceholder.PlaceholderAccessType is one of TRANSLATABLE, READ_ONLY_PROPERTY, WRITABLE_PROPERTY
  - name - attribute name
  - value - attribute value
  - tag - Jericho Tag which contains the attribute
  - attribute - attribute as a Jericho Attribute
- Returns: a PropertyTextUnitPlaceholder representing the attribute
-
normalizeAttributeName
protected String normalizeAttributeName(String attrName, String attrValue, net.htmlparser.jericho.Tag tag)
Description copied from class: AbstractMarkupFilter
Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language".
- Specified by: normalizeAttributeName in class AbstractMarkupFilter
- Parameters:
  - attrName - the attribute name
  - attrValue - the attribute value
  - tag - the Jericho Tag that contains the attribute
- Returns: the attribute name after it has passed through the normalization rules
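To make the charset and lang examples concrete, here is a minimal sketch of that kind of name mapping. It is not the Okapi implementation. AttributeNameSketch is a hypothetical class, and the single-argument method ignores the attribute value and the Jericho Tag, which the real method also receives and may well take into account. The two mappings are the only ones the description mentions, and the case-insensitive comparison is an assumption.

```java
// Hypothetical stand-alone sketch; NOT the Okapi implementation.
class AttributeNameSketch {

    // Assumed mapping, based only on the two examples given in the description:
    // charset -> "encoding" and lang -> "language".
    static String normalizeAttributeName(String attrName) {
        if (attrName == null) {
            return null;
        }
        // Assumption: names are compared case-insensitively, as in HTML.
        String lower = attrName.toLowerCase(java.util.Locale.ROOT);
        switch (lower) {
            case "charset":
                return "encoding";
            case "lang":
                return "language";
            default:
                // Names without a rule in this sketch are returned unchanged.
                return attrName;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalizeAttributeName("charset"));
        System.out.println(normalizeAttributeName("lang"));
        System.out.println(normalizeAttributeName("title"));
    }
}
```

A subclass overriding the real protected method would have the tag and attribute value available as well, so it could apply a rule only inside specific elements.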
-
getConfig
protected TaggedFilterConfiguration getConfig()
Description copied from class: AbstractMarkupFilter
Get the current TaggedFilterConfiguration. A TaggedFilterConfiguration is the result of reading in a YAML configuration file and converting it into Java Objects.
- Specified by: getConfig in class AbstractMarkupFilter
- Returns: a TaggedFilterConfiguration
-
setParameters
public void setParameters(IParameters params)
Description copied from interface: IFilter
Sets new parameters for this filter.
- Specified by: setParameters in interface IFilter
- Overrides: setParameters in class AbstractFilter
- Parameters:
  - params - The new parameters to use.
-
getParameters
public Parameters getParameters()
Description copied from interface: IFilter
Gets the current parameters for this filter.
- Specified by: getParameters in interface IFilter
- Overrides: getParameters in class AbstractFilter
- Returns: The current parameters for this filter, or DefaultParameters if this filter has no parameters.
-
setParametersFromURL
public void setParametersFromURL(URL config)
Initialize filter parameters from a URL.
- Parameters:
  - config - the URL to read the filter parameters from
-
setParametersFromFile
public void setParametersFromFile(File config)
Initialize filter parameters from a Java File.
- Parameters:
  - config - the File to read the filter parameters from
-
setParametersFromString
public void setParametersFromString(String config)
Initialize filter parameters from a String.
- Parameters:
  - config - the String containing the filter parameters
-
determineTagType
protected TextFragment.TagType determineTagType(net.htmlparser.jericho.Tag tag)
Description copied from class: AbstractMarkupFilter
Filter-specific method for determining the TextFragment.TagType.
- Overrides: determineTagType in class AbstractMarkupFilter
- Parameters:
  - tag - Jericho Tag (start or end tag)
- Returns: the TextFragment.TagType: PLACEHOLDER, OPENING or CLOSING