Class AbstractMarkupFilter

  • All Implemented Interfaces:
    AutoCloseable, Iterator<Event>, IFilter
    Direct Known Subclasses:
    HtmlFilter, XmlStreamFilter

    public abstract class AbstractMarkupFilter
    extends AbstractFilter
    Abstract base class for building an IFilter around the Jericho parser. Jericho can parse non-well-formed HTML, XHTML, and XML, as well as various server-side scripting languages such as PHP, Mason, and Perl (all configurable from Jericho). AbstractMarkupFilter takes care of the parser initialization and provides default handlers for each token type returned by the parser.

    Handling of translatable text, inline tags, and translatable and read-only attributes is configurable through a user-defined YAML file. See the Okapi HtmlFilter (with defaultConfiguration.yml) and OpenXml filters for examples.

    • Method Detail

      • getParsedHeader

        protected net.htmlparser.jericho.Source getParsedHeader​(InputStream inputStream)
      • hasNext

        public boolean hasNext()
        Description copied from interface: IFilter
        Indicates if there is an event to process.

        Implementer Note: The caller must be able to call this method several times without changing state.

        Returns:
        True if there is at least one event to process, false if not.
      • next

        public Event next()
        Queue up Jericho tokens until we can build an Okapi Event and return it.
        Returns:
        The next event available or null if there are no events.
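
        The hasNext/next contract above can be illustrated with a minimal, self-contained sketch. It does not use the Okapi or Jericho classes: QueuedEventSketch, its String tokens, and the "TAG:"/"TEXT:" event labels are hypothetical stand-ins. The sketch only shows the pattern of queuing tokens until an event can be built, with a hasNext that never changes state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a filter that turns parser tokens into events. */
final class QueuedEventSketch implements Iterator<String> {
    private final Iterator<String> tokens;                     // stand-in for the parser's token stream (non-empty tokens assumed)
    private final Deque<String> ready = new ArrayDeque<>();    // events built but not yet returned
    private final StringBuilder pending = new StringBuilder(); // text run still being collected

    QueuedEventSketch(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    /** Reports state only; it never consumes a token, so repeated calls are safe. */
    @Override
    public boolean hasNext() {
        return !ready.isEmpty() || pending.length() > 0 || tokens.hasNext();
    }

    /** Consumes tokens until one event is complete; returns null when nothing is left. */
    @Override
    public String next() {
        while (ready.isEmpty() && tokens.hasNext()) {
            String token = tokens.next();
            if (token.startsWith("<")) {
                flushText();                // a tag closes the current text run
                ready.add("TAG:" + token);
            } else {
                pending.append(token);      // keep collecting the text run
            }
        }
        if (ready.isEmpty()) {
            flushText();                    // end of input: emit any trailing text
        }
        return ready.poll();                // null when the queue is empty
    }

    private void flushText() {
        if (pending.length() > 0) {
            ready.add("TEXT:" + pending);
            pending.setLength(0);
        }
    }
}
```

        In this sketch, calling hasNext() any number of times before next() returns the same answer, and next() returns null once the input is exhausted, matching the "Returns" note above.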
      • createEventBuilder

        protected AbstractMarkupEventBuilder createEventBuilder()
        Delayed initialization of the EventBuilder. This will be called when the filter is initialized if an EventBuilder was not previously passed to the constructor.
        Returns:
      • startFilter

        protected void startFilter()
        Initialize the filter for every input and send the StartDocument Event.
      • endFilter

        protected void endFilter()
        End the current filter processing and send the ending Event.
      • isInline

        public boolean isInline​(net.htmlparser.jericho.Segment segment)
        Determine, based on rule state and segment type, whether we are still inside a text run.
        Parameters:
        segment -
        Returns:
      • preProcess

        protected void preProcess​(net.htmlparser.jericho.Segment segment)
        Do any handling needed before the current Segment is processed. Default is to do nothing.
        Parameters:
        segment -
      • postProcessTextUnit

        protected void postProcessTextUnit​(ITextUnit textUnit)
        Do any required post-processing on the TextUnit before the Event leaves the IFilter. The default implementation leaves the Event unchanged. Override this method if you need to do format-specific handling such as collapsing whitespace.
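
        As an illustration of the kind of format-specific handling an override might perform, here is a minimal sketch of whitespace collapsing on a plain String. WhitespaceSketch and its postProcess method are hypothetical; a real override would operate on the ITextUnit content, and the preserveWhitespace parameter only mirrors the idea behind isPreserveWhitespace().

```java
/** Hypothetical helper showing whitespace collapsing on plain text. */
final class WhitespaceSketch {
    private WhitespaceSketch() {}

    /**
     * Returns the text unchanged when whitespace must be preserved;
     * otherwise replaces each whitespace run with one space and trims both ends.
     */
    static String postProcess(String text, boolean preserveWhitespace) {
        if (preserveWhitespace) {
            return text;
        }
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

        Trimming both ends is a choice made for this sketch; a filter may instead keep a single leading or trailing space where it is significant.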
      • handleServerCommonEscaped

        protected void handleServerCommonEscaped​(net.htmlparser.jericho.Tag tag)
        Handle any recognized escaped server tags.
        Parameters:
        tag -
      • handleServerCommon

        protected void handleServerCommon​(net.htmlparser.jericho.Tag tag)
        Handle any recognized server tags (e.g., PHP, Mason, etc.).
        Parameters:
        tag -
      • handleXmlDeclaration

        protected void handleXmlDeclaration​(net.htmlparser.jericho.Tag tag)
        Handle an XML declaration.
        Parameters:
        tag -
      • handleDocTypeDeclaration

        protected void handleDocTypeDeclaration​(net.htmlparser.jericho.Tag tag)
        Handle the XML doc type declaration (DTD).
        Parameters:
        tag -
      • handleProcessingInstruction

        protected void handleProcessingInstruction​(net.htmlparser.jericho.Tag tag)
        Handle processing instructions.
        Parameters:
        tag -
      • handleComment

        protected void handleComment​(net.htmlparser.jericho.Tag tag)
        Handle comments.
        Parameters:
        tag -
      • handleCdataSection

        protected void handleCdataSection​(net.htmlparser.jericho.Tag tag)
        Handle CDATA sections.
        Parameters:
        tag -
      • handleText

        protected void handleText​(CharSequence text)
        Handle all text (PCDATA).
        Parameters:
        text -
      • isWhiteSpace

        protected boolean isWhiteSpace​(CharSequence text)
      • handleNumericEntity

        protected void handleNumericEntity​(net.htmlparser.jericho.NumericCharacterReference entity)
        Handle all numeric entities. Default implementation converts entity to Unicode character.
        Parameters:
        entity - the numeric entity
      • handleCharacterEntity

        protected void handleCharacterEntity​(net.htmlparser.jericho.CharacterEntityReference entity)
        Handle all character entities. Default implementation converts entity to Unicode character.
        Parameters:
        entity - the character entity
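
        Both entity handlers default to converting the reference to a Unicode character. The sketch below shows that conversion on raw reference strings, without the Jericho classes. EntitySketch, its method names, and its four-entry name table are hypothetical; Jericho itself knows the full HTML entity set.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that decodes entity references given as raw strings. */
final class EntitySketch {
    // Tiny illustrative table, not the full HTML entity list.
    private static final Map<String, Integer> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", 0x26);
        NAMED.put("lt", 0x3C);
        NAMED.put("gt", 0x3E);
        NAMED.put("eacute", 0xE9);
    }

    private EntitySketch() {}

    /** Decodes "&#233;" (decimal) or "&#xE9;" (hexadecimal) to the character it names. */
    static String decodeNumeric(String reference) {
        String body = reference.substring(2, reference.length() - 1); // strip "&#" and ";"
        boolean hex = body.startsWith("x") || body.startsWith("X");
        int codePoint = hex ? Integer.parseInt(body.substring(1), 16) : Integer.parseInt(body);
        return new String(Character.toChars(codePoint)); // handles supplementary code points too
    }

    /** Decodes "&amp;" style references; unknown names are returned unchanged. */
    static String decodeNamed(String reference) {
        String name = reference.substring(1, reference.length() - 1); // strip "&" and ";"
        Integer codePoint = NAMED.get(name);
        return codePoint == null ? reference : new String(Character.toChars(codePoint));
    }
}
```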
      • handleStartTag

        protected void handleStartTag​(net.htmlparser.jericho.StartTag startTag)
        Handle start tags.
        Parameters:
        startTag -
      • handleEndTag

        protected void handleEndTag​(net.htmlparser.jericho.EndTag endTag)
        Handle end tags, including empty tags.
        Parameters:
        endTag -
      • handleDocumentPart

        protected void handleDocumentPart​(net.htmlparser.jericho.Tag tag)
        Handle anything else not classified by Jericho.
        Parameters:
        tag -
      • normalizeAttributeName

        protected abstract String normalizeAttributeName​(String attrName,
                                                         String attrValue,
                                                         net.htmlparser.jericho.Tag tag)
        Some attribute names are converted to Okapi standards, such as HTML charset to "encoding" and lang to "language".
        Parameters:
        attrName - the attribute name
        attrValue - the attribute value
        tag - the Jericho Tag that contains the attribute
        Returns:
        the attribute name after it has passed through the normalization rules
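
        A minimal sketch of the two mappings named above, assuming a case-insensitive match on the attribute name alone. AttributeNameSketch is hypothetical, and unlike the real abstract method it ignores the attribute value and the enclosing tag, which a concrete filter may need in order to decide.

```java
import java.util.Locale;

/** Hypothetical helper applying the two name mappings mentioned in the description. */
final class AttributeNameSketch {
    private AttributeNameSketch() {}

    /** Maps charset to "encoding" and lang to "language"; other names pass through unchanged. */
    static String normalize(String attrName) {
        switch (attrName.toLowerCase(Locale.ROOT)) {
            case "charset":
                return "encoding";
            case "lang":
                return "language";
            default:
                return attrName;
        }
    }
}
```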
      • addCodeToCurrentTextUnit

        protected void addCodeToCurrentTextUnit​(net.htmlparser.jericho.Tag tag)
        Add a Code to the current TextUnit. Throws an exception if there is no current TextUnit.
        Parameters:
        tag - the Jericho Tag that is converted to an Okapi Code
      • addCodeToCurrentTextUnit

        protected void addCodeToCurrentTextUnit​(net.htmlparser.jericho.Tag tag,
                                                boolean endCodeNow)
        Add a Code to the current TextUnit. Throws an exception if there is no current TextUnit.
        Parameters:
        tag - the Jericho Tag that is converted to an Okapi Code
        endCodeNow - whether to end the code now or delay so that more content can be added to the code
      • createPropertyTextUnitPlaceholders

        protected List<PropertyTextUnitPlaceholder> createPropertyTextUnitPlaceholders​(net.htmlparser.jericho.StartTag startTag)
        For the given Jericho StartTag, parse out all the actionable attributes and store them as PropertyTextUnitPlaceholder objects. The PropertyTextUnitPlaceholder.PlaceholderAccessType is set for each attribute name and value based on the filter configuration.
        Parameters:
        startTag - Jericho StartTag
        Returns:
        all actionable (translatable, writable or read-only) attributes found in the StartTag
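
        The selection of actionable attributes can be pictured with the following sketch, which uses plain maps instead of a Jericho StartTag and a filter configuration. AttributePlaceholderSketch, its Access enum, and its Placeholder class are hypothetical simplifications, not the Okapi PropertyTextUnitPlaceholder API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of picking actionable attributes according to configured rules. */
final class AttributePlaceholderSketch {
    /** Simplified access types, mirroring the three kinds named in the description. */
    enum Access { TRANSLATABLE, WRITABLE, READ_ONLY }

    /** Simplified placeholder: which attribute was found and how it may be used. */
    static final class Placeholder {
        final Access access;
        final String name;
        final String value;

        Placeholder(Access access, String name, String value) {
            this.access = access;
            this.name = name;
            this.value = value;
        }
    }

    private AttributePlaceholderSketch() {}

    /** Keeps only attributes that have a rule, in the iteration order of the attribute map. */
    static List<Placeholder> actionable(Map<String, String> attributes, Map<String, Access> rules) {
        List<Placeholder> result = new ArrayList<>();
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            Access access = rules.get(attr.getKey());
            if (access != null) {
                result.add(new Placeholder(access, attr.getKey(), attr.getValue()));
            }
        }
        return result;
    }
}
```

        Attributes without a rule are simply skipped here, which corresponds to the method returning only the actionable attributes of the tag.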
      • isUtf8Encoding

        protected boolean isUtf8Encoding()
        Is the input encoded as UTF-8?
        Overrides:
        isUtf8Encoding in class AbstractFilter
        Returns:
        true if the document is in UTF-8 encoding.
      • isUtf8Bom

        protected boolean isUtf8Bom()
        Does the input have a UTF-8 Byte Order Mark?
        Overrides:
        isUtf8Bom in class AbstractFilter
        Returns:
        true if the document has a UTF-8 byte order mark.
      • isBOM

        protected boolean isBOM()
        Does the input have a BOM?
        Returns:
        true if the document has a BOM.
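
        For readers unfamiliar with byte order marks, the sketch below shows how the leading bytes of an input can be tested for one. BomSketch is hypothetical and is not how the filter obtains its answers; it only demonstrates the byte patterns involved (EF BB BF for UTF-8, FE FF or FF FE for UTF-16).

```java
/** Hypothetical helper that inspects the first bytes of an input for a BOM. */
final class BomSketch {
    private BomSketch() {}

    /** True when the input starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /**
     * True for a UTF-8 BOM or a UTF-16 BOM (big- or little-endian).
     * A UTF-32 big-endian BOM (00 00 FE FF) is not covered by this sketch.
     */
    static boolean hasBom(byte[] head) {
        if (hasUtf8Bom(head)) {
            return true;
        }
        if (head.length < 2) {
            return false;
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        return (b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE);
    }
}
```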
      • isDocumentEncoding

        protected boolean isDocumentEncoding()
        Does this document have a document encoding specified?
        Returns:
        true if the document has a meta tag with an encoding, false otherwise
      • isPreserveWhitespace

        protected boolean isPreserveWhitespace()
        Returns:
        the preserveWhitespace boolean.
      • setPreserveWhitespace

        protected void setPreserveWhitespace​(boolean preserveWhitespace)
      • addToDocumentPart

        protected void addToDocumentPart​(String part)
      • addToTextUnit

        protected void addToTextUnit​(String text)
      • startTextUnit

        protected void startTextUnit​(String text)
      • setTextUnitName

        protected void setTextUnitName​(String name)
      • setTextUnitType

        protected void setTextUnitType​(String type)
      • setCurrentDocName

        protected void setCurrentDocName​(String currentDocName)
      • getCurrentDocName

        protected String getCurrentDocName()
      • canStartNewTextUnit

        protected boolean canStartNewTextUnit()
      • isInsideTextRun

        protected boolean isInsideTextRun()
      • addToTextUnit

        protected void addToTextUnit​(Code code,
                                     boolean endCodeNow)
      • addToTextUnit

        protected void addToTextUnit​(Code code)
      • endDocumentPart

        protected void endDocumentPart()
      • startTextUnit

        protected void startTextUnit​(GenericSkeleton startMarker)
      • startTextUnit

        protected void startTextUnit()
      • getTextUnitId

        protected long getTextUnitId()
      • setTextUnitMimeType

        protected void setTextUnitMimeType​(String mimeType)
      • setDocumentPartId

        protected void setDocumentPartId​(long id)
      • peekTempEvent

        protected Event peekTempEvent()
      • setMimeType

        public void setMimeType​(String mimeType)
        Sets the input document mime type.
        Overrides:
        setMimeType in class AbstractFilter
        Parameters:
        mimeType - the new mime type