info.informatica.html
Class HTMLFragment

java.lang.Object
  extended by info.informatica.doc.DocumentFragment
      extended by info.informatica.html.HTMLFragment
All Implemented Interfaces:
Comparable<info.informatica.doc.DocumentFragment>

public class HTMLFragment
extends info.informatica.doc.DocumentFragment

HTML fragment, with API methods for parsing and basic manipulations.

This class provides simple and fast parsing capabilities, written with these ideas in mind:

  1. The parser must be permissive with HTML, avoiding a common problem of other parsers, where real-world HTML makes parsing fail too often.
  2. It should be easy to use, providing direct API methods for the most common tasks.
  3. Parse as little as possible: parsing is oriented toward accessing specific parts of the document, as opposed to APIs like DOM or SAX, where you essentially parse the whole document as HTML. This approach is especially useful when an HTML document is used as a template, because it avoids wasting resources. If you want to parse the entire document, use HTMLEventParser instead.
When you need to access a particular tag, you can identify it by its name (type), by its ID, or by its position in the document.

This class provides a fast and small-footprint approach to parsing.

Author:
amengual at informatica dot info
See Also:
HTMLEventParser

Nested Class Summary
 
Nested classes/interfaces inherited from class info.informatica.doc.DocumentFragment
info.informatica.doc.DocumentFragment.FragmentComp, info.informatica.doc.DocumentFragment.NotFragmentComp
 
Constructor Summary
HTMLFragment(info.informatica.doc.DocumentFragment fragment)
           
HTMLFragment(HTMLTag tag)
           
HTMLFragment(String html)
           
HTMLFragment(String html, info.informatica.doc.FragmentPosition pos)
           
 
Method Summary
 String eraseComments()
          Gets a version of this fragment with all the comments erased (substituted by spaces).
 String eraseTags()
          Gets a version of this fragment where each tag of this document has been replaced by a blank space, including comments.
static String eraseTags(String html)
          Replaces each tag of the given HTML text by a blank space.
 info.informatica.doc.FragmentPosition findBlockByName(String tagname)
          Returns the position of a start-end tag and the enclosed fragment (this is called a block).
 info.informatica.doc.FragmentPosition findTagByName(String tagname)
          Gets the position of a tag of a given type.
 info.informatica.doc.FragmentPosition findTagByName(String tagname, int inipos)
          Gets the position of a tag of a given type, starting to search for it at a given place.
 TagIterator getAllTags()
           
 CharData getCharData(info.informatica.doc.FragmentPosition pos)
          Gets the character data at the given position.
 TagFinder getFinder()
          Gets a finder of all tags.
 IdTagFinder getIdFinder(String tagid)
          Gets a finder of tags of the given ID.
 NameTagFinder getNameFinder(String tagname)
          Gets a finder of tags of the given type.
 HTMLTag getTag(info.informatica.doc.FragmentPosition pos)
          Gets a tag by its position in the document.
 HTMLTag getTagBlockByName(String tagname, int inipos)
          Gets the tag block consisting of the tag named tagname and the enclosed character data.
 HTMLTag getTagById(String tagid)
          Convenience method that gets the tag of ID tagid.
 HTMLTag getTagByName(String tagname)
          Convenience method that gets the tag of type tagname.
 CharData getTagDataById(String tagid)
          Gets the Character Data enclosed by given tag of ID tagid.
 CharData getTagDataByName(String tagname)
          Gets the Character Data enclosed by given tag of name tagname.
 CharData getTagDataByName(String tagname, int inipos)
          Gets the Character Data enclosed by the tag of name tagname, searching from position inipos.
 TagParser getTagParser()
          Gets the tag parser that will be used to parse this fragment's tags.
 TagIterator getTagsById(String tagid)
          Gets all the tags with ID tagid in the document.
 TagIterator getTagsByName(String tagname)
          Gets all the tags of type tagname in the document.
 void insertAfter(info.informatica.doc.FragmentPosition pos, HTMLFragment newel)
          Insert a fragment after the given position.
 void insertAfter(info.informatica.doc.FragmentPosition pos, String newstr)
          Insert a String after the given position.
 void insertBefore(info.informatica.doc.FragmentPosition pos, info.informatica.doc.DocumentFragment newel)
          Insert an element before the given position.
 void insertBefore(info.informatica.doc.FragmentPosition pos, String newstr)
          Insert a string before the given position.
 int length()
           
 void remove(info.informatica.doc.FragmentPosition pos)
          Removes an HTML fragment.
 void removeBlock(HTMLTag tag)
          Removes a tag and all the enclosed fragments, if any.
 void removePair(HTMLTag tag)
          Removes both the start and end tag (if any).
 void replace(info.informatica.doc.FragmentPosition pos, HTMLFragment newel)
          Replaces a subfragment with a new one.
 void replace(info.informatica.doc.FragmentPosition pos, String newstr)
          Replaces a subfragment with a string.
 void setTagParser(TagParser tagParser)
          Sets the tag parser that will be used to parse this fragment's tags.
 String toPureText()
          Gets a text version of the fragment, obtained after erasing all comments and all tags.
static String toPureText(String s)
          Gets the plain text version of a String containing HTML.
 String toString()
           
 void update(info.informatica.doc.DocumentFragment e)
          Updates the given subfragment in the document.
 
Methods inherited from class info.informatica.doc.DocumentFragment
adjustWidth, compareTo, getCurrentPosition, getPosition, setPosition
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HTMLFragment

public HTMLFragment(String html)

HTMLFragment

public HTMLFragment(String html,
                    info.informatica.doc.FragmentPosition pos)

HTMLFragment

public HTMLFragment(info.informatica.doc.DocumentFragment fragment)

HTMLFragment

public HTMLFragment(HTMLTag tag)
Method Detail

remove

public void remove(info.informatica.doc.FragmentPosition pos)
            throws HTMLDocumentException
Removes an HTML fragment.

Throws:
HTMLParsingException - if removal could not be done.
HTMLDocumentException

replace

public void replace(info.informatica.doc.FragmentPosition pos,
                    HTMLFragment newel)
             throws HTMLParsingException
Replaces a subfragment with a new one.

Parameters:
newel - the new subfragment which replaces the old one.
pos - the position where the old fragment is.
Throws:
HTMLParsingException - if replacement was not successful.

replace

public void replace(info.informatica.doc.FragmentPosition pos,
                    String newstr)
             throws HTMLParsingException
Replaces a subfragment with a string.

Parameters:
newstr - the String which replaces the subfragment.
pos - the position where the old fragment is.
Throws:
HTMLParsingException - if replacement was not successful.

update

public void update(info.informatica.doc.DocumentFragment e)
Updates the given subfragment in the document.

Call this method immediately after each modification of an element to keep the document consistent.

Parameters:
e - the element to update.

removePair

public void removePair(HTMLTag tag)
                throws HTMLDocumentException
Removes both the start and end tag (if any). Enclosed fragments are kept.

Parameters:
tag - the tag to be removed.
Throws:
HTMLDocumentException

removeBlock

public void removeBlock(HTMLTag tag)
                 throws info.informatica.doc.DocumentException
Removes a tag and all the enclosed fragments, if any.

Parameters:
tag - the tag to be removed.
Throws:
info.informatica.doc.DocumentException

findBlockByName

public info.informatica.doc.FragmentPosition findBlockByName(String tagname)
Returns the position of a start-end tag and the enclosed fragment (this is called a block).

Parameters:
tagname - name of the tag.
Returns:
the position of the requested block, or null if no such block was found.

insertBefore

public void insertBefore(info.informatica.doc.FragmentPosition pos,
                         info.informatica.doc.DocumentFragment newel)
                  throws HTMLParsingException
Insert an element before the given position.

Parameters:
pos - the position before which the element must be inserted.
newel - the element to be inserted.
Throws:
HTMLParsingException - if the element cannot be inserted at the given position.

insertBefore

public void insertBefore(info.informatica.doc.FragmentPosition pos,
                         String newstr)
                  throws HTMLParsingException
Insert a string before the given position.

Parameters:
pos - the position before which the string must be inserted.
newstr - the string to be inserted.
Throws:
HTMLParsingException - if the string cannot be inserted at the given position.

insertAfter

public void insertAfter(info.informatica.doc.FragmentPosition pos,
                        HTMLFragment newel)
                 throws HTMLParsingException
Insert a fragment after the given position.

Parameters:
pos - the position after which the element must be inserted.
newel - the element to be inserted.
Throws:
HTMLParsingException - if the fragment cannot be inserted at the given position.

insertAfter

public void insertAfter(info.informatica.doc.FragmentPosition pos,
                        String newstr)
                 throws HTMLParsingException
Insert a String after the given position.

Parameters:
pos - the position after which the string must be inserted.
newstr - the string to be inserted.
Throws:
HTMLParsingException - if the string cannot be inserted at the given position.

eraseTags

public String eraseTags()
Gets a version of this fragment where each tag of this document has been replaced by a blank space, including comments.

This fragment remains unaltered; only an erased copy is returned.

Returns:
the HTML after erasing all tags and comments.

eraseTags

public static String eraseTags(String html)
Replaces each tag of the given HTML text by a blank space.

Does not do the same with comments.

Parameters:
html - the HTML to be processed.
Returns:
the HTML after erasing all tags.
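
The behavior described above can be approximated with a short, self-contained sketch. It is not the library's implementation: the class name TagEraserSketch is hypothetical, and the sketch assumes that "a blank space" means one space per tag and that "does not do the same with comments" means comments are copied through unchanged.

```java
final class TagEraserSketch {

    /** Replaces each "<...>" tag with one blank space; comments are copied unchanged. */
    static String eraseTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        while (i < html.length()) {
            if (html.startsWith("<!--", i)) {
                // Assumption: comments are left exactly as they are.
                int end = html.indexOf("-->", i + 4);
                int stop = (end < 0) ? html.length() : end + 3;
                out.append(html, i, stop);
                i = stop;
            } else if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i + 1);
                if (close < 0) {
                    // Unterminated tag: stay permissive and keep the rest as text.
                    out.append(html, i, html.length());
                    break;
                }
                out.append(' ');
                i = close + 1;
            } else {
                out.append(html.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Brackets make the leading and trailing spaces visible.
        System.out.println("[" + eraseTags("<p>Hello <b>world</b></p>") + "]");
    }
}
```

The sketch treats any text between '<' and the next '>' as a tag, so attribute values containing '>' are not handled.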

eraseComments

public String eraseComments()
Gets a version of this fragment with all the comments erased (substituted by spaces).

The length of the returned String is the same as that of the fragment; the comments are simply filled with spaces.

Returns:
a String containing the HTML fragment with the comments erased.
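
A length-preserving erasure of this kind could look like the following sketch. It is an illustration only, not the library's code; the class name CommentEraserSketch is hypothetical, and treating an unterminated comment as running to the end of the input is an assumption.

```java
import java.util.Arrays;

final class CommentEraserSketch {

    /** Overwrites every "<!-- ... -->" run with spaces, keeping the total length. */
    static String eraseComments(String html) {
        char[] chars = html.toCharArray();
        int from = 0;
        while (true) {
            int start = html.indexOf("<!--", from);
            if (start < 0) {
                break;
            }
            int end = html.indexOf("-->", start + 4);
            // Assumption: an unterminated comment extends to the end of the input.
            int stop = (end < 0) ? chars.length : end + 3;
            Arrays.fill(chars, start, stop, ' ');
            from = stop;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String in = "a<!--x-->b";
        String out = eraseComments(in);
        System.out.println(in.length() == out.length());
    }
}
```

Because every character keeps its index, offsets computed on the original text still point at the same characters in the erased copy.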

toPureText

public static String toPureText(String s)
Gets the plain text version of a String containing HTML.

Parameters:
s - the string containing the HTML.
Returns:
the content of the HTML after filtering all comments and tags.

toPureText

public String toPureText()
Gets a text version of the fragment, obtained after erasing all comments and all tags.

Returns:
a free-text version of the fragment.

getCharData

public CharData getCharData(info.informatica.doc.FragmentPosition pos)
Gets the character data at the given position.

Parameters:
pos - the position.
Returns:
the character data at the position.

getTagBlockByName

public HTMLTag getTagBlockByName(String tagname,
                                 int inipos)
Gets the tag block consisting of the tag named tagname and the enclosed character data. Starts the search at position inipos.

Be careful when using blocks for tags whose end tag is optional. You may get a block delimited by the current start tag and the end tag of ANOTHER tag. In principle, use tag blocks only when you know the document's tag layout in advance.

Parameters:
tagname - Tag name.
inipos - position to start search.
Returns:
the requested Tag, which includes the character data enclosed by it, or null if tag could not be found.
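
The pitfall with optional end tags can be reproduced with a naive block search that pairs a start tag with the next end tag of the same name. This is a hypothetical sketch (NaiveBlockSketch, findBlock and the helper methods are invented for illustration) and makes no claim about how the library itself locates blocks.

```java
final class NaiveBlockSketch {

    /** Case-insensitive indexOf, scanning from the given offset. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Finds "<name" followed by '>', '/' or whitespace, so "p" does not match "<pre>". */
    static int findStartTag(String html, String name, int from) {
        String open = "<" + name;
        int i = indexOfIgnoreCase(html, open, from);
        while (i >= 0) {
            int after = i + open.length();
            if (after < html.length()) {
                char c = html.charAt(after);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    return i;
                }
            }
            i = indexOfIgnoreCase(html, open, i + 1);
        }
        return -1;
    }

    /** Returns the text from the start tag through the NEXT end tag of that name, or null. */
    static String findBlock(String html, String tagname, int inipos) {
        int start = findStartTag(html, tagname, inipos);
        if (start < 0) {
            return null;
        }
        String endTag = "</" + tagname + ">";
        int end = indexOfIgnoreCase(html, endTag, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end + endTag.length());
    }

    public static void main(String[] args) {
        // The first <p> has no end tag, so the block runs to the </p> of the second one.
        System.out.println(findBlock("<p>one<p>two</p>", "p", 0));
    }
}
```

With the input "<p>one<p>two</p>", the block found for the first p tag ends at the end tag that belongs to the second p tag, which is exactly the situation the warning describes.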

getTagDataByName

public CharData getTagDataByName(String tagname)
Gets the Character Data enclosed by given tag of name tagname.

Be careful when using tag data for tags whose end tag is optional.

Parameters:
tagname - Tag name.
Returns:
Character Data enclosed by the given Tag.

getTagDataByName

public CharData getTagDataByName(String tagname,
                                 int inipos)
Gets the Character Data enclosed by the tag of name tagname, searching from position inipos.

Be careful when using tag data for tags whose end tag is optional.

Parameters:
tagname - Tag name.
inipos - position to start search.
Returns:
Character Data enclosed by the given Tag, or null if tag could not be found.

getTagDataById

public CharData getTagDataById(String tagid)
Gets the Character Data enclosed by given tag of ID tagid.

Be careful when using tag data for tags whose end tag is optional.

Parameters:
tagid - Tag ID.
Returns:
Character Data enclosed by the opening/close of the given Tag, or null if the tag was not found.
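
As a rough illustration of looking up enclosed data by ID, the sketch below finds a start tag carrying a quoted id attribute and returns the text up to the next end tag of the same name. The names IdDataSketch and dataById are hypothetical, the regular expression is an assumption made for this example, and nested tags of the same name are not handled; the library may work quite differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdDataSketch {

    /** Case-insensitive indexOf, scanning from the given offset. */
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the text between the tag with the given id and its next end tag, or null. */
    static String dataById(String html, String id) {
        // Group 1 is the tag name, group 2 the quote character around the id value.
        Pattern startTag = Pattern.compile(
                "<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*\\sid\\s*=\\s*([\"'])"
                        + Pattern.quote(id) + "\\2[^>]*>");
        Matcher m = startTag.matcher(html);
        if (!m.find()) {
            return null;
        }
        String endTag = "</" + m.group(1) + ">";
        int end = indexOfIgnoreCase(html, endTag, m.end());
        if (end < 0) {
            return null;
        }
        return html.substring(m.end(), end);
    }

    public static void main(String[] args) {
        String html = "<div id=\"main\"><p id='msg'>Hello</p></div>";
        System.out.println(dataById(html, "msg"));
    }
}
```

Returning null when no tag has the requested ID mirrors the contract stated in the Returns section above.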

getTagById

public HTMLTag getTagById(String tagid)
Convenience method that gets the tag of ID tagid.

Parameters:
tagid - Tag ID.
Returns:
the requested HTML tag.

getTagByName

public HTMLTag getTagByName(String tagname)
Convenience method that gets the tag of type tagname.

Parameters:
tagname - Tag name.
Returns:
the requested HTML tag.

getTagParser

public final TagParser getTagParser()
Gets the tag parser that will be used to parse this fragment's tags.

Returns:
the tag parser.

setTagParser

public void setTagParser(TagParser tagParser)
Sets the tag parser that will be used to parse this fragment's tags.

Parameters:
tagParser - the tag parser.

getTag

public HTMLTag getTag(info.informatica.doc.FragmentPosition pos)
               throws TagParsingException
Gets a tag by its position in the document.

Parameters:
pos - the position of the Tag
Returns:
the tag
Throws:
TagParsingException

getTagsByName

public TagIterator getTagsByName(String tagname)
Gets all the tags of type tagname in the document.

Parameters:
tagname - the name of the Tags to be retrieved.
Returns:
an iterator of all Tags with the given name.

getTagsById

public TagIterator getTagsById(String tagid)
Gets all the tags with ID tagid in the document.

Parameters:
tagid - the ID of the Tags to be retrieved.
Returns:
an iterator of all Tags with the given ID.

findTagByName

public info.informatica.doc.FragmentPosition findTagByName(String tagname)
Gets the position of a tag of a given type.

Parameters:
tagname - the name (type) of the tag.
Returns:
the position where the tag is, or null if it could not be found.

findTagByName

public info.informatica.doc.FragmentPosition findTagByName(String tagname,
                                                           int inipos)
Gets the position of a tag of a given type, starting to search for it at a given place.

Parameters:
tagname - the name (type) of the tag.
inipos - the first place in the document to start searching.
Returns:
the position where the tag is, or null if it could not be found.
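
Searching from a given offset makes it possible to visit tags one at a time without parsing the rest of the document. The sketch below illustrates that idea under stated assumptions: TagSearchSketch is a hypothetical name, and a plain int pair {begin, end} stands in for FragmentPosition, whose actual structure is not described here.

```java
final class TagSearchSketch {

    /**
     * Returns {begin, end} (end exclusive) of the first start tag named tagname
     * at or after inipos, or null if there is none.
     */
    static int[] findTagByName(String html, String tagname, int inipos) {
        String open = "<" + tagname;
        for (int i = Math.max(inipos, 0); i + open.length() < html.length(); i++) {
            if (!html.regionMatches(true, i, open, 0, open.length())) {
                continue;
            }
            // Require a delimiter after the name so that "a" does not match "<abbr>".
            char next = html.charAt(i + open.length());
            if (next != '>' && next != '/' && !Character.isWhitespace(next)) {
                continue;
            }
            // Simplification: a '>' inside an attribute value would end the tag early.
            int close = html.indexOf('>', i);
            if (close < 0) {
                return null;
            }
            return new int[] { i, close + 1 };
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<a href='x'>1</a><abbr>2</abbr><A>3</A>";
        // Resume each search where the previous tag ended.
        int[] pos = findTagByName(html, "a", 0);
        while (pos != null) {
            System.out.println(html.substring(pos[0], pos[1]));
            pos = findTagByName(html, "a", pos[1]);
        }
    }
}
```

Passing the end of the previous match as the next starting offset is what lets the loop in main step through successive tags of the same type.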

getAllTags

public TagIterator getAllTags()

getFinder

public TagFinder getFinder()
Gets a finder of all tags.

Returns:
a finder class for all tags.

getNameFinder

public NameTagFinder getNameFinder(String tagname)
Gets a finder of tags of the given type.

Parameters:
tagname - the name (type) of the tags to look for.
Returns:
a finder class for that type of tag.

getIdFinder

public IdTagFinder getIdFinder(String tagid)
Gets a finder of tags of the given ID.

Parameters:
tagid - the ID of the tags to look for.
Returns:
a finder class for that tag ID.

toString

public String toString()
Overrides:
toString in class Object

length

public final int length()
Specified by:
length in class info.informatica.doc.DocumentFragment