Package info.informatica.html

HTMLDoc, HTMLFragment and HTMLTag: an encapsulation of an HTML document, with a simple, permissive parser that can handle most of the malformed, non-compliant HTML documents found in the real world.


Interface Summary
HTMLEventHandler Event handler.
 

Class Summary
CharData HTML CData.
DefaultHTMLEventHandler  
HTMLDoc HTML document, with API methods for parsing and basic manipulations.
HTMLEventParser HTML Event parser.
HTMLFragment HTML fragment, with API methods for parsing and basic manipulations.
HTMLUtil HTML utilities.
IdTagFinder Finds tags by ID.
NameTagFinder Finds tags by type.
TagFinder This class and its subclasses contain several methods for finding tags by their type and ID.
TagIterator Iterates over the tags that meet the criteria of a TagFinder class.
 

Exception Summary
HTMLDocumentException  
HTMLParsingException  
 

Package info.informatica.html Description

HTMLDoc, HTMLFragment and HTMLTag

This package encapsulates an HTML document and provides a simple, permissive parser that can handle most of the malformed, non-compliant HTML documents found in the real world.

First, load the HTML document using the load method. Once the document is loaded, you can use both low-level and high-level methods to inspect and modify it.

A few common uses

To access an individual tag of your document, always call getHTML first to obtain the HTMLFragment. To retrieve an HTML tag whose ID you know, use the getTagById(String id) method. The returned HTMLTag object gives access to its attributes through the getAttributes method, from which you can retrieve or set any attribute.

Some HTML documents use absolute references to their inline content (inline images, sound, etc.), which makes it difficult to display the page under a different domain (a typical case is the cache of a search engine). In that case, a call to relativizeEmbedded is enough: afterwards the document no longer has this behaviour. Before using this method, make sure that the BASE URL has been set properly.
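
To illustrate what relativizing a reference against a BASE URL involves, here is a minimal sketch that uses only java.net.URI from the standard library. The class RelativizeSketch and its relativize method are hypothetical names introduced for this illustration; this is not the implementation of relativizeEmbedded, and the example URLs are made up.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of info.informatica.html.
final class RelativizeSketch {

    /**
     * Returns the reference expressed relative to the base URL when it lives
     * under the base (same scheme and authority, path below the base path).
     * Otherwise the absolute reference is returned unchanged.
     */
    static String relativize(String baseUrl, String reference) {
        URI base = URI.create(baseUrl);
        // Resolve first, so references that are already relative are handled too
        URI absolute = base.resolve(reference).normalize();
        return base.relativize(absolute).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/";
        // Under the base: prints "img/logo.png"
        System.out.println(relativize(base, "http://www.example.com/docs/img/logo.png"));
        // Different host: printed unchanged
        System.out.println(relativize(base, "http://other.example.org/a.png"));
    }
}
```

Note that URI.relativize only shortens references whose path starts with the base path, which is one reason the BASE URL has to be correct before any rewriting takes place.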

getMetaInfo lets you easily retrieve any document meta-information, although there are also specialized methods such as getHttpEquiv or getKeywords that access specific meta-information directly.
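
As a rough picture of what meta-information lookup amounts to, the following self-contained sketch collects the name (or http-equiv) and content attributes of meta tags into a map, tolerating unquoted attribute values. It is an assumption-laden illustration built on java.util.regex: the class MetaInfoSketch and its metaInfo method are hypothetical, and nothing here describes how getMetaInfo is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of info.informatica.html.
final class MetaInfoSketch {

    // Matches a whole <meta ...> tag, in any letter case
    private static final Pattern META =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches name="value", name='value' or an unquoted name=value
    private static final Pattern ATTR = Pattern.compile(
            "([A-Za-z][\\w-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Maps lower-cased meta names (or http-equiv names) to their content. */
    static Map<String, String> metaInfo(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(meta.group(1));
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (key.equals("name") || key.equals("http-equiv")) {
                    name = value.toLowerCase(Locale.ROOT);
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Keep only tags that carry both a name and a content
            if (name != null && content != null) {
                result.put(name, content);
            }
        }
        return result;
    }
}
```

With this sketch, metaInfo("<meta name=\"date\" content=21/01/2001>") yields a map whose "date" entry is "21/01/2001". A known limitation is that a '>' character inside a quoted attribute value ends the tag match early.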

Example

The following example does some tag-level manipulations:

    Reader re = ...
    // Create the document
    HTMLDoc doc = new HTMLDoc();
    // Load its content
    doc.load(re);
    // Get the HTML
    HTMLFragment html = doc.getHTML();
    // Create a 'date' meta-tag
    HTMLTag tag = HTMLTag.parse("<meta name=\"date\" content=21/01/2001>");
    // Insert it just before the title
    html.insertBefore(html.findTagByName("title"), tag);
    // Create a paragraph
    tag = HTMLTag.create("p");
    // Insert '<p>Paragraph</p>' just before a tag with id="someid"
    html.insertBefore(html.getIdFinder("someid").getTag().getPosition(),
        tag.toString("Paragraph"));
    // Create an anchor to foo.html
    HTMLTag anchor = HTMLTag.parse("<a href=\"foo.html\">");
    // We could also do a 'HTMLTag.create("a")' and then set the 'href'
    // attribute using getAttributes().setAttribute("href", "foo.html")
    //
    // Now we get a tag block with id="otherid"
    tag = html.getIdFinder("otherid").getTagBlock();
    // Replace the tag that has id="otherid" with the same tag
    // wrapped in the foo.html anchor
    html.replace(tag.getBlockPosition(), anchor.toString(tag));
    // For example, if the 'otherid' tag was
    //   '<img id="otherid" src="something.jpg">',
    // then the result would be:
    //   '<a href="foo.html"><img id="otherid" src="something.jpg"></a>'
    //
    tag = html.getTagByName("meta");
    // We just got the first 'meta' tag found in the document, and now we
    // set its name attribute to 'last_update', and its value
    // (the 'content' attribute) to "20/01/2001"
    tag.getAttributes().setAttribute("name", "last_update");
    tag.getAttributes().setAttribute("content", "20/01/2001");
    // Commit the changes to the 'meta' tag to the document
    html.update(tag);