| Interface Summary | |
|---|---|
| HTMLEventHandler | Event handler. |

| Class Summary | |
|---|---|
| CharData | HTML CData. |
| DefaultHTMLEventHandler | |
| HTMLDoc | HTML document, with API methods for parsing and basic manipulations. |
| HTMLEventParser | HTML Event parser. |
| HTMLFragment | HTML fragment, with API methods for parsing and basic manipulations. |
| HTMLUtil | HTML utilities. |
| IdTagFinder | Finds tags by ID. |
| NameTagFinder | Finds tags by type. |
| TagFinder | This class and its subclasses contain several methods for finding tags by their type and ID. |
| TagIterator | Iterates over the tags that meet the criteria of a TagFinder class. |

| Exception Summary | |
|---|---|
| HTMLDocumentException | |
| HTMLParsingException | |

HTMLDoc encapsulates an HTML document and includes a simple, permissive parser that can handle most of the malformed, non-compliant HTML documents found in the real world.
First, load the HTML document with the load method. Once the document is loaded,
you can work with it through both low-level and high-level methods.
With the getHTML method you can access individual tags by ID
(the "id" attribute) or by name (meaning the type of the tag itself, for example "img",
not a "name" attribute), and then perform tag-related operations. Higher-level methods are also available (for example, toPureText).
To access an individual tag of the document, always call getHTML first
to obtain the HTMLFragment.
To retrieve an HTML tag whose ID you know, use the getTagById(String id)
method. The returned HTMLTag object gives you access to its attributes through the
getAttributes method, from which you can retrieve or set any attribute.
Some HTML documents use absolute references to their inline content (inline images, sound, etc.),
which makes it difficult to display the page under a different domain (a typical case is
the cache of a search engine). In that case, simply call relativizeEmbedded,
after which the document no longer has this problem. Before calling this method, make sure that the
BASE URL has been set properly.
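To make the idea of relativizing a reference against a BASE URL concrete, here is a minimal, self-contained sketch that uses only the standard java.net.URI class. It is not the implementation of relativizeEmbedded; the class name RelativizeSketch, the helper relativize, and the example URLs are assumptions made purely for illustration.

```java
import java.net.URI;

class RelativizeSketch {
    /**
     * Rewrites an absolute reference relative to a base URL.
     * References that are not located under the base are returned unchanged.
     */
    static String relativize(String baseUrl, String absoluteRef) {
        URI base = URI.create(baseUrl);
        URI ref = URI.create(absoluteRef);
        return base.relativize(ref).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/site/";
        // Same host and under the base path: becomes relative
        System.out.println(relativize(base, "http://example.com/site/img/logo.png"));
        // prints "img/logo.png"
        // Different host: left untouched
        System.out.println(relativize(base, "http://other.example.org/banner.gif"));
        // prints "http://other.example.org/banner.gif"
    }
}
```

The sketch also shows why the base matters: only references that share the base's scheme, host, and path prefix can be shortened, so an unset or wrong base leaves every reference absolute.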
getMetaInfo lets you easily retrieve any document meta-information, and
specialized methods such as getHttpEquiv and getKeywords give
direct access to specific pieces of meta-information.
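As a rough picture of what looking up meta-information involves, the following self-contained sketch scans an HTML string for meta tags with java.util.regex and returns the content attribute of the first match. It does not reproduce how getMetaInfo, getHttpEquiv, or getKeywords are implemented; the class MetaInfoSketch and its methods are hypothetical, and a regular expression is a much weaker tool than the package's permissive parser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaInfoSketch {
    // Matches a whole <meta ...> tag and captures its attribute text
    private static final Pattern META =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Matches name="value", name='value', or name=value
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][\\w-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Parses the attribute text of a tag into a map with lower-cased names. */
    static Map<String, String> parseAttributes(String tagBody) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(tagBody);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attrs.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attrs;
    }

    /**
     * Returns the content of the first meta tag whose keyAttr
     * ("name" or "http-equiv") equals key, or null if there is none.
     */
    static String metaContent(String html, String keyAttr, String key) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            Map<String, String> attrs = parseAttributes(m.group(1));
            if (key.equalsIgnoreCase(attrs.get(keyAttr))) {
                return attrs.get("content");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<head><meta name=\"keywords\" content=\"java, html\">"
                    + "<meta http-equiv=\"Content-Type\" content=\"text/html\"></head>";
        System.out.println(metaContent(html, "name", "keywords"));
        // prints "java, html"
        System.out.println(metaContent(html, "http-equiv", "content-type"));
        // prints "text/html"
    }
}
```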
Example
The following example does some tag-level manipulations:
```java
Reader re = ...
// Create the document
HTMLDoc doc = new HTMLDoc();
// Load its content
doc.load(re);
// Get the HTML
HTMLFragment html = doc.getHTML();
// Create a 'date' meta-tag
HTMLTag tag = HTMLTag.parse("<meta name=\"date\" content=21/01/2001>");
// Insert it just before the title
html.insertBefore(html.findTagByName("title"), tag);
// Create a paragraph
tag = HTMLTag.create("p");
// Insert '<p>Paragraph</p>' just before the tag with id="someid"
html.insertBefore(html.getIdFinder("someid").getTag().getPosition(),
                  tag.toString("Paragraph"));
// Create an anchor to foo.html
HTMLTag anchor = HTMLTag.parse("<a href=\"foo.html\">");
// We could also do a 'HTMLTag.create("a")' and then set the 'href'
// attribute using getAttributes().setAttribute("href", "foo.html")

// Now we get the tag block with id="otherid"
tag = html.getIdFinder("otherid").getTagBlock();
// Replace the tag that has id="otherid" with the same tag
// wrapped in the foo.html anchor
html.replace(tag.getBlockPosition(), anchor.toString(tag));
// For example, if the 'otherid' tag was 'img src="something.jpg"',
// then the result would be:
// '<a href="foo.html"><img id="otherid" src="something.jpg"></a>'

tag = html.getTagByName("meta");
// We just got the first 'meta' tag found in the document, and now we
// set its name attribute to 'last_update', and its value
// (the 'content' attribute) to "20/01/2001"
tag.getAttributes().setAttribute("name", "last_update");
tag.getAttributes().setAttribute("content", "20/01/2001");
// Commit the changes made to the 'meta' tag to the document
html.update(tag);
```
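The position-based replace in the example (wrapping the tag with id="otherid" in an anchor) can be pictured with plain strings. The sketch below is only an illustration built on java.util.regex and StringBuilder, not the package's implementation; the class WrapTagSketch and its method are hypothetical, and it handles only a single tag without a separate closing tag, such as img.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WrapTagSketch {
    /**
     * Wraps the first tag carrying the given id in an anchor pointing to href.
     * Returns the input unchanged when no such tag is found.
     */
    static String wrapInAnchor(String html, String id, String href) {
        // An opening tag containing id=..., quoted or unquoted
        Pattern tagWithId = Pattern.compile(
            "<[A-Za-z][^>]*\\sid\\s*=\\s*[\"']?" + Pattern.quote(id)
            + "(?=[\"'\\s>])[\"']?[^>]*>");
        Matcher m = tagWithId.matcher(html);
        if (!m.find()) {
            return html;
        }
        // Replace the span [start, end) of the tag with the wrapped version
        StringBuilder out = new StringBuilder(html);
        out.replace(m.start(), m.end(),
                    "<a href=\"" + href + "\">" + m.group() + "</a>");
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p><img id=\"otherid\" src=\"something.jpg\"></p>";
        System.out.println(wrapInAnchor(html, "otherid", "foo.html"));
        // prints: <p><a href="foo.html"><img id="otherid" src="something.jpg"></a></p>
    }
}
```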