info.informatica.html
Class HTMLDoc

java.lang.Object
  extended by info.informatica.doc.DocumentFile
      extended by info.informatica.doc.TextDocument
          extended by info.informatica.www.WebPage
              extended by info.informatica.html.HTMLDoc
All Implemented Interfaces:
info.informatica.doc.DocPropertyHandler, info.informatica.doc.Document, Hypertext

public final class HTMLDoc
extends WebPage
implements info.informatica.doc.DocPropertyHandler

An HTML document, with API methods for parsing and basic manipulation.

This class implements the hypertext document methods for an HTML page, including methods that extract meta-information and the embedded or linked URLs.

Author:
amengual at informatica dot info

Field Summary
static String XHTML_NAMESPACE_URI
           
 
Constructor Summary
HTMLDoc()
           
HTMLDoc(TagParser parser)
           
 
Method Summary
 BaseFont getBaseFont()
          Gets the BASEFONT tag.
 CharData getCharData(info.informatica.doc.FragmentPosition pos)
          Gets the character data at the given position.
 Iterator<String> getEmbeddedUris()
          Gets the URIs embedded in the document.
 Iterator<URL> getEmbeddedUrls()
          Gets the embedded URLs (images, Flash, etc.)
 HTMLFragment getHTML()
          Gets the HTML contents of this document.
 String getHttpEquiv(String name)
          Gets the HTTP EQUIV meta information.
 Enumeration getKeywords()
          Gets the keywords of this document.
 Iterator getLinkedUrls()
          Gets the linked URLs (anchors, etc.)
 String getMetaInfo(String name)
          Gets the meta information from the HTML.
 String getName()
          Gets the name of the page.
 String getObjectProperty(String objectid, String propname)
           
 RobotInfo getRobotInfo()
          Gets the robot meta-information according to the "robots" meta tag.
 HTMLTag getTag(info.informatica.doc.FragmentPosition pos)
          Gets a tag by its position in the document.
 String getTitle()
          Gets the title of the document.
 void load(String doc)
           
 void relativizeEmbedded()
          Removes the scheme://host part of all the embedded URLs when it is not needed.
 void setBase()
          Quick-and-dirty method that looks for the BASE tag in the HTML and adjusts the base URL accordingly.
 void setName(String nombre)
          Sets the name of the HTML page.
 void setObjectProperty(String objectid, String propname, String propvalue)
           
 String toPureText()
          Returns a text version of the document, obtained after erasing all comments and all tags.
 String toString()
           
 
Methods inherited from class info.informatica.www.WebPage
getBase, getURL, setBase, setURL, urlToUri
 
Methods inherited from class info.informatica.doc.TextDocument
getInputStream, getLanguage, load, load, setLanguage, write, write
 
Methods inherited from class info.informatica.doc.DocumentFile
getLastUpdateDate, getMediaType, getSize, setDate, setSize, setTitle, setType
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

XHTML_NAMESPACE_URI

public static final String XHTML_NAMESPACE_URI
See Also:
Constant Field Values
Constructor Detail

HTMLDoc

public HTMLDoc(TagParser parser)

HTMLDoc

public HTMLDoc()
Method Detail

load

public void load(String doc)
Specified by:
load in class info.informatica.doc.TextDocument

setName

public void setName(String nombre)
Sets the name of the HTML page.

Parameters:
nombre - the page name.

getName

public String getName()
Gets the name of the page.

Returns:
page name.

getHTML

public HTMLFragment getHTML()
Gets the HTML contents of this document.

Returns:
the HTML contents.

setBase

public void setBase()
Quick-and-dirty method that looks for the BASE tag in the HTML and adjusts the base URL accordingly.

Accuracy is not guaranteed with badly malformed HTML. For example, if the href attribute of the BASE tag is not present, the method could pick up the href of some other tag.

Use getHTML().getTagByName("base").getAttributes().getAttribute("href") if you need better precision.
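
The pitfall described above can be illustrated with a minimal, JDK-only sketch. It is not the library's implementation: the class name BaseHrefSketch and the method findBaseHref are hypothetical, and the regular expressions assume quoted attribute values. The point is that the href search is confined to the text of the BASE tag itself, so the href of an unrelated tag is never returned.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of info.informatica.html.
final class BaseHrefSketch {
    // Matches a whole <base ...> start tag; \b keeps <basefont> from matching.
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches a quoted href attribute (double or single quotes).
    private static final Pattern HREF_ATTR = Pattern.compile(
            "\\bhref\\s*=\\s*(\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    // Returns the href of the first BASE tag, or empty if there is none.
    static Optional<String> findBaseHref(String html) {
        Matcher tag = BASE_TAG.matcher(html);
        if (!tag.find()) {
            return Optional.empty();
        }
        // Search only inside the BASE tag, never in the rest of the page.
        Matcher href = HREF_ATTR.matcher(tag.group());
        if (!href.find()) {
            return Optional.empty();
        }
        return Optional.of(href.group(2) != null ? href.group(2) : href.group(3));
    }

    public static void main(String[] args) {
        String html = "<head><base target=\"_top\"></head>"
                + "<a href=\"http://other.example/\">link</a>";
        // The BASE tag has no href, so the anchor's href must not be reported.
        System.out.println(findBaseHref(html).orElse("(no base href)"));
    }
}
```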


toPureText

public String toPureText()
Returns a text version of the document, obtained after erasing all comments and all tags.

Specified by:
toPureText in class info.informatica.doc.TextDocument
Returns:
a free-text version of the document.
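
As a rough illustration of what erasing all comments and all tags involves, here is a JDK-only sketch based on regular expressions. It is not the library's implementation: the class PureTextSketch is hypothetical, and replacing each tag with a space and collapsing whitespace are assumptions made for readability of the output.

```java
// Hypothetical helper, not part of info.informatica.html.
final class PureTextSketch {
    static String toPureText(String html) {
        // Remove comments first, so that a '>' inside a comment is harmless.
        String noComments = html.replaceAll("(?s)<!--.*?-->", "");
        // Replace every remaining tag with a space to keep words apart.
        String noTags = noComments.replaceAll("(?s)<[^>]*>", " ");
        // Collapse runs of whitespace (an assumption, see the lead-in).
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPureText("<p>Hello <!-- hidden --><b>world</b></p>"));
    }
}
```

A regex approach like this one does not decode character entities and can be confused by a '>' inside an attribute value; a real parser avoids both problems.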

getKeywords

public Enumeration getKeywords()
Gets the keywords of this document.

Specified by:
getKeywords in class info.informatica.doc.TextDocument
Returns:
an enumeration with the keywords of the document, or null if no keywords were found.
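
The null-when-empty contract of the return value can be sketched as follows. This is an illustration only: it assumes the keywords arrive as a comma-separated string (as in the content of a "keywords" META tag), which this page does not state, and the class KeywordsSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of info.informatica.html.
final class KeywordsSketch {
    // Splits a comma-separated keyword string; returns null if nothing is left.
    static Enumeration<String> keywords(String content) {
        if (content == null) {
            return null;
        }
        List<String> list = new ArrayList<>();
        for (String keyword : content.split(",")) {
            String trimmed = keyword.trim();
            if (!trimmed.isEmpty()) {
                list.add(trimmed);
            }
        }
        return list.isEmpty() ? null : Collections.enumeration(list);
    }

    public static void main(String[] args) {
        Enumeration<String> e = keywords(" java, html ,,parser");
        // Callers must check for null before iterating.
        if (e != null) {
            while (e.hasMoreElements()) {
                System.out.println(e.nextElement());
            }
        }
    }
}
```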

getTitle

public String getTitle()
Gets the title of the document.

Specified by:
getTitle in interface info.informatica.doc.Document
Overrides:
getTitle in class info.informatica.doc.DocumentFile
Returns:
the title of the document as given by the TITLE tag.

getMetaInfo

public String getMetaInfo(String name)
Gets the meta information from the HTML.

Overrides:
getMetaInfo in class info.informatica.doc.TextDocument
Parameters:
name - the name of the META tag
Returns:
the value of the META tag with the given name, or null if not found.
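
A lookup of this kind can be sketched with the JDK alone. The code below is not the library's implementation: the class MetaInfoSketch and its methods are hypothetical, attribute values are assumed to be quoted, and comparing the name case-insensitively is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of info.informatica.html.
final class MetaInfoSketch {
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the quoted value of an attribute inside one tag, or null.
    static String attribute(String tag, String attr) {
        Pattern p = Pattern.compile(
                "\\s" + Pattern.quote(attr) + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(tag);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    // Returns the content of the first META tag with the given name, or null.
    static String metaInfo(String html, String name) {
        Matcher m = META_TAG.matcher(html);
        while (m.find()) {
            String tag = m.group();
            if (name.equalsIgnoreCase(attribute(tag, "name"))) {
                return attribute(tag, "content");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html\">"
                + "<meta name=\"Description\" content=\"A test page\">";
        System.out.println(metaInfo(html, "description"));
    }
}
```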

getBaseFont

public BaseFont getBaseFont()
Gets the BASEFONT tag.

Returns:
the BASEFONT tag, or null if not found.

getHttpEquiv

public String getHttpEquiv(String name)
Gets the HTTP EQUIV meta information.

Specified by:
getHttpEquiv in class WebPage
Parameters:
name - the name of the HTTP EQUIV meta field.
Returns:
the value of the requested HTTP EQUIV meta field.

getRobotInfo

public RobotInfo getRobotInfo()
Gets the robot meta-information according to the "robots" meta tag.

Specified by:
getRobotInfo in interface Hypertext
Specified by:
getRobotInfo in class WebPage
Returns:
the robot meta information.
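
The RobotInfo type is not described on this page. As a hedged illustration of the kind of information a "robots" meta tag carries, the sketch below parses its content value into two flags. The class RobotsDirectivesSketch and its fields are hypothetical and do not reflect the RobotInfo API; only the widely used noindex, nofollow and none directives are handled.

```java
import java.util.Locale;

// Hypothetical value class, not the library's RobotInfo.
final class RobotsDirectivesSketch {
    final boolean index;
    final boolean follow;

    private RobotsDirectivesSketch(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    // Parses a robots META content value such as "noindex, nofollow".
    // A null or empty value means no restrictions.
    static RobotsDirectivesSketch parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
                switch (token.trim()) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none":
                        index = false;
                        follow = false;
                        break;
                    default:
                        // "index", "follow", "all" and unknown tokens change nothing.
                        break;
                }
            }
        }
        return new RobotsDirectivesSketch(index, follow);
    }

    public static void main(String[] args) {
        RobotsDirectivesSketch info = parse("NOINDEX, follow");
        System.out.println("index=" + info.index + " follow=" + info.follow);
    }
}
```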

setObjectProperty

public void setObjectProperty(String objectid,
                              String propname,
                              String propvalue)
Specified by:
setObjectProperty in interface info.informatica.doc.DocPropertyHandler

getObjectProperty

public String getObjectProperty(String objectid,
                                String propname)
Specified by:
getObjectProperty in interface info.informatica.doc.DocPropertyHandler

getCharData

public CharData getCharData(info.informatica.doc.FragmentPosition pos)
Gets the character data at the given position.

Parameters:
pos - the position.
Returns:
the character data at the position.

getTag

public HTMLTag getTag(info.informatica.doc.FragmentPosition pos)
               throws TagParsingException
Gets a tag by its position in the document.

Parameters:
pos - the position of the Tag
Returns:
the tag
Throws:
TagParsingException

relativizeEmbedded

public void relativizeEmbedded()
Removes the scheme://host part of all the embedded URLs when it is not needed.

Useful for serving a document under a different host.
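
This page does not define when the prefix "is not needed". The sketch below assumes it means that the embedded URL has the same scheme, host and port as the document's own URL. It uses only java.net.URI; the class RelativizeSketch and the method stripSchemeAndHost are hypothetical and do not show the library's implementation.

```java
import java.net.URI;

// Hypothetical helper, not part of info.informatica.html.
final class RelativizeSketch {
    // Returns the embedded URL without scheme://host when it points to the
    // same origin as the base URL; otherwise returns it unchanged.
    static String stripSchemeAndHost(URI base, String embedded) {
        URI uri;
        try {
            uri = URI.create(embedded);
        } catch (IllegalArgumentException e) {
            return embedded; // not a valid URI, leave it alone
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            return embedded; // already relative, or an opaque URI such as mailto:
        }
        boolean sameOrigin = uri.getScheme().equalsIgnoreCase(base.getScheme())
                && uri.getHost().equalsIgnoreCase(base.getHost())
                && uri.getPort() == base.getPort();
        if (!sameOrigin) {
            return embedded;
        }
        String result = uri.getRawPath().isEmpty() ? "/" : uri.getRawPath();
        if (uri.getRawQuery() != null) {
            result += "?" + uri.getRawQuery();
        }
        if (uri.getRawFragment() != null) {
            result += "#" + uri.getRawFragment();
        }
        return result;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/docs/index.html");
        System.out.println(stripSchemeAndHost(base, "http://www.example.com/img/logo.png?v=2"));
        System.out.println(stripSchemeAndHost(base, "http://cdn.example.net/a.js"));
    }
}
```

URLs on another host keep their full form, so the document still works after it is moved.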


getEmbeddedUrls

public Iterator<URL> getEmbeddedUrls()
Gets the embedded URLs (images, Flash, etc.)

Specified by:
getEmbeddedUrls in interface Hypertext
Returns:
an iterator for the embedded URLs (images, Flash, etc.)

getEmbeddedUris

public Iterator<String> getEmbeddedUris()
Gets the URIs embedded in the document.

Returns:
an iterator with all the URIs embedded in the document.

getLinkedUrls

public Iterator getLinkedUrls()
Gets the linked URLs (anchors, etc.)

Specified by:
getLinkedUrls in interface Hypertext
Returns:
an iterator with the linked URLs
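
Because the declared return type is a raw Iterator, the element type is not visible in the signature. As a hedged, JDK-only illustration of how linked URLs can be collected and resolved against a base URL, the sketch below handles anchor tags only and assumes quoted href values. The class LinkedUrlsSketch is hypothetical and is not the library's implementation.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of info.informatica.html.
final class LinkedUrlsSketch {
    // Matches the quoted href of an <a ...> start tag.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Collects anchor targets and resolves them against the base URI.
    static List<URI> linkedUris(URI base, String html) {
        List<URI> out = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1) != null ? m.group(1) : m.group(2);
            try {
                out.add(base.resolve(href.trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/docs/index.html");
        String html = "<a href=\"page2.html\">next</a> <A HREF='/top'>top</A>";
        for (URI link : linkedUris(base, html)) {
            System.out.println(link);
        }
    }
}
```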

toString

public String toString()
Specified by:
toString in class info.informatica.doc.TextDocument