java.lang.Object
  info.informatica.doc.DocumentFile
    info.informatica.doc.TextDocument
      info.informatica.www.WebPage
        info.informatica.html.HTMLDoc
public final class HTMLDoc
HTML document, with API methods for parsing and basic manipulations.
This class provides the hypertext document methods for an HTML page, including methods to extract meta-information, the embedded or linked URLs, etc.
| Field Summary | |
|---|---|
| static String | XHTML_NAMESPACE_URI |
| Constructor Summary | |
|---|---|
| HTMLDoc() | |
| HTMLDoc(TagParser parser) | |
| Method Summary | | |
|---|---|---|
| BaseFont | getBaseFont() | Gets the BASEFONT tag. |
| CharData | getCharData(info.informatica.doc.FragmentPosition pos) | Gets the character data at the given position. |
| Iterator<String> | getEmbeddedUris() | Gets the URIs embedded in the document. |
| Iterator<URL> | getEmbeddedUrls() | Gets the embedded URLs (images, flash, etc.) |
| HTMLFragment | getHTML() | Gets the HTML contents of this document. |
| String | getHttpEquiv(String name) | Gets the HTTP EQUIV meta information. |
| Enumeration | getKeywords() | Gets the keywords of this document. |
| Iterator | getLinkedUrls() | Gets the linked URLs (anchors, etc.) |
| String | getMetaInfo(String name) | Gets the meta information from the HTML. |
| String | getName() | Gets the name of the page. |
| String | getObjectProperty(String objectid, String propname) | |
| RobotInfo | getRobotInfo() | Gets the Robot meta-information according to the "robots" meta tag. |
| HTMLTag | getTag(info.informatica.doc.FragmentPosition pos) | Gets a tag by its position in the document. |
| String | getTitle() | Gets the title of the document. |
| void | load(String doc) | |
| void | relativizeEmbedded() | Removes the scheme://host part of all the embedded URLs when it is not needed. |
| void | setBase() | Quick & dirty method that looks for the BASE tag in HTML and adjusts BASE URL accordingly. |
| void | setName(String nombre) | Sets the name of the HTML page. |
| void | setObjectProperty(String objectid, String propname, String propvalue) | |
| String | toPureText() | Returns a text version of the document, obtained after erasing all comments and all tags. |
| String | toString() | |
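The summary for toPureText() describes erasing all comments and all tags. The following is a minimal standalone sketch of that idea using only the JDK; it is not the library's implementation, and the class name PureTextSketch and its regular expressions are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of info.informatica.html.
class PureTextSketch {
    // Comments are removed first, so that tags written inside a comment do not survive.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String toPureText(String html) {
        String noComments = COMMENT.matcher(html).replaceAll("");
        // Each tag becomes a space so that adjacent blocks do not run together.
        String noTags = TAG.matcher(noComments).replaceAll(" ");
        // Collapse the whitespace runs left behind by the removed markup.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(toPureText("<p>Hello <!-- note --><b>world</b></p>"));
    }
}
```

Replacing each tag with a space is a simplification: it also splits a word that contains inline markup, and character entities are left undecoded.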
| Methods inherited from class info.informatica.www.WebPage |
|---|
| getBase, getURL, setBase, setURL, urlToUri |

| Methods inherited from class info.informatica.doc.TextDocument |
|---|
| getInputStream, getLanguage, load, load, setLanguage, write, write |

| Methods inherited from class info.informatica.doc.DocumentFile |
|---|
| getLastUpdateDate, getMediaType, getSize, setDate, setSize, setTitle, setType |

| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final String XHTML_NAMESPACE_URI
| Constructor Detail |
|---|
public HTMLDoc(TagParser parser)
public HTMLDoc()
| Method Detail |
|---|
public void load(String doc)
Overrides: load in class info.informatica.doc.TextDocument
public void setName(String nombre)
Parameters: nombre - the page name.
public String getName()
public HTMLFragment getHTML()
public void setBase()
Accuracy is not guaranteed with really bad HTML. For example, if the href attribute
of
Use getHTML().getTagByName("base").getAttributes().getAttribute("href")
if you need better precision.
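To illustrate why a quick lookup of the BASE tag can be inaccurate, here is a minimal standalone sketch of a pattern-based search for the href of a BASE tag, resolved against the document URI. It is an assumption about how such a shortcut might look, not the code behind setBase(); the class BaseHrefSketch and the method effectiveBase are hypothetical names.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of info.informatica.html.
class BaseHrefSketch {
    // Matches the href of the first <base ...> tag: double-quoted, single-quoted or bare value.
    private static final Pattern BASE_HREF = Pattern.compile(
            "<base\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the base URI to use for relative links, falling back to the document URI.
    static URI effectiveBase(String html, URI documentUri) {
        Matcher m = BASE_HREF.matcher(html);
        if (!m.find()) {
            return documentUri;
        }
        String href = m.group(1) != null ? m.group(1)
                : m.group(2) != null ? m.group(2) : m.group(3);
        try {
            // An absolute href replaces the document URI; a relative one is resolved against it.
            return documentUri.resolve(href.trim());
        } catch (IllegalArgumentException e) {
            // A malformed href is ignored rather than propagated.
            return documentUri;
        }
    }
}
```

A pattern like this does not know whether the BASE tag sits inside a comment or a script, which is the kind of case where walking the parsed tags is more precise.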
public String toPureText()
Overrides: toPureText in class info.informatica.doc.TextDocument
public Enumeration getKeywords()
Overrides: getKeywords in class info.informatica.doc.TextDocument
public String getTitle()
Specified by: getTitle in interface info.informatica.doc.Document
Overrides: getTitle in class info.informatica.doc.DocumentFile
public String getMetaInfo(String name)
Overrides: getMetaInfo in class info.informatica.doc.TextDocument
Parameters: name - the name of the META tag
Returns: name meta tag, or null if not found.
public BaseFont getBaseFont()
public String getHttpEquiv(String name)
Overrides: getHttpEquiv in class WebPage
Parameters: name - the name of the HTTP EQUIV meta field.
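Lookups such as getMetaInfo(String) and getHttpEquiv(String) come down to finding a META tag by its name or http-equiv attribute and reading its content attribute. The sketch below shows that lookup on a raw HTML string with JDK classes only; it is an illustration under simplifying assumptions (quoted attribute values only), not the library's parser, and MetaLookupSketch and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of info.informatica.html.
class MetaLookupSketch {
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Reads one quoted attribute value from a single tag, or returns null if absent.
    static String attribute(String tag, String attr) {
        Matcher m = Pattern.compile(
                "\\b" + Pattern.quote(attr) + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE).matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    // keyAttr is "name" for named META tags or "http-equiv" for HTTP EQUIV fields.
    // Returns the content attribute of the first matching tag, or null if none matches.
    static String metaContent(String html, String keyAttr, String key) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            String tag = m.group();
            if (key.equalsIgnoreCase(attribute(tag, keyAttr))) {
                return attribute(tag, "content");
            }
        }
        return null;
    }
}
```

For example, metaContent(html, "name", "description") would correspond to a named META lookup, and metaContent(html, "http-equiv", "Content-Type") to an HTTP EQUIV lookup.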
public RobotInfo getRobotInfo()
Specified by: getRobotInfo in interface Hypertext
Overrides: getRobotInfo in class WebPage
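The structure of RobotInfo is not shown on this page. As a rough illustration of what "robots" meta-information usually carries, the sketch below parses the content of a robots META tag into two flags. The class RobotsMetaSketch and its fields are hypothetical and do not mirror RobotInfo.

```java
import java.util.Locale;

// Hypothetical value holder for illustration; it does not mirror the library's RobotInfo.
class RobotsMetaSketch {
    final boolean index;
    final boolean follow;

    RobotsMetaSketch(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    // Parses the content attribute of <meta name="robots" content="...">.
    // A missing tag (null content) is treated as "index, follow".
    static RobotsMetaSketch parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.split(",")) {
                switch (token.trim().toLowerCase(Locale.ROOT)) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none":
                        index = false;
                        follow = false;
                        break;
                    case "all":
                        index = true;
                        follow = true;
                        break;
                    default:
                        // "index", "follow" and unknown tokens leave the flags unchanged.
                        break;
                }
            }
        }
        return new RobotsMetaSketch(index, follow);
    }
}
```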
public void setObjectProperty(String objectid,
String propname,
String propvalue)
Specified by: setObjectProperty in interface info.informatica.doc.DocPropertyHandler
public String getObjectProperty(String objectid,
String propname)
Specified by: getObjectProperty in interface info.informatica.doc.DocPropertyHandler
public CharData getCharData(info.informatica.doc.FragmentPosition pos)
Parameters: pos - the position.
public HTMLTag getTag(info.informatica.doc.FragmentPosition pos)
throws TagParsingException
Parameters: pos - the position of the Tag
Throws: TagParsingException
public void relativizeEmbedded()
Useful to mount a document under a different host.
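One way to picture this operation: an embedded URL that points at the same scheme and host as the page can be reduced to its path, query and fragment, while URLs on other hosts stay untouched. The sketch below shows that rule for a single URL with java.net.URI. It is an assumption about the behaviour, not the library's code, and RelativizeSketch and stripHostIfSame are hypothetical names.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of info.informatica.html.
class RelativizeSketch {
    // Returns the embedded URL without scheme://host when both match the page URI;
    // otherwise returns the embedded URL unchanged.
    static String stripHostIfSame(URI page, String embedded) {
        URI u;
        try {
            u = URI.create(embedded);
        } catch (IllegalArgumentException e) {
            return embedded; // leave malformed URLs alone
        }
        if (!u.isAbsolute() || u.isOpaque()) {
            return embedded; // already relative, or something like mailto:
        }
        boolean sameScheme = u.getScheme().equalsIgnoreCase(page.getScheme());
        boolean sameAuthority = u.getRawAuthority() != null
                && u.getRawAuthority().equalsIgnoreCase(page.getRawAuthority());
        if (!sameScheme || !sameAuthority) {
            return embedded; // the host part is needed here
        }
        StringBuilder sb = new StringBuilder();
        String path = u.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

With host-free URLs, the same markup keeps working when the document is served from another host.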
public Iterator<URL> getEmbeddedUrls()
Specified by: getEmbeddedUrls in interface Hypertext
public Iterator<String> getEmbeddedUris()
public Iterator getLinkedUrls()
Specified by: getLinkedUrls in interface Hypertext
public String toString()
Overrides: toString in class info.informatica.doc.TextDocument