Sep 29
Swing's HTMLEditorKit class can be used to parse HTML. By subclassing HTMLDocument and overriding getReader() to return your own parser callback, you can extract just the parts of the document you are interested in.
The following example shows how to extract just the text.
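import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;
// Open the page to parse
URL url = new URL("http://www.objects.com.au");
Reader reader = new InputStreamReader(
    url.openConnection().getInputStream());
EditorKit editorKit = new HTMLEditorKit();
HTMLText htmlText = new HTMLText();
// Parse the HTML
editorKit.read(reader, htmlText, 0);
// Get the extracted text
String text = htmlText.getText();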
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
public class HTMLText extends HTMLDocument
{
    // stores any text found in document
    private StringBuilder text = new StringBuilder();
    /**
     * Returns any text found in the document during parsing
     */
    public String getText()
    {
        return text.toString();
    }
    /** Returns the callback the parser reports to, instead of the default document builder */
    @Override
    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new TextCallback();
    }
    class TextCallback extends HTMLEditorKit.ParserCallback
    {
        /** Invoked when text is encountered during parsing */
        @Override
        public void handleText(char[] data, int pos)
        {
            // append each run of text on its own line
            text.append(data);
            text.append('\n');
        }
    }
}
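To try the technique without a network connection, the same idea can be sketched against an in-memory string. This is only a minimal sketch: the names HtmlTextSketch, TextOnlyDocument and extractText are made up for illustration, and wrapping the checked exceptions in IllegalStateException is just a convenience for the demo. It also sets the "IgnoreCharsetDirective" document property, on the assumption that the Reader has already decoded the characters and any charset declared in a meta tag should not interrupt parsing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class; not part of the original example.
final class HtmlTextSketch {

    /** HTMLDocument subclass that collects text instead of building a document model. */
    static final class TextOnlyDocument extends HTMLDocument {
        private final StringBuilder text = new StringBuilder();

        String getText() {
            return text.toString();
        }

        @Override
        public HTMLEditorKit.ParserCallback getReader(int pos) {
            // The parser reports to this callback; only text events are recorded.
            return new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int position) {
                    text.append(data).append('\n');
                }
            };
        }
    }

    /** Parses the given HTML string and returns the text runs, one per line. */
    static String extractText(String html) {
        TextOnlyDocument doc = new TextOnlyDocument();
        // Assumption: the input is already decoded, so a charset declared
        // in a <meta> tag should be ignored rather than abort the parse.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new StringReader(html)) {
            new HTMLEditorKit().read(reader, doc, 0);
        } catch (IOException | BadLocationException e) {
            // Demo convenience: surface parse failures as unchecked exceptions.
            throw new IllegalStateException("Could not parse HTML", e);
        }
        return doc.getText();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>";
        System.out.print(extractText(html));
    }
}
```

Swapping the StringReader for the InputStreamReader used above should give the same behaviour for a live page. Passing an explicit charset to InputStreamReader is usually safer than relying on the platform default.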