Sep 29

The HTMLEditorKit class can be used to parse HTML. By subclassing the HTMLDocument class and supplying your own parser callback, you can extract just the parts of the document you are interested in.

The following example shows how to extract just the text.


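import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLEditorKit;

URL url = new URL("http://www.objects.com.au");
EditorKit editorKit = new HTMLEditorKit();
HTMLText htmlText = new HTMLText();

// Ignore any charset <meta> tag, otherwise read() can throw ChangedCharSetException

htmlText.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

// Parse the HTML (the reader uses the platform default charset and is closed afterwards)

try (Reader reader = new InputStreamReader(
        url.openConnection().getInputStream()))
{
    editorKit.read(reader, htmlText, 0);
}

// Get the extracted text

String text = htmlText.getText();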



import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class HTMLText extends HTMLDocument
{
    // stores any text found in document

    private final StringBuilder text = new StringBuilder();

    /**
    *  Returns any text found in the document during parsing
    */

    public String getText()
    {
        return text.toString();
    }

    /** Supplies the callback that the parser reports to, instead of the default document builder */

    @Override
    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new TextCallback();
    }

    class TextCallback extends HTMLEditorKit.ParserCallback
    {
       /** Invoked when text is encountered during parsing */

       @Override
       public void handleText(char[] data, int pos)
       {
          text.append(data);
          text.append('\n');
       }
    }
}
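
The same approach works for other parts of a document. As a rough sketch, the class below collects the href of every anchor tag instead of the text. The names HTMLLinks, getLinks and extract are made up for this illustration. The input is parsed from an in-memory string rather than a URL, and the IgnoreCharsetDirective property is set on the assumption that a charset declared in a meta tag can be ignored for input that is already decoded.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical example class: collects the href of every <a> tag seen during parsing
class HTMLLinks extends HTMLDocument
{
    private final List<String> links = new ArrayList<>();

    /** Returns the href values found in the document during parsing */
    List<String> getLinks()
    {
        return links;
    }

    @Override
    public HTMLEditorKit.ParserCallback getReader(int pos)
    {
        return new LinkCallback();
    }

    class LinkCallback extends HTMLEditorKit.ParserCallback
    {
        /** Invoked for each opening tag; only <a> tags that carry an href are recorded */
        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int offset)
        {
            if (tag == HTML.Tag.A)
            {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null)
                {
                    links.add(href.toString());
                }
            }
        }
    }

    /** Hypothetical helper: parses an HTML string and returns the links it contains */
    static List<String> extract(String html)
    {
        HTMLLinks doc = new HTMLLinks();
        // Assumption: a charset <meta> tag is irrelevant for text that is already decoded
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try (Reader reader = new StringReader(html))
        {
            new HTMLEditorKit().read(reader, doc, 0);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        catch (BadLocationException e)
        {
            throw new IllegalStateException(e);
        }
        return doc.getLinks();
    }
}
```

Calling HTMLLinks.extract with a page's HTML should return the href values in document order. Anchors without an href, such as named anchors, are skipped.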

written by objects

