
Tuesday 25 October, 2005
#Building a DOM with TagSoup
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.XMLReader;
import org.xml.sax.InputSource;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.dom.DOMResult;
import java.net.URL;

  URL url = new URL(whatever);
  XMLReader reader = new Parser();
  reader.setFeature(Parser.namespacesFeature, false);
  reader.setFeature(Parser.namespacePrefixesFeature, false);

  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  DOMResult result = new DOMResult();
  transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
  // here we go - a DOM built from arbitrary HTML
  return result.getNode();

See also Screenscraping HTML with TagSoup and XPath for an example using Xalan. It's less portable, but almost certainly has lower memory overhead. However, if memory is a problem, portability probably isn't ...

Oracle alert! If you're using the Oracle XSLT Transformer, you may get bitten by a bug in DOMSource. If this is the case, you'll have to transform to a StreamResult wrapped around a StringWriter, hook the String out, and then feed that into a DocumentBuilder. Still, it keeps Larry Ellison in jetplanes.
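For what it's worth, the workaround described above might look roughly like this. It's a sketch using only standard JAXP classes; the class name StringRoundTrip and the method toDocument are made up for illustration, and I'm assuming the Source you pass in is the same SAXSource wrapped around the TagSoup reader as in the snippet at the top.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: builds a DOM without going through DOMResult.
final class StringRoundTrip {

    static Document toDocument(Source source) throws Exception {
        // Identity transform: serialise the incoming events as XML text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StringWriter buffer = new StringWriter();
        transformer.transform(source, new StreamResult(buffer));

        // Re-parse the serialised text with an ordinary DocumentBuilder.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(buffer.toString())));
    }
}
```

You'd call it as StringRoundTrip.toDocument(new SAXSource(reader, new InputSource(url.openStream()))). The whole document gets held as a String and then parsed a second time, so expect it to cost more memory and time than going straight to a DOMResult.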

Further Oracle Alert: Looks like their SAXSource might be bug-bound too. Bloody useless. I haven't investigated in detail, but it throws an exception in a place where Saxon and Xalan don't. [added 31st Oct 2005]
Robert Lowe [e] [w] said Useful, thanks! [added 22nd Mar 2009]
Here's an example using TagSoup with JDK5 built-in XPath processor. [added 12th Mar 2010]
