How to convert the HTML source of a web page into an org.w3c.dom.Document in Java?


Problem description

How can I convert the HTML source of a web page into an org.w3c.dom.Document in Java?

Recommended answer

That's actually a fairly difficult thing to do robustly, because arbitrary HTML web pages are sometimes malformed (the major browsers are fairly tolerant of this), while an XML parser rejects any input that is not well-formed. You may want to look into the Swing HTML parser (javax.swing.text.html.parser), which I've never tried but looks like it may be the best option. You could also try something along the lines of the following and handle any parsing exceptions that come up (although I've only ever tried this for XML):

import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

...

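// InputStreamYouBuiltEarlierFromAnHTTPRequest is a placeholder for the
// response body stream of your own HTTP request.
try {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
    // doc is only in scope inside this try block
} catch (ParserConfigurationException e) {
    // the factory could not create a parser with the requested configuration
    ...
} catch (SAXException e) {
    // the input is not well-formed XML, which is common for real-world HTML
    ...
} catch (IOException e) {
    // reading from the stream failed
    ...
}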

...
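
To make the behaviour described above concrete, here is a minimal, self-contained sketch of the same DocumentBuilder approach. It is only an illustration under some assumptions: the class name HtmlToDom and the helpers parseOrNull and toStream are made up for this example, and the input is an in-memory string standing in for the stream of an HTTP response. It shows that well-formed (XHTML-style) markup parses into a Document, while typical "tag soup" HTML is rejected with a SAXException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
public class HtmlToDom {

    // Parses the stream as XML. Returns null when the markup is not
    // well-formed, which is the failure mode the answer warns about.
    public static Document parseOrNull(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(in);
        } catch (SAXException e) {
            // Not well-formed XML (for example an unclosed <br> or <p>).
            return null;
        } catch (ParserConfigurationException e) {
            // The default configuration should always be available.
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for the stream you would get from an HTTP response.
    public static InputStream toStream(String markup) {
        return new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String wellFormed =
            "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>";
        String tagSoup = "<html><body><p>Hello<br></body></html>";

        Document ok = parseOrNull(toStream(wellFormed));
        System.out.println("well-formed parsed: " + (ok != null));

        Document bad = parseOrNull(toStream(tagSoup));
        System.out.println("tag soup parsed: " + (bad != null));
    }
}
```

Returning null for unparseable input is just one possible design choice for a sketch like this. In real code you might instead fall back to a more lenient HTML parser when a SAXException occurs.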
