Parse Web Site HTML with Java

Question

I want to parse a simple web site and scrape information from it.

I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing with an HTML file, but it always gets stuck in an infinite loop.

    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);

    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);


    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

Recommended answer

There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

Elements body = doc.select("body");

Or if you want all links:

Elements links = doc.select("body a");
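Putting the selectors above into one runnable piece, here is a minimal sketch. It assumes the jsoup library is on the classpath. The class name `JsoupLinkSketch`, the helper `extractLinks`, the sample markup, and the `example.com` base URL are all made up for illustration. It parses an in-memory string with `Jsoup.parse` so that it runs without network access; for a live site, `Jsoup.connect(url).get()` would take the place of that call.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkSketch {

    // Hypothetical helper: returns the absolute URL of every <a href> inside <body>.
    static List<String> extractLinks(String html, String baseUri) {
        // jsoup tolerates sloppy HTML (unclosed tags and so on) and builds a DOM anyway.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element link : doc.select("body a[href]")) {
            // absUrl resolves relative hrefs against the base URI given above.
            links.add(link.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption for this sketch), deliberately not well-formed XML.
        String html = "<html><body><p>Unclosed paragraph<br>"
                + "<a href='/about'>About</a> "
                + "<a href='http://example.org/x'>X</a></body></html>";
        for (String url : extractLinks(html, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```

The `a[href]` selector is slightly stricter than `body a`: it skips anchors that have no `href` attribute.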

You no longer need to get connections or handle streams. Simple. If you have ever used jQuery, it is very similar to that.
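As a side note on the original approach: `DocumentBuilder` is an XML parser and only accepts well-formed XML, which real-world HTML often is not. The sketch below uses only the JDK. It does not reproduce the hang described in the question. It only illustrates that an unclosed tag such as `<br>` is enough for the XML parser to reject a page. The class name `XmlParserOnHtml` and the helper `isWellFormedXml` are hypothetical.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlParserOnHtml {

    // Hypothetical helper: true if the JDK XML parser accepts the markup, false otherwise.
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Every tag is closed, so this parses as XML.
        System.out.println(isWellFormedXml("<html><body><p>Hello</p></body></html>"));
        // <br> and <p> are never closed: fine for a browser, a fatal error for an XML parser.
        System.out.println(isWellFormedXml("<html><body><p>Hello<br></body></html>"));
    }
}
```

This prints `true` for the first document and `false` for the second, which is one reason an HTML-aware parser such as JSoup is a better fit for scraping.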
