Parse Web Site HTML with Java


Problem description




I want to parse a simple web site and scrape information from it.

I used to parse XML files with DocumentBuilderFactory, so I tried to do the same thing for an HTML file, but it always gets stuck in an infinite loop.

    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);

    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);


    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

Solution

There is a much easier way to do this. I suggest using jsoup. With jsoup you can do things like:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

    Elements body = doc.select("body");

Or if you want all links:

    Elements links = doc.select("body a");

You no longer need to open connections or handle streams yourself. Simple. If you have ever used jQuery, the selector syntax will feel very similar.
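As for why the DocumentBuilder approach from the question tends to struggle: real-world HTML is often not well-formed XML (unclosed `<br>` or `<p>` tags, for example), and a strict XML parser rejects such input. The sketch below uses only the JDK and two made-up markup strings to show the difference. The class name `StrictXmlDemo` and the helper `isWellFormedXml` are hypothetical names chosen for illustration, and this may not be the exact cause of the hang reported above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether markup survives a strict XML parse.
class StrictXmlDemo {

    // Returns true if the JDK's XML parser accepts the markup, false if it rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown when the markup is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> and <p> are never closed (sample strings are assumptions).
        String html = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XML.
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";

        System.out.println(isWellFormedXml(html));   // prints "false"
        System.out.println(isWellFormedXml(xhtml));  // prints "true"
    }
}
```

An HTML-aware parser such as jsoup is designed to tolerate this kind of markup, which is one reason it is the more comfortable choice for scraping.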
