Parse Web Site HTML with Java

Views: 90. Published: 2018/6/13 9:57:12. Tags: java, html, scrape


Problem description

I want to parse a simple web site and scrape information from it.

I used to parse XML files with DocumentBuilderFactory, and I tried to do the same thing for the HTML file, but it always gets into an infinite loop.

    // Download the page and save it to a local file
    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();

    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);
    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }

    in.close();
    out.close();

    // Parse the saved file with the XML DOM parser
    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?

Solution

There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
    Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

    Elements body = doc.select("body");

Or if you want all links:

    Elements links = doc.select("body a");

You no longer need to get connections or handle streams. Simple. If you have ever used jQuery, then it is very similar to that.
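
The answer above does not say what actually went wrong in the question's code. One plausible factor, offered here as an assumption rather than a confirmed diagnosis, is that DocumentBuilder is a strict XML parser, while typical HTML is not well-formed XML (for example, it contains unclosed tags such as `<br>`). The sketch below uses only the JDK; the class name `XmlParserOnHtmlDemo` and the method `countBodyElements` are hypothetical names made up for illustration. It shows the rejection only. It does not reproduce the hang the asker describes, which may instead come from the parser trying to fetch the DTD named in a page's DOCTYPE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original post.
public class XmlParserOnHtmlDemo {

    /**
     * Parses the markup with the JDK's XML DOM parser.
     * Returns the number of body elements, or -1 if the parser
     * rejects the markup as not well-formed XML.
     */
    public static int countBodyElements(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("body").getLength();
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML.
            // The JDK's default error handler may also log the error to stderr.
            return -1;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with the default factory.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag is closed, so this is valid XML as well as HTML.
        String wellFormed = "<html><body><p>Hello<br/></p></body></html>";
        // Ordinary HTML: the unclosed <br> is fine for browsers, fatal for an XML parser.
        String typicalHtml = "<html><body><p>Hello<br></p></body></html>";

        System.out.println(countBodyElements(wellFormed));  // prints 1
        System.out.println(countBodyElements(typicalHtml)); // prints -1
    }
}
```

An HTML-aware parser such as JSoup is designed to tolerate this kind of markup, which is why the answer's approach avoids the issue.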
