Jsoup not downloading complete HTML if the webpage is large. Any alternatives or workarounds?

Views: 136 | Posted: 2018/6/20 15:36:50 | Tags: java html html-parsing jsoup

Problem description

I was trying to fetch an HTML page and parse information from it, and I found that some pages were not completely downloaded by Jsoup. When I fetched the same page with curl on the command line, the complete page was downloaded. Initially I thought the problem was site specific, but when I tried parsing other large web pages at random with Jsoup, none of them were downloaded in full. Specifying a user agent and a timeout did not help. Here is the code I tried:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class JsoupTest {
        public static void main(String[] args) throws MalformedURLException, UnsupportedEncodingException, IOException {
            String urlStr = "http://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States";
            URL url = new URL(urlStr);

            // Read the whole page through a plain URL stream (used for Article 3).
            String content = "";
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
                for (String line; (line = reader.readLine()) != null;) {
                    content += line;
                }
            }

            // Article 1: Jsoup with default connection settings.
            String article1 = Jsoup.connect(urlStr).get().text();

            // Article 2: Jsoup with an explicit user agent, referrer, and timeout.
            String article2 = Jsoup.connect(urlStr)
                    .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                    .referrer("http://www.google.com")
                    .timeout(30000)
                    .execute()
                    .parse()
                    .text();

            // Article 3: Jsoup parsing the content downloaded via the URL stream.
            String article3 = Jsoup.parse(content).text();

            System.out.println("ARTICLE 1 : " + article1);
            System.out.println("ARTICLE 2 : " + article2);
            System.out.println("ARTICLE 3 : " + article3);
        }
    }

For Articles 1 and 2, where Jsoup connects to the website itself, I do not get the complete content. For Article 3, where the page is downloaded through a URL stream and only parsed by Jsoup, I get the complete page. I have tried this with Jsoup 1.8.1 and Jsoup 1.7.2.

Recommended answer

Use the maxBodySize method to raise the limit on how many bytes Jsoup reads from the response body:

    String article = Jsoup.connect(urlStr).maxBodySize(Integer.MAX_VALUE).get().text();
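
The symptom in the question is consistent with a cap on the response body: a connection made through Jsoup stops reading after a maximum number of bytes (as far as I know, 1 MB by default in the 1.7.x and 1.8.x releases), so larger pages arrive truncated. The following is a minimal, standard-library-only sketch of that behavior. It is not Jsoup's actual implementation; the class `BodyCapSketch` and the method `readCapped` are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Illustration only: a hypothetical capped reader, not Jsoup's real code. */
final class BodyCapSketch {

    /**
     * Reads at most maxSize bytes from the stream.
     * A maxSize of 0 is treated as "no limit" (an assumption of this sketch
     * that mirrors the documented meaning of Jsoup's maxBodySize(0)).
     */
    static byte[] readCapped(InputStream in, int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must be >= 0");
        }
        boolean unlimited = (maxSize == 0);
        int remaining = maxSize;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (unlimited || remaining > 0) {
                // Never request more bytes than the cap still allows.
                int want = unlimited ? buffer.length : Math.min(buffer.length, remaining);
                int read = in.read(buffer, 0, want);
                if (read == -1) {
                    break; // end of stream: the whole body was read
                }
                out.write(buffer, 0, read);
                if (!unlimited) {
                    remaining -= read;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = new byte[3 * 1024 * 1024]; // stand-in for a 3 MB page
        int oneMb = 1024 * 1024;

        // With a 1 MB cap, only the first megabyte survives.
        System.out.println(readCapped(new ByteArrayInputStream(page), oneMb).length); // 1048576

        // With the cap disabled, the whole body is read.
        System.out.println(readCapped(new ByteArrayInputStream(page), 0).length); // 3145728
    }
}
```

In Jsoup itself the corresponding setting is `Connection.maxBodySize(int)`. Its documentation describes a value of 0 as unlimited, so `Jsoup.connect(urlStr).maxBodySize(0).get()` should be an alternative to passing `Integer.MAX_VALUE`. Either way the whole page is held in memory, so an explicit, generous limit may be the safer choice when fetching untrusted URLs.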
