Not able to parse complete HTML of a URL using Jsoup


Problem description

The Jsoup library is not parsing the complete HTML of a given URL; some div elements are missing from the original HTML of the page.

The interesting thing: http://facebook.com/search.php?init=s:email&q=somebody@gmail.com&type=users

If you fetch the URL above on Jsoup's official site, http://try.jsoup.org/, it correctly shows the exact HTML of the page, but the same result cannot be reproduced in a program using the Jsoup library.

Here is my Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String url = "http://facebook.com/search.php?init=s:email&q=somebody@gmail.com&type=users";

// Fetch the page, identifying as a desktop Chrome browser
Document document = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36")
        .get();

String question = document.toString();
System.out.println("whole content: " + question);

I clearly set the same user agent that their official site uses, but in the result I see only about 70% of the original HTML; somewhere in the middle, a few div tags, the ones holding the data I want, are missing.

I have tried and tried, with no luck. Why are a few div tags missing from the document?

You can put the URL directly into your browser; if you are logged into Facebook, you can see the response: "No results found for your query. Check your spelling or try another term." This is what I am looking for when Jsoup parses the HTML of the URL above.

But unfortunately, this part is missing. The response actually lives in the div with id "pagelet_search_no_results", but I could not find a div with that id in the parsed HTML. I tried many of the methods Jsoup provides, but no luck.
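
For reference, a minimal sketch of how the missing div can be checked for in the parsed document, using Jsoup's getElementById (the id is the one from the question; the snippet is illustrative and assumes document was fetched as above):

import org.jsoup.nodes.Element;

// Assumes `document` was fetched as in the snippet above
Element noResults = document.getElementById("pagelet_search_no_results");
if (noResults == null) {
    System.out.println("div#pagelet_search_no_results is missing from the parsed HTML");
} else {
    System.out.println("Found it: " + noResults.text());
}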

Solution

You should also set a large timeout, e.g.:

Document document = Jsoup.connect(url)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)  // 0 removes Jsoup's default 1 MB cap on the response body
        .timeout(600000) // ten minutes, in milliseconds
        .get();
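
The maxBodySize(0) call is likely the decisive setting here: by default Jsoup truncates the downloaded body at 1 MB, so the tail of a large page, including the divs in question, can be silently dropped; passing 0 disables the limit. The long timeout simply keeps a slow response from failing with a SocketTimeoutException before the body has been fully read.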


