Fetching the website with Jsoup - page view source and Jsoup shows different content

Problem description

I use Jsoup to scrape a website:

doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();    

Here is the link:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

I added the rpp=40 parameter to the link in the command line to display 40 results per page, and I can see all of the results in the page's view source. I know that Jsoup handles static content only and cannot fetch content that a site generates with AJAX/JS libraries. But why can't Jsoup retrieve the same content that I see in the browser via view source? View source shows 40 results, whereas Jsoup retrieves elements from only 10 results. How can I obtain every element that is visible via view source?

Solution

Short answer: Jsoup can't execute JavaScript.

Long answer:

http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40

The webpage you are requesting accepts an HTTP GET with parameters. A normal browser sends the params and loads the page, but not with Willowbrook checked (in your example). The page loads its JavaScript after the initial load, and that JavaScript then ticks the filter checkboxes that narrow the search results. Therefore, when you use Jsoup you get more results, because it loads 'state=IL' without the 'willowbrook' filter applied.
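
One detail worth checking alongside the explanation above: in the URL, `sortby=rating` and `rpp=40` come after the `#`, so they belong to the fragment. HTTP clients do not send the fragment to the server; it is only available to scripts running in the page. The following is a minimal sketch using `java.net.URI` that splits the URL into the part that is requested and the part that stays on the client. The class and method names (`UrlParts`, `sentToServer`, `clientOnly`) are made up for illustration and are not part of Jsoup.

```java
import java.net.URI;

public class UrlParts {
    // Path plus query string: the part that ends up in the HTTP request.
    static String sentToServer(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getRawPath() + (query == null ? "" : "?" + query);
    }

    // Fragment (the text after '#'): kept by the client, not sent in the request.
    static String clientOnly(String url) {
        String fragment = URI.create(url).getRawFragment();
        return fragment == null ? "" : fragment;
    }

    public static void main(String[] args) {
        String url = "http://www.yelp.com/search?find_desc=restaurant"
                + "&find_loc=willowbrook%2C+IL&ns=1"
                + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";
        System.out.println("Sent to server: " + sentToServer(url));
        System.out.println("Client only:    " + clientOnly(url));
    }
}
```

If the server is meant to see these settings, they would have to appear in the query string, before the `#`. Whether Yelp honors `rpp` or `sortby` as query parameters is not something the original post confirms, so treat that as an assumption to test.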
