Trouble scraping from a page

Question

Referring to one of my previous questions (http://stackoverflow.com/questions/27134612/unable-to-get-actual-markup-from-a-page-with-beautifulsoup), I have to scrape all the reviews of a hotel, for example this hotel (http://www.starwoodhotels.com//sheraton/property/reviews/index.html?language=en_US&propertyID=115).

Using BeautifulSoup, I first collect the links to all review pages from the pagination inside the div with class BVRRPager BVRRPageBasedPager, and then scrape the reviews from every page. The problem with BeautifulSoup is that the content of div.BVRRRatingSummary does not come along (try loading that page with JS disabled).

I have already scraped the reviews using Selenium, but my client does not want to use Selenium because it loads the full page with JS and images.

I want to know what kind of process they might be using to load the reviews. And is there any way I can scrape the content in div.BVRRRatingSummary with BeautifulSoup?

Recommended answer

You could try using Firefox with the Firebug add-on. Open Firebug while the webpage loads, go to the Net tab, and then click XHR. That will show you which JSON files are being loaded. You can then try to fetch those files directly and work with them using a library like simplejson.
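
As a rough illustration of fetching such a file directly, here is a minimal sketch in Python. It uses the standard library's json module, which offers the same loads interface as simplejson. The URL and the "reviews", "rating" and "text" keys are hypothetical placeholders, not taken from the hotel site; substitute whatever the XHR list actually shows, and note that the real response may be shaped differently.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the request URL seen under Net > XHR.
REVIEWS_URL = "https://example.com/reviews.json?page={page}"


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    # Some servers reject requests without a browser-like User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_reviews(payload):
    """Return (rating, text) pairs from a decoded payload.

    The key names used here are assumptions for illustration only.
    """
    return [
        (item.get("rating"), item.get("text", ""))
        for item in payload.get("reviews", [])
    ]
```

With a real endpoint filled in, extract_reviews(fetch_json(REVIEWS_URL.format(page=1))) would return the reviews of the first page, and looping over the page number would cover the rest without loading any scripts or images.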
