Scraping a website with data hidden under "read more"


Question

I am trying to scrape reviews from Tripadvisor.com, and I want to get the text hidden behind the site's "Read more" button. Is there any way to scrape this without using Selenium?

So far, this is the code I have used:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
rsp_soup = BeautifulSoup(resp.text, 'html.parser')
rsp_soup.find_all(attrs={"class": "hotels-review-list-parts-ExpandableReview__reviewText--3oMkH"})

But it does not return the content hidden under "Read more".

Recommended answer

Reviews are only partially revealed in the HTML until you click "Read more". Clicking it does not actually make an Ajax call; it updates the page from data contained in window.__WEB_CONTEXT__. You can access this data by looking at the <script> tag in which it appears:

<script>
     window.__WEB_CONTEXT__={pageManifest:{"assets":["/components/dist/@ta/platform.polyfill.084d8cdf5f.js","/components/dist/runtime.56c5df2842.js", ....  }
</script>
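
As a rough illustration of locating such a tag, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup, so it runs without extra dependencies. The sample markup, the ScriptCollector class and the find_web_context_script helper are all made up for this example and are not Tripadvisor's actual page or API:

```python
from html.parser import HTMLParser

# Hypothetical sample page; the real markup is far larger.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[]}};</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last one.
        if self._in_script:
            self.scripts[-1] += data


def find_web_context_script(html):
    """Return the body of the first script mentioning window.__WEB_CONTEXT__, or None."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for body in parser.scripts:
        if "window.__WEB_CONTEXT__" in body:
            return body
    return None


script_text = find_web_context_script(SAMPLE_HTML)
print(script_text)
```

Returning None when no tag matches makes the "page layout changed" case explicit instead of failing later with an attribute error.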

Once you have it, you can extract and process the data, which is almost JSON (only the pageManifest key is unquoted). Here is the full code:

import json
import re

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')

# Locate the <script> tag whose content assigns window.__WEB_CONTEXT__
soup = BeautifulSoup(resp.content, 'html.parser')
data = soup.find('script', string=re.compile(r'window\.__WEB_CONTEXT__')).string

# Some text processing to make the tag content valid JSON:
# drop the assignment, quote the pageManifest key, and cut the final character
pageManifest = json.loads(data.replace('window.__WEB_CONTEXT__=','').replace('{pageManifest:', '{"pageManifest":')[:-1])

# Keep the review list from the last cache entry that has one
reviews = []
for entry in pageManifest['pageManifest']['apolloCache']:
    try:
        reviews = entry['result']['locations'][0]['reviewList']['reviews']
    except (KeyError, IndexError, TypeError):
        pass

print([review['text'] for review in reviews])

Output

['Do arrange for airport transfers! From the airport, you will be taking a van for around 20 minutes, then you\'ll be transferred to a banca/boat for a 25 minute ride to the resort. Upon arrival, you\'ll be greeted by a band that plays their "welcome, welcome" song and in our case, we were met by Maria (awesome gal!) who introduced the group to the resort facilities and checks you in at the bar.I booked a deluxe room, which is actually a duplex with 2 adjoining rooms, ideal for families, which accommodates 4 to a room.Rooms are clean and bed is comfortable.Potable water is provided upon check in , but is chargeable thereafter.Don\'t worry, ...FULL REVIEW...',
 "Stayed with my wife and 2 children, 10y and 13y. ...FULL REVIEW...",
 'Beginning at now been in Coron for a couple of   ...FULL REVIEW...',
 'This was the most beautiful and relaxing place   ...FULL REVIEW...',
 'We spent 2 nights at El rio. It was incredible,  ...FULL REVIEW... ']
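
The text processing and traversal steps can also be tried offline. The sketch below runs them on a small made-up string that only mimics the layout described in the answer. The helper names parse_web_context and collect_review_texts and the SAMPLE data are hypothetical. Two assumptions differ slightly from the answer's code: the script body is assumed to end with a semicolon, which is stripped explicitly, and reviews are gathered from every cache entry that has a review list rather than only the last one.

```python
import json

PREFIX = "window.__WEB_CONTEXT__="


def parse_web_context(script_text):
    """Convert the script body to a dict, assuming the layout shown in the answer."""
    body = script_text.strip()
    if body.startswith(PREFIX):
        body = body[len(PREFIX):]
    body = body.rstrip(";")  # assumption: the assignment ends with a semicolon
    # The top-level key is unquoted in the page, so quote it to get valid JSON.
    body = body.replace("{pageManifest:", '{"pageManifest":', 1)
    return json.loads(body)


def collect_review_texts(context):
    """Gather review texts from every apolloCache entry that holds a review list."""
    texts = []
    for entry in context["pageManifest"].get("apolloCache", []):
        try:
            reviews = entry["result"]["locations"][0]["reviewList"]["reviews"]
        except (KeyError, IndexError, TypeError):
            continue  # this cache entry has a different shape; skip it
        texts.extend(review["text"] for review in reviews)
    return texts


# Made-up data in the same shape; not real Tripadvisor content.
SAMPLE = (
    'window.__WEB_CONTEXT__={pageManifest:{"apolloCache":['
    '{"result":{"other":1}},'
    '{"result":{"locations":[{"reviewList":{"reviews":'
    '[{"text":"Great stay"},{"text":"Lovely beach"}]}}]}}]}};'
)

print(collect_review_texts(parse_web_context(SAMPLE)))
# prints ['Great stay', 'Lovely beach']
```

Catching only KeyError, IndexError and TypeError keeps unrelated cache entries from aborting the loop while still letting unexpected errors surface.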
