Scraping hidden data [ window.__WEB_CONTEXT__= ] ... preferably with Scrapy

Question

I'm scraping TripAdvisor. My current problem is scraping the hotel stars (not the average user rating [bubbles], but the hotel class rating) of a given hotel, and later I'll run into the problem of reviews being hidden behind "read more". https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html Fortunately, I know where to find both pieces of data. They are in the page, inside this tag:

<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[.... 
....
</script>

Search https://pastebin.com/Ww3ugxFR for "The view was fantastic!!" (an example of hidden text) or for '"star":' to find the hotel stars.

I want to learn how to access this tag.

Here is my example of how it doesn't work. I need to learn how to tell a CSS selector (or another tool) how to address this specific tag and how to extract the data from it. In this example I would just load the response and do a simple pattern search. I guess one could also load it with JSON and extract the data from there, but I'm not too firm with JSON yet:

hotel_CONTEXT = response.css("script text=window.__WEB_CONTEXT ::attr(pageManifest)).extract()

pattern_hotelstar = re.compile(r'star":\["\d')
matches_hotelstar = pattern_hotelstar.findall(hotel_CONTEXT)
Hotel_stars = str(matches_hotelstar).split('"')[2].split("'")[0]
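
As a rough sketch of the pattern-search idea, a capture group can pull the digit out directly instead of splitting the string form of the match list. The sample text and the '"star":["4.0"]' shape are assumptions inferred from the pattern above, not taken from the real page; in Scrapy, response.text would supply the page source as a string.

```python
import re

# Hypothetical fragment of the page source; the real data may look different.
page_text = '..."hotel":{"name":"El Rio y Mar Resort","star":["4.0"]}...'

# The parentheses capture only the number that follows '"star":["'.
pattern_hotelstar = re.compile(r'"star":\["(\d+(?:\.\d+)?)')

match = pattern_hotelstar.search(page_text)
# search() returns None when nothing matches, so guard before calling group().
hotel_stars = match.group(1) if match else None
print(hotel_stars)
```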

Apparently what I want to achieve is possible with BeautifulSoup (see: Scraping a website with data hidden under "read more" ... however, I got JSON errors when trying to replicate it), but generally I'd prefer a solution with Scrapy.

Andrej Kesely provided an excellent solution to my problem! His code works so well that I want to fully understand it! Here is what I think I understand from the code and where I just don't understand his sorcery ;) :

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)

Andrej searches the whole html_text for the pattern that starts with "window.__WEB...", extends it over any character (.) any number of times (*) in a non-greedy way (?), and ends with a ";". I don't understand why there is a capturing group with } in it and why } was not just put at the end, given that the script ends with }; (how did Andrej find this out? Is that a general pattern for these, or did he print the whole page and look it up?). I also don't understand why it had to be non-greedy. group(1) selects everything within the first parentheses, leaving window.__WEB_CONTEXT__= out. I guess this has something to do with loading the result with json. Same goes for

data = data.replace('pageManifest', '"pageManifest"')   
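
The following sketch may help with both points. It runs the regular expression in its lazy and greedy forms on a made-up, heavily shortened page source (html_text here is hypothetical, as is the second window.other assignment), and then shows why the replace call is needed before json.loads accepts the text.

```python
import json
import re

# Hypothetical stand-in for the real page source, with a second
# assignment after the one we want.
html_text = (
    '<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[],'
    '"reviews":[{"title":"Just WOW!!"}]}};'
    'window.other={"x":1};</script>'
)

# Lazy (.*?) stops at the first "};" it can reach. The "}" sits inside the
# parentheses so that group(1) still ends with the closing brace.
lazy = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)

# Greedy (.*) runs on to the last "};" and drags window.other along.
greedy = re.search(r'window\.__WEB_CONTEXT__=(.*});', html_text).group(1)

print(lazy)
print(greedy)


def is_valid_json(text):
    """Return True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# pageManifest is an unquoted JavaScript key; JSON requires quoted keys.
print(is_valid_json(lazy))

fixed = lazy.replace('pageManifest', '"pageManifest"')
data = json.loads(fixed)
print(data['pageManifest']['reviews'][0]['title'])
```

The trailing ";" stays outside the group because it belongs to the JavaScript statement, not to the object that json.loads should parse.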

Then Andrej creates a function called traverse that is later called with data. In the if statement Andrej checks whether the input is a dictionary. In the next step he loops through the keys (k) and values (v) of the dictionary. If k == "reviews" he yields the value. If not, "yield from the function"?? I'm also lost with the elif and the check whether val is a list... In general, what is the output v of the function? How would I change the function to look for more dictionary keys, since else is already occupied by this yield from?

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)
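
One possible way to look for several keys at once is sketched below; it is a variation for illustration, not the answerer's code. The wanted parameter and the miniature data dictionary (including the "star" value) are hypothetical. The else branch stays as it is; only the test on k changes, and the key is yielded along with the value so the caller can tell the results apart.

```python
def traverse(val, wanted):
    """Recursively yield (key, value) for every dict key listed in `wanted`."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k in wanted:
                yield k, v                       # found: hand the value to the caller
            else:
                yield from traverse(v, wanted)   # not found: search one level deeper
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v, wanted)       # search every element of the list
    # Strings, numbers and None yield nothing, which ends the recursion.


# Hypothetical miniature of the parsed data; the real structure is far larger.
data = {
    "pageManifest": {"assets": ["a.js"]},
    "hotel": {
        "star": ["4.0"],
        "blocks": [{"reviews": [{"title": "Just WOW!!", "rating": 5}]}],
    },
}

found = list(traverse(data, {"reviews", "star"}))
for key, value in found:
    print(key, value)
```

yield from simply passes on everything the inner call yields, so a match found many levels down still reaches the outermost loop.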
 

Here Andrej loops over traverse(data) (a dictionary, right?), since we've got multiple reviews on this page. In the nested loop Andrej gives each dictionary within the reviews the name r, and with dictionary_name["key"] he retrieves the stored value. Am I right?

for reviews in traverse(data):
  for r in reviews:
    print(r['userProfile']['displayName'])
    print(r['title'])
    print(r['text'])
    print('Rating:', r['rating'])
    print('-' * 80)
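
A small check on made-up data (the data dictionary below is hypothetical) suggests the reading is nearly right, with one correction: traverse(data) is a generator rather than a dictionary. Each item it yields is the list stored under one "reviews" key, and each r in that list is a single review dictionary.

```python
import types


def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)


# Hypothetical data with two separate "reviews" lists at different depths.
data = {
    "a": {"reviews": [{"title": "first", "rating": 5},
                      {"title": "second", "rating": 4}]},
    "b": [{"reviews": [{"title": "third", "rating": 3}]}],
}

result = traverse(data)
print(isinstance(result, types.GeneratorType))

titles = []
for reviews in result:       # one list per "reviews" key found
    for r in reviews:        # one dictionary per review
        titles.append(r['title'])
print(titles)
```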

Sorry for all these rookie questions.

Recommended answer

This script will print all reviews and review-rating found on the page:

import re
import json
import requests


url = 'https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html'
html_text = requests.get(url).text

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)
data = data.replace('pageManifest', '"pageManifest"')
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)

for reviews in traverse(data):
    for r in reviews:
        print(r['userProfile']['displayName'])
        print(r['title'])
        print(r['text'])
        print('Rating:', r['rating'])
        print('-' * 80)

Prints:

BBDoll619
Just WOW!!
Okay, I didn't know this resort would be mainly couples and honeymooners as I went with 2 friends. We weren't uncomfortable though and met lots of nice people from across the globe and 1 couple from the US. This resort can only be reached by boat, so it is very secluded. We stayed in bungalow #2. It was rustic, but beautiful and right on the beach. Everyone who worked in the resort was friendly and very accommodating. We ate most meals at the resort which was pretty good. We had happy hour at the pier bar every day which was from 4-7pm. They had half off certain drinks and food specials. It was very nice relaxing, enjoying a great drink and watching the sunset. You can snorkel right in front of the resort which was so cool! We snorkeled for 2 hours!! The best is right by the floating bungalows where they did massages. Speaking of massages....OMG! It was heaven!! Very affordable and different. When you lie face down, you look into a cut out in the floor, so you can view the water and fish swimming by. I loved it!! We did an island hopping tour and it was not an issue coming from this resort. When we got into Coron town and passed by all the hotels in that area, we were so glad and thankful we chose El Rio Y Mar. Coron Town is very dirty, dusty, full of young backpackers and the hotels look subpar. It's fine if you're on a budget. I get it, but us girls/mom/friends wanted to treat ourselves. That we did! One day we went on a guided hike to the top of a closeby mountain. The view was fantastic!! I highly recommend this resort and would definitely return.
Rating: 5
--------------------------------------------------------------------------------
MaricrisAndPiotr
Amazing staff
The best customer experience we ever had! the school of fishes within the resort are amazing, very quite, very clean and well maintained rooms and outdoor surroundings. Our island trip organized by them is one of the best experience we had in our Coron trip. 
Kudos to El Rio highly recommended
Rating: 5
--------------------------------------------------------------------------------

...and so on.
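
Since the question asks for Scrapy, the same logic could be wrapped in a function that takes the page source as a string. This is a sketch only: extract_reviews and sample_html are hypothetical names, the sample page is made up, and nothing here was run against the live site. In a Scrapy callback, response.text holds the HTML as a string, so "yield from extract_reviews(response.text)" inside parse would be one way to wire it up.

```python
import json
import re

WEB_CONTEXT_RE = re.compile(r'window\.__WEB_CONTEXT__=(.*?});')


def traverse(val):
    """Yield every value stored under a 'reviews' key, at any depth."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)


def extract_reviews(html_text):
    """Yield one plain dict per review found in the page source."""
    match = WEB_CONTEXT_RE.search(html_text)
    if match is None:
        return  # no embedded data, e.g. a changed layout or a blocked request
    raw = match.group(1).replace('pageManifest', '"pageManifest"')
    data = json.loads(raw)
    for reviews in traverse(data):
        if not isinstance(reviews, list):
            continue  # a 'reviews' key may hold something other than a list
        for r in reviews:
            if not isinstance(r, dict):
                continue
            yield {
                'user': (r.get('userProfile') or {}).get('displayName'),
                'title': r.get('title'),
                'text': r.get('text'),
                'rating': r.get('rating'),
            }


# Hypothetical, heavily shortened page source for demonstration.
sample_html = (
    '<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[]},'
    '"hotel":{"reviews":[{"userProfile":{"displayName":"BBDoll619"},'
    '"title":"Just WOW!!","text":"The view was fantastic!!","rating":5}]}};'
    '</script>'
)

items = list(extract_reviews(sample_html))
print(items)
```

Using .get instead of square brackets means a review with a missing field produces None for that field rather than raising KeyError.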
