How do I choose "All languages" while scraping TripAdvisor's reviews?


Problem description

I am new to both Python programming and Scrapy. I have been trying to scrape some reviews from www.tripadvisor.com. Some hotels have reviews in languages other than English, but when I use the following code, I only get the ratings from the English-language reviews:

import scrapy
from scrapy.http import Request
import re


class ReviewScrapeSpider(scrapy.Spider):
    name = 'review_scrape'
    allowed_domains = ['tripadvisor.com']
    start_urls = ['https://www.tripadvisor.com/Hotel_Review-g60970-d226251-Reviews-Murfreesboro_Extended_Stay_Hotel-Murfreesboro_Tennessee.html']

    def parse(self, response):
        hotel_name = response.xpath('//div[@class="ui_column is-12-tablet is-10-mobile hotelDescription"]/h1[@id="HEADING"]/text()').extract_first()
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, meta={'hotel_name': hotel_name}, callback=self.parse_reviews)
        url = response.url
        if not re.findall(r'or\d', url):
            next_page = re.sub(r'(-Reviews-)', r'\g<1>or5-', url)
        else:
            pagenum = int(re.findall(r'or(\d+)-', url)[0])
            pagenum_next = pagenum + 5
            next_page = url.replace('or' + str(pagenum), 'or' + str(pagenum_next))
        yield scrapy.Request(
            next_page,
            meta={'dont_redirect': True},
            callback=self.parse)

    def parse_reviews(self, response):
        rating = response.xpath('//div[@class="ui_column is-10-desktop is-12-tablet is-12-mobile"]/span[@class]').extract_first()[-11:-9]
        date = response.xpath('//*[@class="ratingDate"]/text()').extract_first()
        yield {
            'hotels_name': response.meta['hotel_name'],
            'rating': rating,
            'date': date
        }

However, I later realized that the review language is set to "English" by default. Every time I close and reopen my browser and then open a hotel's reviews page, the language is set to English again.

My question is: how do I select the "All languages" option in my code so that I can scrape reviews in all languages? I have tried everything I could think of but couldn't solve the problem. I would highly appreciate your help. Thanks.

Recommended answer

Perhaps this answer is a bit late, but for others who are interested: you can easily do so by appending

?filterLang=ALL

to the end of the URL, and that should do the trick.
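As a minimal sketch of the suggestion above, the helper below adds `filterLang=ALL` to a URL's query string using only the standard library. The function name `with_all_languages` is hypothetical and not part of the original answer, and whether TripAdvisor still honors this parameter has not been verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_all_languages(url: str) -> str:
    """Return `url` with the query parameter filterLang=ALL set.

    Hypothetical helper: it only rewrites the URL string and makes no
    request, so it cannot confirm that the site accepts the parameter.
    """
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any previous filterLang value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "filterLang"
    ]
    query.append(("filterLang", "ALL"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    hotel_url = (
        "https://www.tripadvisor.com/Hotel_Review-g60970-d226251-Reviews-"
        "Murfreesboro_Extended_Stay_Hotel-Murfreesboro_Tennessee.html"
    )
    print(with_all_languages(hotel_url))
```

In the spider from the question, one option would be to apply this helper to the entry in `start_urls` and to each `next_page` URL before it is passed to `scrapy.Request`, so that every listing page is requested with the language filter.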
