How do I choose "All languages" while scraping TripAdvisor's reviews?
Question
I am new to both Python and Scrapy. I have been trying to scrape reviews from www.tripadvisor.com. Some hotels have reviews in languages other than English, but with the following code I only get the ratings from the English-language reviews:
import scrapy
from scrapy.http import Request
import re

class ReviewScrapeSpider(scrapy.Spider):
    name = 'review_scrape'
    allowed_domains = ['tripadvisor.com']
    start_urls = ['https://www.tripadvisor.com/Hotel_Review-g60970-d226251-Reviews-Murfreesboro_Extended_Stay_Hotel-Murfreesboro_Tennessee.html']

    def parse(self, response):
        hotel_name = response.xpath('//div[@class="ui_column is-12-tablet is-10-mobile hotelDescription"]/h1[@id="HEADING"]/text()').extract_first()
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, meta={'hotel_name':hotel_name}, callback=self.parse_reviews)

        url = response.url
        if not re.findall(r'or\d', url):
            next_page = re.sub(r'(-Reviews-)', r'\g<1>or5-', url)
        else:
            pagenum = int(re.findall(r'or(\d+)-', url)[0])
            pagenum_next = pagenum + 5
            next_page = url.replace('or' + str(pagenum), 'or' + str(pagenum_next))
        yield scrapy.Request(
            next_page,
            meta={'dont_redirect': True},
            callback=self.parse)

    def parse_reviews(self, response):
        rating = response.xpath('//div[@class="ui_column is-10-desktop is-12-tablet is-12-mobile"]/span[@class]').extract_first()[-11:-9]
        date = response.xpath('//*[@class="ratingDate"]/text()').extract_first()
        yield {
            'hotels_name':response.meta['hotel_name'],
            'rating':rating,
            'date': date
        }
However, I later realized that the review language is set to "English" by default. Every time I close and reopen my browser and then open a hotel's reviews page, the language is set to English again.
My question is: how do I select the "All languages" option in my code so that I can scrape reviews in all languages? I have tried everything I could think of but couldn't solve the problem. I would highly appreciate your help. Thanks.
Recommended answer
Perhaps this answer is a bit late, but for anyone else interested: you can do this by appending "?filterLang=ALL" to the end of the URL, and that should do the trick.
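As a minimal sketch of that suggestion, the helper below adds the filterLang=ALL query parameter to a URL using only the standard library. The function name with_all_languages is hypothetical (it is not part of Scrapy or of the original post), and whether TripAdvisor honors this parameter is taken from the answer above, not verified here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_all_languages(url, param="filterLang", value="ALL"):
    """Return `url` with filterLang=ALL set in its query string.

    Hypothetical helper: the parameter name and value come from the
    answer above; nothing here checks that the site accepts them.
    """
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any earlier filterLang value.
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    base = (
        "https://www.tripadvisor.com/Hotel_Review-g60970-d226251-Reviews-"
        "Murfreesboro_Extended_Stay_Hotel-Murfreesboro_Tennessee.html"
    )
    print(with_all_languages(base))
```

In the spider from the question, this could be applied to each entry in start_urls and to next_page before it is passed to scrapy.Request, so that every paginated listing request carries the parameter. The individual review links built with response.urljoin may need the same treatment; that is an assumption to check against the live site.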