Scraping data from TripAdvisor using R


Problem description

I want to create a crawler that scrapes some data from TripAdvisor. Ideally, it will (a) identify the links to all locations to crawl, (b) collect links to all attractions in each location, and (c) collect the destination names, dates and ratings for all reviews. I'd like to focus on part (a) for now.

Here is the website I'm starting off with: http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html

There is a problem here: the page initially shows only the top 10 destinations, and clicking "See more popular destinations" expands the list. It appears to use a JavaScript function to achieve this. Unfortunately, I'm not familiar with JavaScript, but I think the following chunk may give clues about how it works:

<div class="morePopularCities" onclick="ta.call('ta.servlet.Tourism.showNextChildPage', event, this)">
<img id='lazyload_2067453571_25' height='27' width='27' src='http://e2.tacdn.com/img2/x.gif'/>
See more popular destinations in New Zealand </div>

I've found a few useful web scraping packages for R, such as rvest, RSelenium, XML and RCurl, but of these only RSelenium appears able to handle this. Having said that, I still haven't been able to work it out.

Here is some relevant code:

tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
RSelenium::startServer()
remDr = RSelenium::remoteDriver(browserName = "internet explorer")
remDr$open()
remDr$navigate(tu)
# remDr$executeScript("JS_FUNCTION")

The last line should do the trick, but I'm not sure which function I need to call.

Once I manage to expand this list, I will be able to obtain the links for each destination the same way I would solve part (b), which I think I've already solved (for those interested):

library(rvest)
tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
tu = html_session(tu)
tu %>% html_nodes(xpath='//div[@class="popularCities"]/a') %>% html_attr("href")
 [1] "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html"                      
 [2] "/Tourism-g255106-Auckland_North_Island-Vacations.html"                                     
 [3] "/Tourism-g255117-Blenheim_Marlborough_Region_South_Island-Vacations.html"                  
 [4] "/Tourism-g255111-Rotorua_Rotorua_District_Bay_of_Plenty_Region_North_Island-Vacations.html"
 [5] "/Tourism-g255678-Nelson_Nelson_Tasman_Region_South_Island-Vacations.html"                  
 [6] "/Tourism-g255113-Taupo_Taupo_District_Waikato_Region_North_Island-Vacations.html"          
 [7] "/Tourism-g255109-Napier_Hawke_s_Bay_Region_North_Island-Vacations.html"                    
 [8] "/Tourism-g612500-Wanaka_Otago_Region_South_Island-Vacations.html"                          
 [9] "/Tourism-g255679-Russell_Bay_of_Islands_Northland_Region_North_Island-Vacations.html"      
[10] "/Tourism-g255114-Tauranga_Bay_of_Plenty_Region_North_Island-Vacations.html"  
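The hrefs in this output are relative to the site root, so the host has to be prepended before they can be crawled in part (b). A minimal base-R sketch, assuming the `/Tourism-g<id>-<name>-Vacations.html` pattern seen in the output holds for every destination; `base_url` and `parse_destination_links` are hypothetical names introduced here for illustration:

```r
# Hypothetical helper: turn relative TripAdvisor hrefs into absolute URLs and
# pull out the numeric geo id and a readable destination name.
# Assumes the "/Tourism-g<id>-<name>-Vacations.html" pattern shown above.
base_url <- "http://www.tripadvisor.co.nz"

parse_destination_links <- function(hrefs, base = base_url) {
  pattern <- "^/Tourism-g([0-9]+)-(.+)-Vacations\\.html$"
  hrefs <- hrefs[grepl(pattern, hrefs)]  # drop anything that does not match
  data.frame(
    geo_id = sub(pattern, "\\1", hrefs),
    name   = gsub("_", " ", sub(pattern, "\\2", hrefs)),
    url    = sprintf("%s%s", base, hrefs),
    stringsAsFactors = FALSE
  )
}

hrefs <- c(
  "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html",
  "/Tourism-g255106-Auckland_North_Island-Vacations.html"
)
destinations <- parse_destination_links(hrefs)
print(destinations$url)
```

Keeping the geo id next to the URL may be handy later, since the same id appears in the links of the destination's other pages.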

As for step (c), I've found some links that might be helpful:

https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R
http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html

If you have any tips on how to expand the list of top destinations, or on how to go through the other steps in a smarter way, please let me know. I'd be really keen to hear from you.

Many thanks in advance!

Solution

Basically, you can try to send a click event to the <div class="morePopularCities">. Something like this:

remDr$navigate(tu)
div <- remDr$findElement("class", "morePopularCities")
div$clickElement()

To expand all locations, you can repeat the above logic in a while loop. Keep clicking on the <div> until no more items are available (that is, until the div is no longer in the page):

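# Find every "see more" div currently in the page
divs <- remDr$findElements("class", "morePopularCities")
while (length(divs) > 0) {
  for (div in divs) {
    div$clickElement()
  }
  # Look again; the list is empty once the div has gone
  divs <- remDr$findElements("class", "morePopularCities")
}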
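If the div is slow to disappear, or never does, the loop above could keep clicking indefinitely. A hedged variation that caps the number of clicks and pauses between them might look like the following. `expand_all`, `max_clicks` and `pause` are hypothetical names, not part of RSelenium, and a stand-in driver is used so the snippet runs without a browser:

```r
# Hypothetical helper: click the "see more" div until it disappears, but give
# up after max_clicks so a stuck page cannot loop forever.
# `driver` is anything with a findElements(using, value) method, such as an
# opened RSelenium remoteDriver.
expand_all <- function(driver, class_name = "morePopularCities",
                       max_clicks = 50, pause = 1) {
  clicks <- 0
  while (clicks < max_clicks) {
    divs <- driver$findElements(using = "class name", value = class_name)
    if (length(divs) == 0) break   # nothing left to expand
    divs[[1]]$clickElement()
    clicks <- clicks + 1
    Sys.sleep(pause)               # crude wait for the new items to load
  }
  clicks
}

# Stand-in driver for illustration only: the div "disappears" after 3 clicks.
state <- new.env()
state$remaining <- 3
fake_driver <- list(
  findElements = function(using, value) {
    if (state$remaining == 0) return(list())
    list(list(clickElement = function() {
      state$remaining <- state$remaining - 1
    }))
  }
)

n_clicks <- expand_all(fake_driver, pause = 0)
print(n_clicks)
```

With a live session the call would simply be `expand_all(remDr)`. The fixed pause is the simplest option and may need tuning for a slow connection.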

I'm not fluent in R, so you may find my code example not pretty. Feel free to suggest improvements.
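One point worth spelling out: the extra destinations exist only in the browser's DOM after the clicks, so a fresh rvest request to the same URL would still return just the first ten. A sketch of one way to bridge the two packages is to hand the rendered page source from RSelenium to rvest. It assumes the expanded links still sit under `div.popularCities`, as in the question's XPath; `extract_destination_links` is a hypothetical name, and an inline HTML string stands in for the real page:

```r
library(rvest)

# Hypothetical helper: extract destination hrefs from rendered HTML text.
# The XPath is the one used in the question.
extract_destination_links <- function(page_html) {
  read_html(page_html) %>%
    html_nodes(xpath = '//div[@class="popularCities"]/a') %>%
    html_attr("href")
}

# Small inline page so the sketch runs without a browser.
sample_html <- '<html><body><div class="popularCities">
  <a href="/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html">Queenstown</a>
  <a href="/Tourism-g255106-Auckland_North_Island-Vacations.html">Auckland</a>
</div></body></html>'

links <- extract_destination_links(sample_html)
print(links)

# With a live RSelenium session, after the list has been expanded:
# links <- extract_destination_links(remDr$getPageSource()[[1]])
```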
