rvest package seems to collect data in random order


Question

I have the following problem.

I am trying to harvest data from the Booking website (for my own use only, in order to learn the functionality of the rvest package). Everything seems fine: the package appears to collect what I want and put it all in a table (data frame). Here's my code:

library(rvest)
library(lubridate)
library(tidyverse)

page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))

In this chunk I build the URLs for the first 60 result pages, after manually feeding the Booking search engine with the country of my choice (Spain), the dates I am interested in (an arbitrary interval), and the number of people (I used the defaults here).

Then I add this code to select the properties I want:

read_hotel <- function(url){  # collecting hotel names
  ho <- read_html(url)
  headline <- ho %>%
    html_nodes("span.sr-hotel__name") %>%  # the node I want to read
    html_text() %>%
    as_tibble()
} 

hotels <- map_dfr(page_booking, read_hotel)

read_pr <- function(url){    # collecting price tags
  pr <- read_html(url)
  full_pr <- pr %>%
    html_nodes("div.bui-price-display__value") %>%  # the node I want to read
    html_text() %>%
    as_tibble()
}

fullprice <- map_dfr(page_booking, read_pr)

... and eventually save all the data in a data frame:

dfr <- tibble(hotels = hotels,
             price_fact =  fullprice)

I collect more parameters, but that doesn't matter here. The final data frame of 1500 rows and two columns is then created. The problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my data frame useless. I don't really understand how the package works in the background and why it behaves that way. I also noticed that the first rows in the first column of the data frame (hotel name) do not correspond to the first hotels I see on the website, so rvest seems to be using different search/sort/filter criteria. Could you please explain what happens while rvest selects nodes? I would really appreciate at least some explanation, just to better understand the tool we work with.

Recommended answer

You shouldn't scrape the hotels' names and prices separately like that. Instead, get all the item nodes (one per hotel), then scrape the name and price relative to each hotel node. With this method, the order can't get mixed up.

library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))


hotels <- 
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      map_dfr(
        items,
        function(item) {
          data.frame(
            hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
            price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
          )
        }
      )
    }
  )

(The leading dot in each XPath expression refers to the current node, which is the hotel item.)
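
To see why the leading dot matters, here is a minimal sketch on a hand-written HTML fragment. The markup and hotel names are made up for illustration; only the class names are taken from the code above. As far as I know, in xml2 (which rvest builds on) an XPath starting with `//` searches the whole document even when it is called on a single node, whereas `.//` (or `./descendant::`) stays beneath the current node.

```r
library(rvest)

# Hypothetical markup: two hotel items, reusing the class names from above.
html <- '
<div class="sr_item"><span class="sr-hotel__name">Hotel A</span></div>
<div class="sr_item"><span class="sr-hotel__name">Hotel B</span></div>'

pg <- read_html(html)
items <- html_nodes(pg, ".sr_item")
second <- items[[2]]  # the node for the second hotel

# Absolute path: ignores `second` and returns the first match in the document.
absolute_name <- second %>%
  html_node(xpath = "//*[contains(@class,'sr-hotel__name')]") %>%
  html_text(trim = TRUE)

# Relative path: searches only inside `second`.
relative_name <- second %>%
  html_node(xpath = ".//*[contains(@class,'sr-hotel__name')]") %>%
  html_text(trim = TRUE)

print(c(absolute = absolute_name, relative = relative_name))
```

Dropping the dot would therefore pair every item with the first hotel on the page, which is another way to end up with mismatched rows.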

Update: here is a version that I think is faster and still does the job:

hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      data.frame(
        hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
        price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
      )
    }
  )
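
Why scraping the two columns separately can misalign them is easiest to see on a small example. The sketch below uses hand-written HTML with hypothetical hotels. The second one has no price element, as might happen for a sold-out property; that is an assumption about the page, not something confirmed in the question. Page-wide selection yields vectors of different lengths, while per-item selection yields one value per hotel, with `NA` where the price is missing.

```r
library(rvest)

# Hypothetical markup: three hotel items, the second without a price.
html <- '
<div class="sr_item">
  <span class="sr-hotel__name">Hotel A</span>
  <div class="bui-price-display__value">US$100</div>
</div>
<div class="sr_item">
  <span class="sr-hotel__name">Hotel B</span>
</div>
<div class="sr_item">
  <span class="sr-hotel__name">Hotel C</span>
  <div class="bui-price-display__value">US$300</div>
</div>'

pg <- read_html(html)

# Page-wide selection: 3 names but only 2 prices, so rows cannot be matched up.
all_names  <- pg %>% html_nodes("span.sr-hotel__name") %>% html_text(trim = TRUE)
all_prices <- pg %>% html_nodes("div.bui-price-display__value") %>% html_text(trim = TRUE)

# Per-item selection: html_node() returns exactly one result per item,
# and a missing node becomes NA after html_text().
items <- html_nodes(pg, ".sr_item")
hotels <- data.frame(
  hotel = items %>% html_node(".sr-hotel__name") %>% html_text(trim = TRUE),
  price = items %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

print(length(all_names))
print(length(all_prices))
print(hotels)
```

Because `html_node()` (singular) keeps one slot per item, the price for Hotel C stays on Hotel C's row instead of sliding up next to Hotel B.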
