How to get rid of the error while scraping web in R?


Question

I'm scraping this website and getting the error message "Tibble columns must have compatible sizes."
What should I do in this case?

library(rvest)
library(tidyverse)

url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
map_dfr(
  .x = url,
  .f = function(x) {
    tibble(
      url = x,
      place = read_html(x) %>%
        html_nodes("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
        html_attr("title"),
      price = read_html(x) %>%
        html_nodes("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
        html_text()
    )
  }
) -> df_zomato

Thanks in advance.

Answer

The problem is that not every restaurant has a complete record. In this example the 13th item on the list did not include a price, so the price vector had 14 items while the place vector had 15.

One way to solve this is to find the common parent nodes and then parse them with the html_node() function. html_node() always returns a value, even if it is NA, so the place and price vectors stay the same length.

library(rvest)
library(dplyr)
library(tibble)


url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
readpage <- function(url){
   #read the page once
   page <-read_html(url)

   #parse out the parent nodes
   results <- page %>% html_nodes("article.search-result")

   #retrieve the place and price from each parent
   place <- results %>% html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
      html_attr("title")
   price <- results %>% html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
      html_text()

   #return a tibble/data.frame
   tibble(url, place, price)
}

readpage(url)
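
As a side note, the difference between html_nodes() and html_node() inside each parent can be seen on a tiny self-contained snippet. This is only a minimal sketch with made-up HTML (the class names below are hypothetical, not the Zomato markup):

library(rvest)

#made-up listing markup: the second entry has no price element
doc <- read_html('
  <article class="listing"><a class="title" title="Place A"></a><span class="price">$10</span></article>
  <article class="listing"><a class="title" title="Place B"></a></article>
')

entries <- doc %>% html_nodes("article.listing")

#html_nodes() silently drops the missing price, so the vector is too short
entries %>% html_nodes("span.price") %>% html_text()
#> [1] "$10"

#html_node() returns one result per parent, filling the gap with NA
entries %>% html_node("span.price") %>% html_text()
#> [1] "$10" NA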

Also note that in your code example above you were reading the same page multiple times. This is slow and puts additional load on the server; it could even be viewed as a "denial of service" attack.
It is best to read the page into memory once and then work with that copy.

Update
To answer your question concerning multiple pages: wrap the above function in an lapply call and then bind the list of returned data frames (or tibbles).

dfs <- lapply(listofurls, function(url){ readpage(url)})
finalanswer <- bind_rows(dfs)
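
Note that listofurls is not defined above. Assuming pagination is controlled by the page query parameter, as in the URL used earlier, one way to build it might be (the range 1:5 here is an arbitrary example):

listofurls <- paste0("https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=", 1:5)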
