R: using rvest and purrr::map_df to build a data frame: how to deal with incomplete input

Question

I am scraping web pages with rvest and turning the collected data into a data frame using purrr::map_df. The problem is that not every page has content for each html_nodes selector I specify, and map_df drops such incomplete pages. I would like map_df to keep those pages and write NA wherever a selector matches nothing. Take the following code:

library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome", 
             "https://es.wikipedia.org/wiki/Curic%C3%B3")
h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})
out

The output is as follows:

> out
# A tibble: 2 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History

The problem here is that the output data frame contains no rows for pages that have no match for the #History node (in this case, the third URL). My desired output looks like this:

> out
# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA

Any help would be greatly appreciated!

Recommended answer

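You can do the check inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text() then returns character(0). tibble() recycles the length-one a against the length-zero b, which produces a zero-row tibble, so that page vanishes from the combined result. Check the lengths of a and b and substitute NA when they are empty: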

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()

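  # No match gives character(0); substitute NA so the page keeps its row.
  # Note that ifelse() returns a single value, so only the first match is kept.
  a <- ifelse(length(a) == 0, NA, a)
  b <- ifelse(length(b) == 0, NA, b)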

  df <- tibble(a, b)
})
out

# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA   
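
The same idea can be factored into a small helper so it does not have to be repeated for every selector. The following is only a minimal sketch, not part of the original answer: the helper name `text_or_na` and the two inline HTML pages are hypothetical stand-ins (used so the example runs without network access), and the helper keeps only the first match, as the `ifelse` version does.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-in pages; the second one has no #History node.
pages <- list(
  "<html><body><h1 id='firstHeading'>Page one</h1><span id='History'>History</span></body></html>",
  "<html><body><h1 id='firstHeading'>Page two</h1></body></html>"
) %>% map(read_html)

# Why the row disappears: a length-one column recycled against a
# length-zero column gives a tibble with no rows.
empty_rows <- nrow(tibble(a = "Page two", b = character(0)))

# Hypothetical helper: text of the first match, or NA_character_ if none.
text_or_na <- function(page, css) {
  txt <- html_nodes(page, css) %>% html_text()
  if (length(txt) == 0) NA_character_ else txt[[1]]
}

out <- pages %>% map_df(~ tibble(
  a = text_or_na(.x, "#firstHeading"),
  b = text_or_na(.x, "#History")
))
out
```

Using `NA_character_` rather than a bare `NA` keeps column b typed as character even when every page lacks a match.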
