R: using rvest and purrr::map_df to build a data frame: how to deal with incomplete input
Problem description
I am scraping web pages with rvest and turning the collected data into a data frame using purrr::map_df. The problem I ran into is that not every page has content for every html_nodes selector I specify, and map_df silently drops such incomplete pages. I would like map_df to keep those pages and write NA wherever an html_nodes call matches nothing. Take the following code:
library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome",
             "https://es.wikipedia.org/wiki/Curic%C3%B3")

h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  tibble(a, b)
})
out
The output is as follows:
> out
# A tibble: 2 x 2
a b
<chr> <chr>
1 FC Barcelona History
2 Rome History
The problem here is that the output data frame contains no row for pages that have no match for the #History node (in this case, the third URL). My desired output looks like this:
> out
# A tibble: 3 x 2
a b
<chr> <chr>
1 FC Barcelona History
2 Rome History
3 Curicó NA
Any help would be greatly appreciated!
Recommended answer
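You can handle this inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text then returns character(0); tibble() recycles the length-one a to length zero, so that page contributes no rows. Check the lengths of a and b and substitute NA when they are zero: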
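out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  # With a length-one test, ifelse() returns a single value:
  # NA when nothing matched, otherwise the first match.
  a <- ifelse(length(a) == 0, NA_character_, a)
  b <- ifelse(length(b) == 0, NA_character_, b)
  tibble(a, b)
})
out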
# A tibble: 3 x 2
a b
<chr> <chr>
1 FC Barcelona History
2 Rome History
3 Curicó NA
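
As a possible alternative, and assuming rvest 1.0 or later is installed, the singular html_element() returns a missing node when nothing matches, and html_text() turns that into NA, so no length check is needed. The sketch below is illustrative only: text_or_na is a hypothetical helper name, and the small in-memory pages built with minimal_html() stand in for the Wikipedia URLs from the question.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical helper: text of the first node matching `css`, or NA if none.
text_or_na <- function(page, css) {
  node <- html_element(page, css)  # singular form: exactly one result per page
  html_text(node)                  # a missing node yields NA_character_
}

# In-memory stand-ins for the scraped pages (assumption, not the real sites).
pages <- list(
  minimal_html('<h1 id="firstHeading">Rome</h1><span id="History">History</span>'),
  minimal_html('<h1 id="firstHeading">Curicó</h1>')
)

out <- map_df(pages, ~ tibble(
  a = text_or_na(.x, "#firstHeading"),
  b = text_or_na(.x, "#History")
))
out
```

Because each page always yields exactly one value per column, every page keeps its row, and the page without a #History node gets NA in column b.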