Using rvest and purrr::map_df to build a dataframe: dealing with multiple-element tags

Question

(building on my own question and its answer by @astrofunkswag here)

I am scraping web pages with rvest and turning the collected data into a dataframe using purrr::map_df. The problem is that, for HTML tags matching multiple elements, only the first element ends up in the result. Ideally, I would like all elements of a tag to be captured in the resulting dataframe, and the tags with fewer elements to be recycled.

Take the following code:

library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome")
h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., ".toctext") %>% html_text()

  a <- ifelse(length(a) == 0, NA, a)
  b <- ifelse(length(b) == 0, NA, b)

  df <- tibble(a, b)
})
out

which produces the following output:

> out
# A tibble: 2 x 2
  a            b        
  <chr>        <chr>    
1 FC Barcelona History  
2 Rome         Etymology
> 

This output is not what I want, because it includes only the first element of the tags corresponding to b. In the source web pages, the elements associated with b are the subtitles of the page. The desired output looks more or less like this:

  a            b        
  <chr>        <chr>    
1 FC Barcelona History  
2 FC Barcelona  1899–1922: Beginnings  
3 FC Barcelona 1923–1957: Rivera, Republic and Civil War  
.
.
6 Rome         Etymology
7 Rome         History
8 Rome         Earliest history
.
.
> 

Recommended answer

From ?ifelse:

ifelse returns a value with the same shape as test

For example, see

ifelse(FALSE, 20, 1:5)
#[1] 1

Because length(FALSE) is 1, the result also has length 1: only the first value of 1:5 is selected, which is 1.

Similarly, when you run

ifelse(length(a) == 0, NA, a)

the test length(a) == 0 is a single logical value, so length(length(a) == 0) is 1 and hence only the first value of a is returned.
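The truncation can be reproduced without any scraping. The following is a minimal sketch in which `a` is a made-up stand-in for what `html_text()` might return; the values are assumptions for illustration only.

```r
# Hypothetical stand-in for the character vector returned by html_text()
a <- c("History", "Etymology", "Geography")

# The test `length(a) == 0` has length 1, so ifelse() returns one value:
# the first element of `a`, and the rest is silently dropped
truncated <- ifelse(length(a) == 0, NA, a)
truncated

# With an empty vector the test is TRUE and a single NA comes back
empty <- ifelse(length(character(0)) == 0, NA, character(0))
empty
```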

In this case we can use if instead of ifelse, since there is only a single condition to check, and if/else returns the selected branch in full:

if(FALSE) 20 else 1:5 #returns
#[1] 1 2 3 4 5
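If the same check is needed for several selectors, it could be wrapped in a small helper. This is only a sketch: `or_na` is a hypothetical name introduced here, not a function from rvest or purrr.

```r
# Hypothetical helper: return NA for an empty vector, otherwise the
# whole vector unchanged (plain if/else, so nothing is truncated)
or_na <- function(x) if (length(x) == 0) NA else x

full <- or_na(c("History", "Etymology"))  # both elements are kept
none <- or_na(character(0))               # a single NA
```

Because `if` evaluates one scalar condition and returns the chosen branch as-is, the length of the result follows the input rather than the test.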

So this gives you the desired output:

library(tidyverse)
library(rvest)

h %>% map_df(~{
   a <- html_nodes(., "#firstHeading") %>% html_text()
   b <- html_nodes(., ".toctext") %>% html_text()
   a <- if (length(a) == 0) NA else a
   b <- if (length(b) == 0) NA else b
  tibble(a,b)
}) 


#    a            b                                        
#   <chr>        <chr>                                    
# 1 FC Barcelona History                                  
# 2 FC Barcelona 1899–1922: Beginnings                    
# 3 FC Barcelona 1923–1957: Rivera, Republic and Civil War
# 4 FC Barcelona 1957–1978: Club de Fútbol Barcelona      
# 5 FC Barcelona 1978–2000: Núñez and stabilization       
# 6 FC Barcelona The Dream Team era                       
# 7 FC Barcelona 2000–2008: Exit Núñez, enter Laporta     
# 8 FC Barcelona 2008–2012: Guardiola era                 
# 9 FC Barcelona 2014–present: Bartomeu era               
#10 FC Barcelona Support                                  
# … with 78 more rows
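Since the live Wikipedia pages change over time, the same pattern can be checked offline. The sketch below parses two made-up HTML snippets (the markup and page names are assumptions, not taken from the real pages), and uses `NA_character_` instead of `NA` so that both columns are always character; that substitution is a choice made here, not part of the answer above. It assumes dplyr is installed, which `map_df` needs for row-binding.

```r
library(rvest)
library(purrr)
library(tibble)

# Two made-up pages: the first has three .toctext entries, the second none
pages <- list(
  read_html('<html><body><h1 id="firstHeading">Page one</h1>
             <span class="toctext">History</span>
             <span class="toctext">Geography</span>
             <span class="toctext">Culture</span></body></html>'),
  read_html('<html><body><h1 id="firstHeading">Page two</h1></body></html>')
)

out <- pages %>% map_df(~{
  a <- html_nodes(.x, "#firstHeading") %>% html_text()
  b <- html_nodes(.x, ".toctext") %>% html_text()
  # if/else keeps every element; only an empty result becomes NA
  a <- if (length(a) == 0) NA_character_ else a
  b <- if (length(b) == 0) NA_character_ else b
  # tibble() recycles the length-1 heading across all rows of b
  tibble(a, b)
})
out
```

Note that `tibble()` only recycles vectors of length 1, so this works when one column has a single value per page; two selectors returning different lengths greater than 1 would raise an error instead.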
