rvest package - Is it possible for html_text() to store an NA value if it does not find an attribute?

Question

As the title states, I'm curious if it is possible for the html_text() function from the rvest package to store an NA value if it is not able to find an attribute on a specific page.

I'm currently running a scrape over 199 pages (which works fine; tested on a few variables already).

Currently, when I search for a value that is only present on some (136) of the 199 pages, html_text() returns a vector of only 136 strings. This is not useful, because without NAs I cannot determine which pages contained the variable in question.

I see that html_attr() accepts a default argument, but html_text() does not. Any tips?

Thank you so much!

Solution

If you create a new function to wrap error handling, it'll keep the %>% pipe cleaner and easier to grok for your future self and others:

library(rvest)

# Wrapper around html_text() that returns NA when the node set is empty
# or when html_text() raises an error.
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...), silent = TRUE)
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA_character_)
  }
  txt
}

base_url <- "http://www.saem.org/membership/services/residency-directory?RecordID=%d"

record_id <- c(1291, 1000, 1166, 1232, 999)

sapply(record_id, function(i) {
  # newer rvest versions use read_html() in place of html()
  read_html(sprintf(base_url, i)) %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text_na() %>%
    as.numeric()
})

## [1]  8 NA 10 27 NA

Also, because sapply iterates over the record_id vector, you automatically get back one result per page: the extracted value where it exists, and NA where it does not.
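If you want to check the behaviour without hitting the live site, here is a minimal offline sketch. The three HTML strings and the ".score" selector are made up for illustration (they are not from the SAEM pages), and the wrapper is the same idea as above with a short-circuit `||` and a silent `try()`.

```r
library(rvest)

# Same wrapper idea as in the answer: NA when nothing matched or on error.
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...), silent = TRUE)
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA_character_)
  }
  txt
}

# Hypothetical in-memory pages standing in for the scraped URLs;
# the second one deliberately lacks the ".score" element.
pages <- c(
  '<html><body><p class="score">8</p></body></html>',
  '<html><body><p class="other">no score here</p></body></html>',
  '<html><body><p class="score">10</p></body></html>'
)

scores <- sapply(pages, function(p) {
  read_html(p) %>%          # read_html() also accepts a literal HTML string
    html_nodes(".score") %>%
    html_text_na() %>%
    as.numeric()
}, USE.NAMES = FALSE)

# One entry per page, so the positions line up with the input vector.
missing_pages <- which(is.na(scores))
```

Since the result has the same length as the input, `missing_pages` tells you directly which pages lacked the value. Note that this assumes at most one match per page; if a selector can match several nodes, sapply will hand back a list rather than a plain vector.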
