rvest package - Is it possible for html_text() to store an NA value if it does not find an attribute?
As the title states, I'm curious whether the html_text() function from the rvest package can store an NA value when it cannot find an attribute on a specific page.
I'm currently running a scrape over 199 pages (which works fine; I have already tested it on a few variables).
Currently, when I search for a value that is present on only some (136) of the 199 pages, html_text() returns a vector of just 136 strings. This is not useful, because without NAs I cannot determine which pages contained the variable in question.
I see that html_attr() accepts a default argument, but html_text() does not. Any tips?
Thank you so much!
If you create a new function to wrap error handling, it'll keep the %>% pipe cleaner and easier to grok for your future self and others:
library(rvest)

# Wrap html_text() so that an error or an empty match yields NA
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...))
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA_character_)
  }
  txt
}

base_url <- "http://www.saem.org/membership/services/residency-directory?RecordID=%d"
record_id <- c(1291, 1000, 1166, 1232, 999)

sapply(record_id, function(i) {
  # read_html() replaces the deprecated html()
  read_html(sprintf(base_url, i)) %>%
    html_nodes("#drpict tr:nth-child(6) .text") %>%
    html_text_na() %>%
    as.numeric()
})
## [1] 8 NA 10 27 NA
Also, by doing an sapply over the vector of record_ids, you automagically get back a vector of whatever value you're trying to extract.
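To try the wrapper without hitting the network, here is a minimal sketch that runs the same pipeline over inline HTML strings. The pages vector and the td.text selector are made up for illustration and are not from the site above; the point is only that a page with no match contributes an NA, so the result stays aligned with the input.

```r
library(rvest)

# Same idea as the wrapper above: NA on error or on an empty result.
html_text_na <- function(x, ...) {
  txt <- try(html_text(x, ...), silent = TRUE)
  if (inherits(txt, "try-error") || length(txt) == 0) {
    return(NA_character_)
  }
  txt
}

# Hypothetical stand-ins for scraped pages; the second one lacks the target cell.
pages <- c(
  "<html><body><table><tr><td class='text'>8</td></tr></table></body></html>",
  "<html><body><p>no table here</p></body></html>",
  "<html><body><table><tr><td class='text'>10</td></tr></table></body></html>"
)

# One value per page: the page with no match yields NA instead of being dropped.
values <- sapply(pages, function(p) {
  read_html(p) %>%
    html_nodes("td.text") %>%
    html_text_na() %>%
    as.numeric()
}, USE.NAMES = FALSE)

values
```

Because the result has one entry per page, which(is.na(values)) identifies the pages that lacked the value. Note that this sketch assumes at most one match per page; if a selector matched several nodes on one page, sapply would return a list rather than a flat vector. As an alternative to the wrapper, html_node() (singular) is documented to return a missing node when nothing matches, which html_text() should turn into NA, but that is not what the answer above uses.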