rvest error: "Error in class(out) <- "XMLNodeSet" : attempt to set an attribute on NULL"


Question

I'm trying to scrape a set of web pages with the new rvest package. It works for most of the pages, but when a page has no table entries for a particular letter, an error is returned.

# install the packages you need, as appropriate
install.packages("devtools")
library(devtools)
install_github("hadley/rvest")
library(rvest)

This code works because the web page has entries for the letter E.

# works OK
url <- "https://www.propertytaxcard.com/ShopHillsborough/participants/alph/E"
pg <- html_session(url, user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))
pg %>% html_nodes(".sponsor-info .bold") %>% html_text()

This doesn't work because the web page has no entries for the letter F. The error message is: Error in class(out) <- "XMLNodeSet" : attempt to set an attribute on NULL

# yields error message
url <- "https://www.propertytaxcard.com/ShopHillsborough/participants/alph/F"
pg <- html_session(url, user_agent("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0"))   
pg %>% html_nodes(".sponsor-info .bold") %>% html_text()    

Any suggestions? Thanks in advance.

Recommended answer

You could always wrap the pg %>% html_nodes() %>% html_text() pipeline in try() and test the class of the result afterwards:

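# on failure, try() returns an object of class "try-error" instead of stopping
tmp <- try(pg %>% html_nodes(".sponsor-info .bold") %>% html_text(), silent=TRUE)

# html_text() returns a character vector when the pipeline succeeds
if (is.character(tmp)) {
  print("do stuff")
} else {
  print("do other stuff")
}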
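
A variant of the same idea can be sketched with base R's tryCatch(), which lets the failing case fall back to an empty character vector so that a loop over letters can treat every page uniformly. The helper name text_or_empty is hypothetical and not part of rvest, and the two calls below use stand-in expressions rather than real page requests so that the sketch runs without network access.

```r
# Hypothetical helper (not part of rvest): evaluate an expression and
# fall back to an empty character vector if it raises an error.
text_or_empty <- function(expr) {
  tryCatch(expr, error = function(e) character(0))
}

# Stand-ins for the scraping pipeline: one that succeeds, one that errors.
ok     <- text_or_empty(c("Acme", "Eagle Tire"))
failed <- text_or_empty(stop("attempt to set an attribute on NULL"))

length(ok)
length(failed)
```

In the scraping code, the pipeline itself would be passed in, for example text_or_empty(pg %>% html_nodes(".sponsor-info .bold") %>% html_text()).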

Another option is to use the XPath boolean() function and test that way:

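# content() is from httr and xpathApply() is from XML; load both if they are not attached
html_nodes_exist <- function(rvest_session, xpath) {
  # wrapping the expression in boolean() makes XPath return TRUE or FALSE
  xpathApply(content(rvest_session$response, as="parsed"),
             sprintf("boolean(%s)", xpath))
}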

pg %>% html_nodes_exist("//td[@class='sponsor-info']/span[@class='bold']")

This returns TRUE if those nodes exist and FALSE if they don't. The function still needs to be generalized to accept both session objects and parsed documents (class "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"), and to work with CSS selectors as well as XPath, but it's a way to avoid try.
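
The same existence check can also be sketched against a parsed document rather than a session. The example below swaps in the xml2 package, which the original answer does not use, and its xml_find_lgl() function; the helper name nodes_exist and the two HTML fragments are made up for illustration and only stand in for the "E" and "F" pages.

```r
library(xml2)  # assumption: xml2 is installed; the original answer uses XML

# Hypothetical helper: TRUE if `xpath` matches at least one node in `doc`.
nodes_exist <- function(doc, xpath) {
  xml_find_lgl(doc, sprintf("boolean(%s)", xpath))
}

# Made-up fragments standing in for a page with entries and one without.
page_e <- read_html("<table><tr><td class='sponsor-info'><span class='bold'>Eagle Tire</span></td></tr></table>")
page_f <- read_html("<table><tr><td>No participants</td></tr></table>")

xp <- "//td[@class='sponsor-info']/span[@class='bold']"
nodes_exist(page_e, xp)
nodes_exist(page_f, xp)
```

Taking a parsed document as the argument keeps the check independent of how the page was fetched, which is one way to approach the generalization mentioned above.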
