Rvest scraping errors


Question

This is the code I'm running:

library(rvest)

rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})

Up until this point it seems to work fine, but when I try to extract the text:

html_text(messages)

I get:

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

Trying to extract a specific element:

html_text(messages[1])

That doesn't work either:

Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) : 
  Unknown input of class: list

So I tried a different way:

html_text(messages[[1]])

This seems to at least get at the data, but is still not successful:

Error in UseMethod("xmlValue") : 
  no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"

How can I extract the text from each element of my list?

Recommended answer

There are two problems with your code. The package's own examples show how its functions are meant to be used.

1. You cannot apply every function to every kind of object.

  • html() is for downloading the content of a page
  • html_node() is for selecting node(s) from the downloaded content of a page
  • html_text() is for extracting text from a previously selected node

Therefore, to download one of your pages and extract the text of the HTML node, use this:

library(rvest)

Old-school style:

url          <- "https://github.com/rails/rails/pull/100"
url_content  <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text

...or this...

Hard-to-read old-school style:

url_mainnode_text  <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text

...or this...

magrittr piping style:

url_mainnode_text  <- 
  html("https://github.com/rails/rails/pull/100") %>%
  html_node("*") %>%
  html_text()
url_mainnode_text
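
The same three steps can be tried without network access. The following is a minimal sketch, assuming a current rvest release in which read_html() takes the place of html(). The inline HTML string and the p.message selector are made up for illustration and are not taken from the GitHub pages above.

```r
library(rvest)

# Hypothetical inline page, standing in for a downloaded document.
page <- read_html("<html><body><p class='message'>Hello</p><p>World</p></body></html>")

# Step 1 is the parse above; step 2 selects the first matching node.
node <- html_node(page, "p.message")

# Step 3 extracts the text of the selected node.
node_text <- html_text(node)
print(node_text)
```

Passing page itself to a selector function and only then to html_text() mirrors the download, select, extract order described in the list above.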

2. When working with a list, you have to apply the function to each of its elements, e.g. with lapply().

If you want to batch-process several URLs, you can try something like this:

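url_list <- c("https://github.com/rails/rails/pull/100",
              "https://github.com/rails/rails/pull/200",
              "https://github.com/rails/rails/pull/300")

# Download one page and return the text of the first node matching the selector.
# Note: newer rvest releases deprecate html() in favor of read_html().
get_html_text <- function(url, css_or_xpath = "*") {
  html_text(
    html_node(
      html(url), css_or_xpath   # fetch the page passed in as `url`
    )
  )
}

# Apply the helper to every URL; the result is a list with one entry per page.
lapply(url_list, get_html_text, css_or_xpath = "a[class=message]")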
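
The errors in the question come down to how R indexes lists, which can be seen with base R alone. In this sketch, the pages list and the describe() function are hypothetical stand-ins for the downloaded documents and for a function like html_text() that expects a single object rather than a list.

```r
# Hypothetical stand-ins for three downloaded documents.
pages <- list("page one", "page two", "page three")

# Hypothetical function that, like html_text(), refuses a list as input.
describe <- function(x) {
  if (is.list(x)) stop("Unknown input of class: list")
  toupper(x)
}

class(pages[1])    # single brackets return a list of length one
class(pages[[1]])  # double brackets return the element itself

# Passing the sub-list fails; capture the message instead of stopping.
single_bracket <- tryCatch(describe(pages[1]),
                           error = function(e) conditionMessage(e))

# Passing the element works, and lapply() does that for every element.
double_bracket <- describe(pages[[1]])
all_results <- lapply(pages, describe)
```

This is why html_text(messages) and html_text(messages[1]) both fail with the same message: in each case the function receives a list. messages[[1]] gets past that check and fails for a different reason, because a whole document is passed where a selected node is expected.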
