Handling error response to empty webpage from read_html

Question

I'm trying to scrape a web page title but am running into a problem with a website called "tweg.com":

library(httr)
library(rvest)
page.url <- "tweg.com"
page.get <- GET(page.url) # from httr
pg <- read_html(page.get) # from rvest
page.title <- html_nodes(pg, "title") %>% 
  html_text() # from rvest

read_html stops with an error message: "Error: Failed to parse text". Looking into page.get$content, I find that it is empty (raw(0)).

Certainly, I could write a simple check to take this into account and avoid parsing with read_html. However, I feel that a more elegant solution would be to get something back from read_html and then, based on it, return an empty page title (i.e., ""). I tried passing "options" to read_html, such as RECOVER, NOERROR and NOBLANKS, but with no success. Any ideas on how to get an "empty page" response back from read_html?

Recommended answer

You can use tryCatch to catch errors and return something in particular (just try(read_html('http://tweg.com'), silent = TRUE) will work if you only want to return the error and continue). You'll need to pass tryCatch a function specifying what to return when an error is caught, which you can structure as you like.

library(rvest)


tryCatch(read_html('http://tweg.com'), 
         error = function(e){'empty page'})    # just return "empty page"
#> [1] "empty page"

tryCatch(read_html('http://tweg.com'), 
         error = function(e){list(result = 'empty page', 
                                  error = e)})    # return error too
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
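
To get back to the original goal of an empty title (""), the tryCatch pattern above can be wrapped in a small helper. The following is only a minimal sketch and not part of the original answer: get_title is a hypothetical name, and it assumes that "" is an acceptable result both when parsing fails and when the page has no title node. The inputs are an inline HTML string and an empty raw vector (standing in for the empty response body from the question) so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper (an assumption, not from the original answer):
# returns the page title, or "" if the input cannot be parsed or
# contains no <title> node.
get_title <- function(x) {
  tryCatch({
    pg <- read_html(x)
    title <- html_text(html_nodes(pg, "title"))
    # html_nodes() yields an empty set when there is no <title>
    if (length(title) == 0) "" else title[[1]]
  }, error = function(e) "")  # any parsing error becomes an empty title
}

# Literal HTML containing a title
with_title <- get_title("<html><head><title>Example</title></head><body></body></html>")

# An empty body, like page.get$content in the question
empty_body <- get_title(raw(0))
```

Since read_html also accepts a URL, get_title('http://tweg.com') could be called the same way; what it returns depends on what the site serves at the time.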

The purrr package also contains two functions, possibly and safely, that do the same thing but accept more flexible function definitions. Note that they are adverbs: they return a function that still must be called, which is why the URL is in parentheses after the call.

library(purrr)

possibly(read_html, 'empty page')('http://tweg.com')
#> [1] "empty page"

safely(read_html, 'empty page')('http://tweg.com')
#> $result
#> [1] "empty page"
#> 
#> $error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>

A typical usage would be to map the resulting function across a vector of URLs:

c('http://tweg.com', 'http://wikipedia.org') %>% 
    map(safely(read_html, 'empty page'))
#> [[1]]
#> [[1]]$result
#> [1] "empty page"
#> 
#> [[1]]$error
#> <Rcpp::exception in eval(substitute(expr), envir, enclos): Failed to parse text>
#> 
#> 
#> [[2]]
#> [[2]]$result
#> {xml_document}
#> <html lang="mul" dir="ltr" class="no-js">
#> [1] <head>\n  <meta charset="utf-8"/>\n  <title>Wikipedia</title>\n  <me ...
#> [2] <body id="www-wikipedia-org">\n<h1 class="central-textlogo" style="f ...
#> 
#> [[2]]$error
#> NULL
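
Each element produced by safely is a list with result and error components, so the titles still have to be pulled out afterwards. A rough sketch of one way to do that follows. The names safe_read and title_or_empty are hypothetical, otherwise = NULL is used here instead of 'empty page' purely so that failures are easy to detect, and the inputs are an empty raw vector and an inline HTML string standing in for the two URLs so the example runs offline.

```r
library(rvest)
library(purrr)

# safely() returns a wrapped function; on error its result is `otherwise`
safe_read <- safely(read_html, otherwise = NULL)

# Hypothetical helper: turn one safely() result into a title string,
# using "" when parsing failed or no <title> node exists.
title_or_empty <- function(res) {
  if (is.null(res$result)) return("")
  title <- html_text(html_nodes(res$result, "title"))
  if (length(title) == 0) "" else title[[1]]
}

# Stand-ins for the URLs: an empty body and a small valid page
inputs <- list(
  raw(0),
  "<html><head><title>Wikipedia</title></head><body></body></html>"
)

results <- map(inputs, safe_read)            # list of list(result, error)
titles  <- map_chr(results, title_or_empty)  # character vector of titles
```

With real URLs, the inputs list would simply be replaced by the character vector of addresses; the error components in results remain available for logging which pages failed.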
