Encoding error with read_html

Question

I am trying to scrape a web page. I thought of using the rvest package. However, I'm stuck at the first step, which is to use read_html to read the content. Here's my code:

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                        encoding = "ISO-8895-1")

I get the following error:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]

I tried using what similar questions had as answers, but it did not solve my issue:

obra_caridade <- read_html(iconv(url, to = "UTF-8"),
                        encoding = "UTF-8")

obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
                        encoding = "ISO-8895-1")

Both attempts returned a similar error. Does anyone have any suggestions about how to solve this issue? Here's my session info:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rvest_0.3.2 xml2_1.1.1 

loaded via a namespace (and not attached):
[1] httr_1.2.1   magrittr_1.5 R6_2.2.1     tools_3.3.1  curl_2.6     Rcpp_0.12.11

Recommended answer

What's the issue?

Your issue here is in correctly determining the encoding of the webpage.

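The good news
Your approach looks good to me, since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to know the encoding rather than having to guess.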

The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
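
A minimal sketch of that check, using base R only. The helper name is_supported_encoding is hypothetical, and the case-insensitive comparison is an assumption about how encoding names are usually matched:

```r
# Hypothetical helper: is this encoding name in the list reported by iconvlist()?
# Comparing in upper case is an assumption, since encoding names are
# commonly treated as case-insensitive.
is_supported_encoding <- function(name) {
  toupper(name) %in% toupper(iconvlist())
}

# The name from the page's Meta charset, and the likely intended spelling
is_supported_encoding("ISO-8895-1")
is_supported_encoding("ISO-8859-1")
```

Running a check like this before calling read_html is one way to catch a mistyped encoding name early, given that no warning is raised.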

Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
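
The bytes quoted in the error message are consistent with this. As a small offline sketch (base R only, no page download), the four bytes 0xE3 0x6F 0x20 0x65 are not valid UTF-8, but they decode cleanly when treated as latin1:

```r
# The four bytes reported in the error message from the question
error_bytes <- as.raw(c(0xE3, 0x6F, 0x20, 0x65))

# Not valid UTF-8: 0xE3 announces a multi-byte sequence, but 0x6F is a plain "o"
is_valid_utf8 <- validUTF8(rawToChar(error_bytes))

# Interpreted as latin1 and converted to UTF-8, the bytes become readable text
text_latin1 <- iconv(rawToChar(error_bytes), from = "latin1", to = "UTF-8")
print(text_latin1)  # "ão e"
```

This is the kind of fragment you would expect in Portuguese text, which supports reading the page as latin1.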

What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).

Guessing with R

  1. rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
    One thing to note about this function is it can only detect from 30 of the more common encodings, but there are many more possibilities.

  2. As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
    Sample code can be found at the bottom of this answer.
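
The shortlisting idea in point 2 can also be tried offline on a handful of bytes before running it against the live page. The sketch below is an illustration under assumptions: sample_bytes is a hand-written sample of how "Situação:" would be stored in an ISO-8859-1 page, and the candidates vector is an arbitrary short list rather than the full output of iconvlist().

```r
# Assumed sample: the bytes of "Situação:" as stored in an ISO-8859-1 page
sample_bytes <- as.raw(c(0x53, 0x69, 0x74, 0x75, 0x61, 0xE7, 0xE3, 0x6F, 0x3A))
expected <- "Situa\u00e7\u00e3o:"

# Arbitrary short list of candidates (the full list would come from iconvlist())
candidates <- c("UTF-8", "ISO-8859-1", "latin1", "ISO-8859-15", "windows-1252", "ASCII")

# Keep the encodings under which the bytes decode to the expected phrase
decodes_as_expected <- vapply(candidates, function(enc) {
  decoded <- tryCatch(
    iconv(rawToChar(sample_bytes), from = enc, to = "UTF-8"),
    error = function(e) NA_character_  # encoding name unknown on this platform
  )
  !is.na(decoded) && identical(decoded, expected)
}, logical(1))

encoding_shortlist <- names(decodes_as_expected)[decodes_as_expected]
print(encoding_shortlist)
```

Several encodings can survive this test because they place these particular characters at the same byte values, which is why the result is a shortlist rather than a single answer.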

Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.

The page url contains a .br extension indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.

Sample code for 'Guessing with R' point 2 (using iconvlist()):

library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
encodings <- unique(iconvlist())
read_page <- lapply(seq_along(encodings), function(i) {

  # Optional print statement to show progress to 1 since this can take some time
  print(i / length(encodings))

  # Failed attempts yield NA (error) or NULL (warning, after printing its message)
  tryCatch(expr = read_html(url, encoding = encodings[i]),
           error = function(condition) NA,
           warning = function(condition) message(conditionMessage(condition)))
})

names(read_page) <- encodings

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page) {
  # Only successfully parsed pages are xml_document objects; skip the rest
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt"))
})

# We've ended up with 27 encodings which could be sensible...
is_match <- vapply(read_phrase,
                   function(phrase) identical(phrase, "Situação:"),
                   logical(1))
encoding_shortlist <- names(read_phrase)[is_match]
