Error in curlMultiPerform(multiHandle): embedded nul in string

Problem description

I'm trying to download a vector of links, but I get an error message that I don't know how to handle. The code is below; I'm hoping someone has a workaround.

CODE:

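library(RCurl)
library(XML)
# Parse the ETF listing page and collect the link targets from its tables,
# dropping the first three links.
url <- "http://www.etfs.bmo.com/bmo-etfs/"
url.parsed <- htmlParse(url)
links <- xpathSApply(url.parsed, "//table//td/a/@href")[-c(1:3)]
# Turn the relative links into absolute URLs, then fetch them all at once.
links <- paste0("http://www.etfs.bmo.com", links)
pages <- getURI(links)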

ERROR MESSAGE:

Error in curlMultiPerform(multiHandle) : 
  embedded nul in string: '         \r\n                            </nobr>\r\n                        </td>\r\n\t\t\t        </tr>\r\n\t\t\t        \r\n\t\t\t\t\t        \r\n\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t        \r\n\t\t\t\t        \t<tr valign="top" >\r\n\t                \t\t\t\t<td class="highlightText"><strong>Annualized Distribution Yield \r\n\t\t                       \t\t\t\r\n\t\t   \t               \t\t\t\t\r\n\t\t    \t            \t\t\t(Jul 07, 2016)\r\n\t\t           \t       \t\t\t\t\r\n\t\t               \t       \t\t\t \r\n\t\t               \t\t\t\t\t<sup>1</sup></strong>\r\n\t\t               \t\t\t\t</td>\r\n\t\t\t            \t\t<td>\r\n                            \t\t<nobr>\r\n   \t                            \t\t\r\n    \t                        \t\t\t\r\n                \t            \t\t\t\r\n\t\t\t        \t         \t\t\t\t2.41%\r\n                        \t    \t\t\t\r\n                           \t\t\t\t \r\n    \t                        \t</nobr>\r\


Recommended answer

OK, this took a while, but I think I've figured it out.

It turns out that the webpage is improperly encoded. It claims to be "ISO-8859-1", but some pages contain a trademark symbol encoded as \x99, which means the site is probably really using the "Windows-1252" code page. This byte, which lies outside the normal ASCII range, kicks off multi-byte character reading, and the file quickly becomes garbled.
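To see why the declared encoding matters, here is a minimal base R sketch that decodes the same bytes both ways. The sample bytes ("BMO" followed by \x99) are made up for illustration and are not taken from the actual pages, and the encoding names are assumed to be ones your platform's iconv accepts.

```r
# Hypothetical sample: the letters "BMO" followed by the byte 0x99.
bytes <- as.raw(c(0x42, 0x4d, 0x4f, 0x99))
txt <- rawToChar(bytes)

# Decode the same bytes under each candidate encoding.
as_latin1 <- iconv(txt, from = "ISO-8859-1", to = "UTF-8")
as_cp1252 <- iconv(txt, from = "WINDOWS-1252", to = "UTF-8")

# Compare the code point that the last character ends up as.
cp_latin1 <- tail(utf8ToInt(as_latin1), 1)
cp_cp1252 <- tail(utf8ToInt(as_cp1252), 1)
print(c(latin1 = cp_latin1, cp1252 = cp_cp1252))
```

Under Windows-1252 the last character should come out as U+2122 (the trademark sign, 8482 in decimal), while ISO-8859-1 maps the same byte to the control character U+0099 (153).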

As far as I can tell, RCurl does not support this encoding natively. You can still download each file as binary data and then convert it with iconv, which has more conversion options. This should work:

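# Download each page as raw bytes so RCurl does not try to decode it.
raw <- lapply(links, getURLContent, binary = TRUE)
# Read the bytes as a string, then convert from Windows-1252 to UTF-8.
pages <- lapply(lapply(raw, readBin, "character"),
    iconv, from = "WINDOWS-1252", to = "UTF-8")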

I tested this on my Mac. The exact from/to strings may vary by platform. If this does not work on your machine, check the list returned by iconvlist() for a possible replacement for the from= value.
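One way to look for a usable name, sketched in base R, is to search the output of iconvlist() for entries that mention 1252. The variable name candidates is only illustrative, and what comes back depends on the platform's iconv implementation.

```r
# List the encoding names this platform's iconv knows that mention "1252".
candidates <- grep("1252", iconvlist(), value = TRUE)
print(candidates)

# Any name printed above can be tried as the from= argument, for example:
# iconv(x, from = candidates[1], to = "UTF-8")
```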
