Strange characters: interaction of R and Windows locale?
Question
WinXP-x32, R-2.13.0
Dear list,
I have a problem that (I think) relates to the interaction between Windows and R.
I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:
library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]
The output is (first set of columns):
> Islands
  Island Nickname Location
1 Hawaiʻi[7] The Big Island 19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8] The Valley Isle 20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9] The Target Isle 20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10] The Pineapple Isle 20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 Molokaʻi[11] The Friendly Isle 21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 Oʻahu[12] The Gathering Place 21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 Kauaʻi[13] The Garden Isle 22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 Niʻihau[14] The Forbidden Isle 21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167
As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.
sessionInfo() gives:
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.2-0.2
I have also attempted to let R use another setting by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
OS reports request to set locale to "en_US.UTF-8" cannot be honored
In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.
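For readers retracing these attempts, a minimal base-R sketch of how one might check what R considers its native character encoding, and whether the operating system accepts a given locale name, is shown here. The helper name try_locale is hypothetical and only for illustration; whether "en_US.UTF-8" is accepted depends entirely on the platform, as the warning above shows for Windows.

```r
# Report the character-type locale and whether R treats the session as UTF-8.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()
cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 session:", info[["UTF-8"]], "\n")

# Hypothetical helper (not part of R): try a locale name and report whether
# the OS accepted it. Sys.setlocale() returns "" with a warning on failure.
try_locale <- function(name) {
  old <- Sys.getlocale("LC_CTYPE")
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", name))
  ok <- nzchar(res)
  invisible(Sys.setlocale("LC_CTYPE", old))  # restore the previous setting
  ok
}

accepted <- try_locale("en_US.UTF-8")
cat("en_US.UTF-8 accepted:", accepted, "\n")
```

If the second line of output is FALSE, strings without an explicit encoding mark are interpreted in the native code page (1252 in the session shown above).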
I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
Is there a way to make R override the Windows settings, or can the issue be solved otherwise? I have also tried other websites, and the issue occurs whenever there is an é, ü, ä, î, et cetera in the text to be scraped.
Thank you, Roger
Recommended answer
A not-quite-complete answer:
If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)
iconv(Islands$Island, "windows-1252", "UTF-8")
Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
It is possible to simply strip out the offending characters, though this isn't ideal.
iconv(Islands$Island, "windows-1252", "ASCII", "")
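To make the mechanism concrete, here is a small self-contained sketch in base R. It uses a made-up string ("café" built from raw UTF-8 bytes) rather than the scraped table, so it only illustrates how this kind of garbling arises and what stripping does. It is not a confirmed fix for the Islands data. The variable names are illustrative, and marking the encoding with Encoding() is an assumption about what might help, not something the question or answer tested.

```r
# "café" as UTF-8 bytes, built from raw values so the example does not
# depend on the encoding of this script file. The string starts unmarked.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))

# Treating the UTF-8 bytes as windows-1252 turns the two bytes of the
# accented letter into two separate characters (the "weird" output).
garbled <- iconv(x, from = "windows-1252", to = "UTF-8")

# Declaring the bytes to be UTF-8 changes no bytes; it only tells R how
# to interpret them.
fixed <- x
Encoding(fixed) <- "UTF-8"

# The fallback from the answer: drop whatever ASCII cannot represent.
stripped <- iconv(fixed, from = "UTF-8", to = "ASCII", sub = "")

print(nchar(garbled))  # one character more than the intended word
print(nchar(fixed))
print(stripped)
```

The stripped result loses the accented letter entirely, which is why that route is described above as not ideal.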