Strange characters: interaction of R and Windows locale?


Problem description

WinXP-x32, R-2.13.0

Dear list,

I have a problem that (I think) relates to the interaction between Windows and R.

I am trying to scrape a table with data on the Hawai'ian Islands. This is my R code:

library(XML)
u <- "http://en.wikipedia.org/wiki/Hawaii"
tables <- readHTMLTable(u)
Islands <- tables[[5]]

The output is (first set of columns):

> Islands
  Island          Nickname             Location
1 HawaiÊ»i[7]     The Big Island       19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2 Maui[8]         The Valley Isle      20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 KahoÊ»olawe[9]  The Target Isle      20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4 LÄnaÊ»i[10]     The Pineapple Isle   20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5 MolokaÊ»i[11]   The Friendly Isle    21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6 OÊ»ahu[12]      The Gathering Place  21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7 KauaÊ»i[13]     The Garden Isle      22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8 NiÊ»ihau[14]    The Forbidden Isle   21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167

As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8") but that didn't help.

It seems to me that there may be an issue with the interaction of the Windows settings of the character set and R.

sessionInfo() gives

> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252  

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] XML_3.2-0.2

I have also attempted to let R use another setting by entering: Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:

> Sys.setlocale("LC_ALL", "en_US.UTF-8")
[1] ""
Warning message:
In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored

In addition, I have attempted to make the change directly from the Windows command prompt, using chcp 65001 and variations of that, but that didn't change anything.

I noticed from searching the web that others have the issue as well, but I have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.

Is there a way to make R override the Windows settings, or can the issue be solved otherwise? I have also tried other websites, and the issue occurs every time there is an é, ü, ä, î, et cetera in the text to be scraped.

Thank you, Roger

Solution

Not quite an answer:

If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.
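
The effect described above can be reproduced in a few lines of base R. This is a minimal sketch, assuming the page is served as UTF-8 and the garbling comes from reading those bytes as Windows-1252; the variable names are illustrative only.

```r
# The okina (U+02BB) used in the island names, encoded as UTF-8 bytes.
okina_bytes <- charToRaw(enc2utf8("\u02bb"))
print(okina_bytes)  # two bytes: ca bb

# Reinterpret the same two bytes as Windows-1252 text
# (the assumed cause of the garbling, not a confirmed diagnosis).
misread <- iconv(rawToChar(okina_bytes), from = "windows-1252", to = "UTF-8")

# Code points of the misread text: 202 (U+00CA) and 187 (U+00BB),
# which are the two characters seen in the garbled output.
print(utf8ToInt(misread))
```

If this assumption holds, the text arriving in R is intact UTF-8 and only its interpretation is wrong.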

#Convert factors to character
Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)

iconv(Islands$Island, "windows-1252", "UTF-8")

Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
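
One conversion that may be worth trying goes the other way. The call above treats the strings as Windows-1252 and re-encodes them, which would garble text that is already UTF-8 a second time. If the strings really hold UTF-8 bytes that R merely assumes to be in the native code page (an assumption, not something the sessionInfo() output confirms), declaring the encoding with Encoding() leaves the bytes untouched and only changes how R interprets them. The helper name mark_utf8 and the simulated data below are hypothetical.

```r
# Hypothetical helper: declare every text column of a data frame to be UTF-8.
mark_utf8 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) col <- as.character(col)      # factors -> character
    if (is.character(col)) Encoding(col) <- "UTF-8"   # relabel only, bytes unchanged
    col
  })
  df
}

# Simulated input: the UTF-8 bytes of an island name carrying no encoding mark.
raw_name <- rawToChar(charToRaw(enc2utf8("Hawai\u02bbi")))
Islands_demo <- data.frame(Island = raw_name, stringsAsFactors = FALSE)

fixed <- mark_utf8(Islands_demo)
print(Encoding(fixed$Island))
```

On the scraped table this would be called as mark_utf8(Islands). Whether the console then displays the okina correctly still depends on the font and the R front end.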

It is possible to simply strip out the offending characters, though this isn't ideal.

iconv(Islands$Island, "windows-1252", "ASCII", "")
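
To see what the fourth argument does, here is a small sketch on a simulated garbled name. The fourth positional argument of iconv() is sub, the replacement for bytes that cannot be converted; an empty string drops them. The input string is made up for illustration.

```r
# Simulated input: the bytes 0xCA 0xBB sitting inside an otherwise ASCII name
# ("Hawai" + 0xCA 0xBB + "i").
garbled <- rawToChar(as.raw(c(0x48, 0x61, 0x77, 0x61, 0x69, 0xCA, 0xBB, 0x69)))

# Neither byte has an ASCII equivalent, so each is replaced by sub = "".
stripped <- iconv(garbled, from = "windows-1252", to = "ASCII", sub = "")
print(stripped)
```

Applied to a name such as Lānaʻi, the same call would also drop the letter ā entirely, which is why this approach is not ideal.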
