UTF-8 characters get lost when converting from list to data.frame in R

Question

I am using R 3.2.0 with RStudio 0.98.1103 on Windows 7 64-bit. The Windows "regional and language settings" of my computer are set to English (United States).

For some reason, the following code replaces the Czech characters "č" and "ř" with "c" and "r" in the text "Koryčany nad přehradou" when I read a UTF-8 encoded XML file from the web, parse the XML file to a list, and convert the list to a data.frame.

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

#this still displays correctly "Koryčany nad přehradou"
print(siteName) 

#make a data.frame from the list item. I suspect here is the problem.
df <- data.frame(name=siteName, id=1)

#now the Czech characters are lost. I see only "Korycany nad prehradou"
View(df) 

write.csv(df,"test.csv")
#the test.csv file also contains "Korycany nad prehradou" 
#instead of "Koryčany nad přehradou"

What is the problem? How do I make R display my data.frame correctly with all the UTF-8 special characters, and save the .csv file without losing the Czech characters "č" and "ř"?

Recommended answer

This is not a perfect answer, but the following workaround solved the problem for me. I tried to understand the behavior of R and built the example so that my R script produces the same results on both Windows and Linux:

(1) Get XML data in UTF-8 from the Internet

library(XML)
url <- "http://hydrodata.info/chmi-h/cuahsi_1_1.asmx/GetSiteInfoObject?site=CHMI-H:1263&authToken="
doc <- xmlRoot(xmlTreeParse(url, getDTD=FALSE, useInternalNodes = TRUE))
infoList <- xmlToList(doc[[2]][[1]])
siteName <- infoList$siteName

(2) Print out the text from the Internet. The encoding is UTF-8, and the display in the R console is correct under both the Czech and the English locale on Windows:

> Sys.getlocale(category="LC_CTYPE")
[1] "English_United States.1252"
> print(siteName)
[1] "Koryčany nad přehradou"
> Encoding(siteName)
[1] "UTF-8"
> 

(3) Try to create and view a data.frame. This is where the problem appears: the data.frame displays incorrectly both in the RStudio viewer and in the console:

df <- data.frame(name=siteName, id=1)
df
                    name id
1 Korycany nad prehradou  1
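
One way to check whether the characters are really gone or only displayed incorrectly is to inspect the stored string instead of the printed output. The following is a minimal sketch, not part of the original workaround. The names site_name and df_check are illustrative stand-ins, and the string is built from Unicode escapes so the example does not depend on the web service being reachable.

```r
# Hypothetical stand-in for siteName; \u escapes yield a string marked as UTF-8.
site_name <- "Kory\u010dany nad p\u0159ehradou"

# stringsAsFactors = FALSE keeps the column as character
# (the default was TRUE before R 4.0).
df_check <- data.frame(name = site_name, id = 1, stringsAsFactors = FALSE)

# Declared encoding of the string stored inside the data.frame.
stored_encoding <- Encoding(df_check$name)

# Unicode code points of the stored string:
# 269 is U+010D (c with caron), 345 is U+0159 (r with caron).
stored_points <- utf8ToInt(df_check$name)

print(stored_encoding)
print(c(269L, 345L) %in% stored_points)
```

If both code points are still present, the data.frame holds the original text and only its printed representation has been converted, which would be consistent with the matrix displaying correctly in the next step.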

(4) Try to use a matrix instead. Surprisingly, the matrix displays correctly in the R console:

m <- as.matrix(df)
View(m)  #this shows incorrectly in RStudio
m        #however, this shows correctly in the R console.
     name                     id 
[1,] "Koryčany nad přehradou" "1"

(5) Change the locale. On Windows, set the locale to Czech; on Unix or Mac, set it to UTF-8. NOTE: This causes some problems when I run the script in RStudio; apparently RStudio doesn't always react immediately to the Sys.setlocale command.

#remember the original locale.
original.locale <- Sys.getlocale(category="LC_CTYPE")

#for Windows set locale to Czech. Otherwise set locale to UTF-8
new.locale <- ifelse(.Platform$OS.type=="windows", "Czech_Czech Republic.1250", "en_US.UTF-8")
Sys.setlocale("LC_CTYPE", new.locale) 

(7) Write the data to a text file. IMPORTANT: don't use write.csv; use write.table instead. When my locale is Czech on my English Windows, I must pass fileEncoding="UTF-8" to write.table. The text file then shows up correctly in Notepad++ and also in Excel.

write.table(m, "test-czech-utf8.txt", sep="\t", fileEncoding="UTF-8")
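
An alternative that does not depend on the session locale is to encode the text as UTF-8 explicitly and write the bytes unchanged. This is a minimal sketch and not from the original answer: site_name, out_lines and out_file are illustrative names, and the tab-separated layout is assembled by hand here instead of by write.table.

```r
# Hypothetical stand-in for siteName, built from Unicode escapes.
site_name <- "Kory\u010dany nad p\u0159ehradou"

# Assemble a header line and one tab-separated data line, forcing UTF-8.
out_lines <- c("name\tid", paste(enc2utf8(site_name), 1, sep = "\t"))

# Open the file in binary mode and write the strings byte for byte;
# useBytes = TRUE stops R from re-encoding them to the native code page.
out_file <- tempfile(fileext = ".txt")
con <- file(out_file, open = "wb")
writeLines(out_lines, con, useBytes = TRUE)
close(con)

# Inspect the raw bytes: in UTF-8, U+010D is encoded as c4 8d.
file_bytes <- readBin(out_file, what = "raw", n = file.size(out_file))
has_c_caron <- any(head(file_bytes, -1) == as.raw(0xc4) &
                   tail(file_bytes, -1) == as.raw(0x8d))
print(has_c_caron)
```

Checking the raw bytes is a way to confirm what actually landed on disk, independent of how an editor or console chooses to display it.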

(8) Set the locale back to the original

Sys.setlocale("LC_CTYPE", original.locale)
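
Steps (5) and (8) can be combined so that the original locale is restored even if the code in between fails. The following is a minimal sketch under that assumption; with_ctype_locale is a hypothetical helper name, not a function from base R or from the original answer.

```r
# Hypothetical helper: evaluate `code` under a temporary LC_CTYPE locale.
with_ctype_locale <- function(locale, code) {
  old_locale <- Sys.getlocale("LC_CTYPE")
  # Restore the previous locale when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old_locale), add = TRUE)

  # Sys.setlocale returns "" (with a warning) if the locale is unavailable.
  applied <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(applied, "")) {
    message("Locale '", locale, "' is not available; keeping ", old_locale)
  }

  # `code` is a lazily evaluated argument, so it runs only here,
  # after the locale has been switched.
  force(code)
}

# Usage: same locale choice as in step (5).
new_locale <- if (.Platform$OS.type == "windows") {
  "Czech_Czech Republic.1250"
} else {
  "en_US.UTF-8"
}
ctype_inside <- with_ctype_locale(new_locale, Sys.getlocale("LC_CTYPE"))
print(ctype_inside)
```

Because the restore happens in on.exit, a failing write.table call inside the helper would not leave the session stuck in the Czech locale.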

(9) Try to read the text file back into R. NOTE: When reading the file, I had to set the encoding parameter (NOT fileEncoding!). The display of a data.frame read from the file is still incorrect, but when I convert the data.frame to a matrix, the Czech UTF-8 characters are preserved:

data.from.file <- read.table("test-czech-utf8.txt", sep="\t", encoding="UTF-8")
#the data.frame still has the display problem, "č" and "ř" get "lost"
data.from.file
                     name id
1 Korycany nad prehradou  1

#see if a matrix displays correctly: YES it does!
matrix.from.file <- as.matrix(data.from.file)
matrix.from.file
  name                     id 
1 "Koryčany nad přehradou" "1"
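
The read-back in step (9) can be tried without the original file. The following is a minimal sketch, assuming a small UTF-8 file created on the spot; utf8_file and from_file are illustrative names. It relies on the documented behavior that the encoding argument of read.table only declares the strings to be UTF-8 and does not re-encode them.

```r
# Create a small tab-separated file containing UTF-8 bytes.
utf8_file <- tempfile(fileext = ".txt")
con <- file(utf8_file, open = "wb")
writeLines(c("name\tid", "Kory\u010dany nad p\u0159ehradou\t1"),
           con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" marks the strings that are read as UTF-8.
from_file <- read.table(utf8_file, sep = "\t", header = TRUE,
                        encoding = "UTF-8", stringsAsFactors = FALSE)

# Check the stored code points instead of the printed output:
# 269 is U+010D (c with caron), 345 is U+0159 (r with caron).
read_points <- utf8ToInt(from_file$name)
print(c(269L, 345L) %in% read_points)
```

As with the data.frame in step (3), the printed output on a Windows session with a non-UTF-8 locale may still show plain "c" and "r" even when the stored code points are intact.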

So the lesson learnt is that I need to convert my data.frame to a matrix and set my locale to Czech (on Windows) or to UTF-8 (on Mac and Linux) before I write my data with Czech characters to a file. When I write the file, I must make sure fileEncoding is set to "UTF-8". On the other hand, when I later read the file, I can keep working in the English locale, but in read.table I must set encoding="UTF-8".

If anybody has a better solution, I welcome your suggestions.
