How to read csv data with unknown encoding in R
Problem description
I have a .csv data file that I can view from a web page, but when I read it into R, some of the data cannot be displayed. The data is available here: home.ustc.edu.cn/~lanrr/data.csv
mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", header = T)
View(mydata)
# show something like this:
#   9:39:37 665 600160 ���ɷ� ���� ���� 8.050 100 805.00 ��ȯ �ɽ� ��ȯ���� E004017669 665
# 2 9:39:38 697 930 �������� ���� ���� 4.360 283 1233.88 ���� �ɽ� ����Ʒ���� 680001369 697
The data contains some Chinese characters, but I don't know whether I need to change the encoding or do something else. Has anyone run into this problem before?
mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv",
                  encoding = "UTF-8", header = T, stringsAsFactors = F)
View(mydata)
#   9:39:37 665 600160 <U+00BE><U+07BB><U+00AF><U+00B9><U+0277><dd> <c2><f4> <U+00B3><f6> <c2><f2><c2><f4> 8.050 100 805.00 <c8><da><U+022F> <U+00B3><U+027D><U+00BB> <c8><da><U+022F><c2><f4><U+00B3><f6> E004017669 665
# 2 9:39:38 697 930 <d6><d0><U+0078><c9><fa><U+00BB><U+00AF> <c2><f4> <U+00B3><f6> <c2><f2><c2><f4> 4.360 283 1233.88 <d0><c5><d3><c3> <U+00B3><U+027D><U+00BB> <U+00B5><U+00A3><U+00B1><U+00A3><U+01B7><c2><f4><U+00B3> <f6> 680001369 697

sessionInfo()
# R version 2.15.2 (2012-10-26)
# Platform: x86_64-redhat-linux-gnu (64-bit)
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] compiler stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.8.8 TTR_0.22-0 xts_0.9-3 zoo_1.7-9 timeDate_2160.97 Matrix_1.0-9 lattice_0.20-10
#
# loaded via a namespace (and not attached):
# [1] grid_2.15.2 tools_2.15.2
In the end, I did it this way:
Sys.setlocale("LC_COLLATE", "Chinese")
Sys.setlocale("LC_CTYPE", "Chinese")
Sys.setlocale("LC_MONETARY", "Chinese")
Sys.setlocale("LC_TIME", "Chinese")
Sys.setlocale("LC_MESSAGES", "Chinese")
Sys.setlocale("LC_MEASUREMENT", "Chinese")
Solution
First, that csv file is encoded in GBK, not UTF-8, so the code should be:
mydata <- read.csv("http://home.ustc.edu.cn/~lanrr/data.csv",
                   encoding = "GBK",
                   header = TRUE,
                   stringsAsFactors = FALSE)
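One way to check the first point before reading the real file is to look at the bytes directly. The sketch below is only an illustration under stated assumptions: it builds a tiny hypothetical GBK-encoded CSV in a temporary file (the sample data and the names `gbk_csv`, `lines_utf8`, and `mydata_demo` are not from the question), confirms that its bytes are not valid UTF-8, and converts them with `iconv`. `validUTF8` needs R 3.3.0 or later, and whether `iconv` knows the name "GBK" depends on the platform.

```r
# Hypothetical sample: a two-line CSV whose name field is two Chinese
# characters stored as GBK bytes (d6 d0 b9 fa).
gbk_csv <- c(charToRaw("id,name\n1,"),
             as.raw(c(0xd6, 0xd0, 0xb9, 0xfa)),
             charToRaw("\n"))
tmp <- tempfile(fileext = ".csv")
writeBin(gbk_csv, tmp)

# Read the lines as they are, without re-encoding them.
lines <- readLines(tmp)

# GBK bytes are not valid UTF-8, a hint that encoding = "UTF-8" is the wrong guess.
is_utf8 <- validUTF8(lines)

# Convert explicitly from GBK to UTF-8, then parse the converted text.
lines_utf8 <- iconv(lines, from = "GBK", to = "UTF-8")
mydata_demo <- read.csv(text = lines_utf8, header = TRUE, stringsAsFactors = FALSE)
```

For a file or URL, `read.csv` also has a `fileEncoding` argument that re-encodes the input while reading, so `fileEncoding = "GBK"` may be worth trying in a UTF-8 session. According to the R documentation, `encoding` only marks strings as Latin-1 or UTF-8 and does not convert them.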
Second, if your environment's locale is not Chinese (Simplified), you should set the locale, for example (my OS is Windows 7):
Sys.setlocale(category = "LC_ALL", locale = "Chinese (Simplified)")
and then show the table with:
fix(mydata)
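Locale names differ between platforms, so "Chinese (Simplified)" is likely to be rejected outside Windows. A minimal sketch, assuming a hypothetical helper `set_first_available_locale` (not part of R) and candidate names that may or may not be installed on a given machine:

```r
# Hypothetical helper: try each candidate locale name in turn and return the
# first one the operating system accepts, or "" if none could be set.
set_first_available_locale <- function(candidates, category = "LC_CTYPE") {
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the request fails.
    result <- suppressWarnings(Sys.setlocale(category, loc))
    if (nzchar(result)) {
      return(result)
    }
  }
  ""
}

# Candidate names are assumptions: the first is a Windows-style name,
# the second a common Linux/macOS name that must be installed to work.
chosen <- set_first_available_locale(c("Chinese (Simplified)", "zh_CN.UTF-8"))

# Inspect what the session is using now.
Sys.getlocale("LC_CTYPE")
l10n_info()
```

If `chosen` is an empty string, none of the candidates is available and the session locale is unchanged.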