How to read csv data with unknown encoding in R

Views: 272 Posted: 2017/2/24 19:40:27 Tags: r csv input encode


Problem description

I have a .csv file that I can view in a web page, but when I read it into R, some of the data cannot be displayed. The data is available here: home.ustc.edu.cn/~lanrr/data.csv
mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", header = T)
View(mydata)  # show something like this:
# 9:39:37   665 600160  �޻��ɷ�  ����    ����    8.050   100 805.00  ��ȯ �ɽ�        
  ��ȯ����   E004017669  665
  2 9:39:38 697 930 ��������    ����    ����    4.360   283 1233.88    
  ����  �ɽ� ����Ʒ����   680001369   697
The data contains some Chinese text, but I don't know whether I need to change the encoding or do something else. Has anyone met this problem before? Specifying UTF-8 did not help:
mydata = read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", 
                   encoding = "UTF-8", header = T, stringsAsFactors = F)
View(mydata)
# 9:39:37   665 600160  <U+00BE><U+07BB><U+00AF><U+00B9><U+0277><dd>    <c2><f4>  
  <U+00B3><f6>  <c2><f2><c2><f4>    8.050   100 805.00  <c8><da><U+022F>     
  <U+00B3><U+027D><U+00BB>  <c8><da><U+022F><c2><f4><U+00B3><f6>    E004017669  665
  2 9:39:38 697 930 <d6><d0><U+0078><c9><fa><U+00BB><U+00AF>    <c2><f4>
  <U+00B3><f6>  <c2><f2><c2><f4>    4.360   283 1233.88 <d0><c5><d3><c3>    
  <U+00B3><U+027D><U+00BB>  <U+00B5><U+00A3><U+00B1><U+00A3><U+01B7><c2><f4><U+00B3> 
  <f6>  680001369   697

sessionInfo()
# R version 2.15.2 (2012-10-26)
  Platform: x86_64-redhat-linux-gnu (64-bit)

  locale:
   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8              
   LC_COLLATE=en_US.UTF-8    
   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=C                   
   LC_NAME=C                 
   [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8  
   LC_IDENTIFICATION=C       

   attached base packages:
   [1] compiler  stats     graphics  grDevices utils     datasets  methods   base     

   other attached packages:
   [1] data.table_1.8.8 TTR_0.22-0       xts_0.9-3        zoo_1.7-9           
   timeDate_2160.97 Matrix_1.0-9     lattice_0.20-10 

   loaded via a namespace (and not attached):
   [1] grid_2.15.2  tools_2.15.2
In the end, I did it this way:
Sys.setlocale("LC_COLLATE", "Chinese")
Sys.setlocale("LC_CTYPE", "Chinese")
Sys.setlocale("LC_MONETARY", "Chinese")
Sys.setlocale("LC_TIME", "Chinese")
Sys.setlocale("LC_MESSAGES", "Chinese")
Sys.setlocale("LC_MEASUREMENT", "Chinese")

Solution

First, that csv file is encoded in GBK, not UTF-8, so the code should be:
mydata <- read.csv("http://home.ustc.edu.cn/~lanrr/data.csv", 
                    encoding = "GBK", 
                    header = TRUE, 
                    stringsAsFactors = FALSE)
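
A note on the `encoding` argument used above: as R's documentation for `read.table` describes it, `encoding` only marks strings as Latin-1 or UTF-8 and does not re-encode the input, whereas `fileEncoding` declares the file's encoding so that it is converted while reading. In a session whose locale is UTF-8 (like the Linux session in the question), a variant along the following lines may be what is needed. This is only a sketch under assumptions, not the answerer's method: it writes a tiny GBK file with made-up contents to a temporary path instead of downloading data.csv, and it assumes a UTF-8 session locale and a platform iconv that supports GBK.

```r
# Build a tiny GBK-encoded CSV in a temp file (hypothetical stand-in for data.csv).
# "\u4e2d\u6587" is two Chinese characters written as escapes, so this script
# itself does not depend on the source file's encoding.
txt <- "id,name\n1,\u4e2d\u6587\n"
gbk_bytes <- iconv(txt, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(gbk_bytes, tmp)

# fileEncoding asks R to convert the file from GBK to the session's
# native encoding while reading it.
demo <- read.csv(tmp,
                 fileEncoding = "GBK",
                 header = TRUE,
                 stringsAsFactors = FALSE)
print(demo)
```

The same `fileEncoding = "GBK"` argument could be tried on the URL from the question in place of `encoding = "GBK"`.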
Second, if your environment's locale is not Chinese (Simplified), you should set the locale, for example (my example OS is Windows 7):

Sys.setlocale(category = "LC_ALL", locale = "Chinese (Simplified)")

and then show the table with:
fix(mydata)
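
When the encoding really is unknown, one rough way to narrow it down in base R is trial conversion: read the raw bytes and see which candidate encodings can decode them without error. The sketch below assumes a hypothetical helper named `try_encodings` and a short, hand-picked candidate list; the sample bytes are the GBK encoding of two Chinese characters. A successful conversion does not prove the encoding is right (single-byte encodings such as Latin-1 accept almost any byte sequence), so this is a way to rule candidates out, not a reliable detector.

```r
# Hypothetical helper: report which candidate encodings can decode the bytes.
try_encodings <- function(bytes, candidates = c("UTF-8", "GBK")) {
  vapply(candidates, function(enc) {
    # iconv() returns NA when the bytes are not valid in `enc`
    !is.na(iconv(list(bytes), from = enc, to = "UTF-8"))
  }, logical(1))
}

# Sample input: the four bytes D6 D0 CE C4 are the GBK encoding of "\u4e2d\u6587".
# For a real file, the bytes could be read with
#   readBin(path, what = "raw", n = file.size(path))
sample_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))
res <- try_encodings(sample_bytes)
print(res)
```

Here the UTF-8 attempt should fail, because D6 D0 is not a valid UTF-8 sequence, while the GBK attempt should succeed, assuming the platform's iconv supports GBK.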


                        