How to detect the right encoding for read.csv?


Problem description

I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv, but I am not able to detect the correct encoding. It seems to be some kind of UTF-8. I am using R 2.12.1 on a Windows XP machine. Any help?

Recommended answer

First of all, as discussed in a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.
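One cheap check that is not part of the original answer is to look for a byte order mark (BOM) at the start of the file. The sketch below uses a hypothetical helper, guess_bom, built only on base R's readBin. A missing BOM proves nothing, so this can narrow the search but cannot replace the trial-and-error approach described next.

```r
# Hypothetical helper (not from the original answer): guess an encoding
# from a byte order mark, returning NA when no known BOM is found.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4L)
  starts_with <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == as.raw(sig))
  }
  if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8")
  # Test UTF-32 before UTF-16: FF FE 00 00 also begins with FF FE.
  # (A UTF-16LE file whose first character is U+0000 would be misjudged.)
  if (starts_with(c(0xFF, 0xFE, 0x00, 0x00))) return("UTF-32LE")
  if (starts_with(c(0x00, 0x00, 0xFE, 0xFF))) return("UTF-32BE")
  if (starts_with(c(0xFF, 0xFE))) return("UTF-16LE")
  if (starts_with(c(0xFE, 0xFF))) return("UTF-16BE")
  NA_character_
}
```

Calling guess_bom("encoding.asc") on the file from the question would return one of the names above, or NA if the file carries no BOM.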

I've struggled with this many times and settled on a non-automatic solution:

Use iconvlist to get all possible encodings:

codepages <- setNames(iconvlist(), iconvlist())

Then use them to read the data:

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warning here
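To try the same idea without the original file, here is a minimal self-contained sketch. It makes several assumptions: the sample file is generated on the fly as UTF-16LE, the candidate list is a short hand-picked one rather than the full iconvlist(), and the local iconv is assumed to support "UTF-16LE". Errors and warnings from wrong guesses are silenced instead of printed.

```r
# Build a small tab-separated sample file encoded as UTF-16LE (no BOM).
txt <- "id\tname\n1\talpha\n2\tbeta\n3\tgamma\n"
path <- tempfile(fileext = ".asc")
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# A short, hand-picked candidate list (assumption) instead of iconvlist().
candidates <- c("UTF-8", "latin1", "UTF-16LE")
candidates <- setNames(candidates, candidates)

# Try each encoding; wrong guesses end up as "try-error" objects or
# as data frames with an unexpected shape.
trial <- lapply(candidates, function(enc) {
  suppressWarnings(try(
    read.table(path, fileEncoding = enc, nrows = 3, header = TRUE, sep = "\t"),
    silent = TRUE
  ))
})

# Keep the encodings that produced a data frame of the expected shape.
ok <- vapply(trial, function(d) {
  is.data.frame(d) && identical(dim(d), c(3L, 2L))
}, logical(1))
names(ok)[ok]
```

Matching dimensions is only a filter: a wrong encoding can occasionally produce the expected shape too, so the surviving candidates still need to be inspected by eye.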



What matters here is knowing the structure of the file (separator, header). Set the encoding with the fileEncoding argument, and read only a few rows.

Now you can inspect the results:

unique(do.call(rbind, sapply(x, dim)))
#        [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

The correct result appears to be the one with 3 rows and 29 columns, so let's look at the encodings that produced it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
#    CP1200    UCS-2LE     UTF-16   UTF-16LE      UTF16    UTF16LE 
#  "CP1200"  "UCS-2LE"   "UTF-16" "UTF-16LE"    "UTF16"  "UTF16LE" 

You can also look at the data itself:

x[maybe_ok]

For your file, all of these encodings return identical data (partly because, as you can see, several of the names are aliases for the same encoding).
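A claim like this can be checked by comparing every surviving result with the first one. The sketch below uses a stand-in list named results filled with made-up data frames (both the name and the data are hypothetical). In the workflow above it would be x[maybe_ok].

```r
# Stand-in for x[maybe_ok]: hypothetical results from three candidate encodings.
results <- list(
  "UTF-16LE" = data.frame(id = 1:3, name = c("a", "b", "c")),
  "UCS-2LE"  = data.frame(id = 1:3, name = c("a", "b", "c")),
  "CP1200"   = data.frame(id = 1:3, name = c("a", "b", "c"))
)

# Compare each result to the first; TRUE everywhere means the choice
# among these candidates does not matter for this file.
same_as_first <- vapply(results, function(d) identical(d, results[[1]]), logical(1))
all(same_as_first)
```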

If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and it takes more work to find the correct encodings).
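A minimal sketch of that readLines variant follows, under the assumption that "can't use fileEncoding" means the encoding has to be set on the connection instead, through the encoding argument of file(). The sample file, the short candidate list, and the helper read_lines_as are all hypothetical.

```r
# Self-contained sample: three lines of text written as UTF-16LE.
path <- tempfile()
sample_text <- "first line\nsecond line\nthird line\n"
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Hypothetical helper: readLines has no fileEncoding argument, so the
# encoding is declared on the connection.
read_lines_as <- function(path, enc, n = 3L) {
  con <- file(path, open = "r", encoding = enc)
  on.exit(close(con))
  suppressWarnings(readLines(con, n = n))
}

candidates <- c("UTF-8", "latin1", "UTF-16LE")
candidates <- setNames(candidates, candidates)
lines_by_enc <- lapply(candidates, function(enc) {
  try(read_lines_as(path, enc), silent = TRUE)
})

# readLines returns a character vector, so compare length() rather than dim().
n_lines <- vapply(lines_by_enc, function(l) {
  if (is.character(l) && !inherits(l, "try-error")) length(l) else NA_integer_
}, integer(1))
n_lines
```

The line count alone is a weak signal, so the candidate lines themselves (lines_by_enc) usually need a visual check for stray characters or embedded nulls.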
