How to detect the right encoding for read.csv?
Problem description
I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv. But I am not able to detect the correct encoding. It seems to be some kind of UTF-8. I am using R 2.12.1 on a Windows XP machine.
Any help?
Recommended answer
First of all, as a more general question on StackOverflow points out, it is not possible to detect the encoding of a file with 100% certainty.
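One cheap heuristic that sometimes narrows things down is to look for a byte order mark (BOM) at the start of the file. The sketch below is only an illustration: guess_bom is a hypothetical helper, not part of base R, and many files carry no BOM at all, in which case it simply returns NA.

```r
# Hypothetical helper: guess an encoding from a leading byte order mark.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4L)
  starts <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == sig)
  }
  # The UTF-32 checks come before UTF-16 because their BOMs share a prefix.
  if (starts(as.raw(c(0xEF, 0xBB, 0xBF)))) return("UTF-8")
  if (starts(as.raw(c(0xFF, 0xFE, 0x00, 0x00)))) return("UTF-32LE")
  if (starts(as.raw(c(0x00, 0x00, 0xFE, 0xFF)))) return("UTF-32BE")
  if (starts(as.raw(c(0xFF, 0xFE)))) return("UTF-16LE")
  if (starts(as.raw(c(0xFE, 0xFF)))) return("UTF-16BE")
  NA_character_  # no BOM found: the encoding is still unknown
}

# Made-up test files: one with a UTF-16LE BOM followed by "a", one without a BOM.
with_bom <- tempfile()
writeBin(as.raw(c(0xFF, 0xFE, 0x61, 0x00)), with_bom)
no_bom <- tempfile()
writeBin(charToRaw("plain text"), no_bom)

bom_guess <- guess_bom(with_bom)
plain_guess <- guess_bom(no_bom)
```

A BOM is a hint rather than proof, so the result is best treated as a first candidate to try.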
I have struggled with this many times and ended up with a non-automatic solution:
Use iconvlist to get all possible encodings:
codepages <- setNames(iconvlist(), iconvlist())
Then try reading the data with each of them:
x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
fileEncoding=enc,
nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warning here
What matters here is knowing the structure of the file (separator, header). Set the encoding with the fileEncoding argument, and read only a few rows.
Now you can inspect the results:
unique(do.call(rbind, sapply(x, dim)))
# [,1] [,2]
# 437 14 2
# CP1200 3 29
# CP12000 0 1
It looks like the correct result is the one with 3 rows and 29 columns, so let's see which encodings produce it:
maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
# CP1200 UCS-2LE UTF-16 UTF-16LE UTF16 UTF16LE
# "CP1200" "UCS-2LE" "UTF-16" "UTF-16LE" "UTF16" "UTF16LE"
You can also look at the data:
x[maybe_ok]
For your file, all of these encodings return identical data (partly because, as you can see, there is some redundancy among the names).
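The whole workflow can be tried end to end on a small made-up file. This is a sketch under assumptions: the file contents, the shortlist of four candidate encodings, and the expected shape of 2 rows by 3 columns are all invented for illustration. Swap in iconvlist() and your own file to scan everything.

```r
# Build a small tab-separated test file encoded as UTF-16LE.
# The data is made up and kept ASCII-only so it converts in any locale.
path <- tempfile(fileext = ".txt")
txt <- "id\tname\tcity\n1\tAnna\tKoeln\n2\tJose\tMalaga\n"
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Hypothetical shortlist of candidates; iconvlist() would give the full set.
candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")
candidates <- setNames(candidates, candidates)

# Try each encoding; failures become NULL instead of stopping the loop.
attempts <- lapply(candidates, function(enc) {
  tryCatch(
    suppressWarnings(read.table(path, fileEncoding = enc, header = TRUE,
                                sep = "\t", nrows = 3)),
    error = function(e) NULL
  )
})

# Keep the encodings whose result has the shape we expect (2 rows, 3 columns).
expected <- c(2L, 3L)
maybe_ok <- vapply(attempts, function(d) identical(dim(d), expected),
                   logical(1))
names(candidates)[maybe_ok]
```

Using tryCatch instead of try keeps the result list free of error objects, which makes the shape check a plain comparison.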
If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you need to do more magic to find the correct ones).
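A possible shape for that readLines variant is sketched below. Since readLines has no fileEncoding argument, the encoding is set on the connection via file(..., encoding = ). The helpers read_with and looks_ok, the test data, and the "3 lines with 3 tab-separated fields each" check are assumptions made for this example; the check is one possible form of the extra "magic", because the line count alone may not tell encodings apart.

```r
# Made-up UTF-16LE test file (ASCII-only content for portability).
path <- tempfile(fileext = ".txt")
txt <- "id\tname\tcity\n1\tAnna\tKoeln\n2\tJose\tMalaga\n"
writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Hypothetical helper: read the first n lines through a re-encoding connection.
read_with <- function(path, enc, n = 3) {
  con <- tryCatch(file(path, open = "r", encoding = enc),
                  error = function(e) NULL)
  if (is.null(con)) return(character(0))
  on.exit(close(con))
  tryCatch(suppressWarnings(readLines(con, n = n)),
           error = function(e) character(0))
}

# Hypothetical check: the right number of lines, each with the same field count.
looks_ok <- function(lines, n_lines = 3L, n_fields = 3L) {
  fields <- strsplit(lines, "\t", fixed = TRUE, useBytes = TRUE)
  length(lines) == n_lines && all(lengths(fields) == n_fields)
}

candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")
candidates <- setNames(candidates, candidates)
lines_by_enc <- lapply(candidates, read_with, path = path)
maybe_ok <- vapply(lines_by_enc, looks_ok, logical(1))
names(candidates)[maybe_ok]
```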