How to detect the right encoding for read.csv?

Problem description

I have this file (http://b7hq6v.alterupload.com/en/) that I want to read into R with read.csv, but I am not able to detect the correct encoding. It seems to be some kind of UTF-8. I am using R 2.12.1 on a Windows XP machine. Any help?

Recommended answer

First of all, based on a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.

I have struggled with this many times and ended up with a non-automatic solution:

Use iconvlist to get all possible encodings:

codepages <- setNames(iconvlist(), iconvlist())

Then read the data using each of them:

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # expect lots of errors/warnings here

It is important here to know the structure of the file (separator, header). Set the encoding with the fileEncoding argument, and read only a few rows.
Now you can inspect the results:

unique(do.call(rbind, sapply(x, dim)))
#        [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

The correct result seems to be the one with 3 rows and 29 columns, so let's look at the encodings that produce it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
#    CP1200    UCS-2LE     UTF-16   UTF-16LE      UTF16    UTF16LE 
#  "CP1200"  "UCS-2LE"   "UTF-16" "UTF-16LE"    "UTF16"  "UTF16LE" 

You can also look at the data:

x[maybe_ok]

For your file, all of these encodings return identical data (partly because the list contains some redundant names, as you can see).
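
The claim that the candidates agree can be checked rather than eyeballed. The following is a minimal sketch, not part of the original answer: `same_data` is a hypothetical helper name, and it assumes its input is a list of data frames such as `x[maybe_ok]`.

```r
# Hypothetical helper (an illustration, not from the original answer):
# returns TRUE when every data frame in `results` matches the first one.
same_data <- function(results) {
  if (length(results) < 2) return(TRUE)  # nothing to compare against
  reference <- results[[1]]
  all(vapply(results[-1],
             function(d) isTRUE(all.equal(reference, d)),
             logical(1)))
}
```

With the objects built above, `same_data(x[maybe_ok])` should return TRUE if all candidate encodings really decode the file to the same values.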

If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and it takes more work to find the correct candidates).
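
One possible shape of that readLines variant is sketched below. It is an assumption about how the workflow could look, not the original author's code: `read_head` and `scan_encodings` are hypothetical helper names, the encoding is passed to `file()` because `readLines` has no `fileEncoding` argument, and any warning is treated as a failed decode, which may be too strict for some files.

```r
# Hypothetical helper (not from the original answer): read the first `n`
# lines of `path` through a connection that converts from encoding `enc`.
# Returns NULL if R signals any warning or error while reading.
read_head <- function(path, enc, n = 3) {
  con <- file(path, encoding = enc)  # created unopened; readLines() opens it
  on.exit(close(con))                # always release the connection
  tryCatch(readLines(con, n = n),
           warning = function(w) NULL,
           error = function(e) NULL)
}

# Hypothetical helper: try every candidate encoding and keep those that
# yield exactly `n` lines, i.e. the length() analogue of the dim() check.
scan_encodings <- function(path, encodings = iconvlist(), n = 3) {
  heads <- lapply(setNames(encodings, encodings),
                  function(enc) read_head(path, enc, n))
  ok <- vapply(heads, function(h) length(h) == n, logical(1))
  heads[ok]
}
```

Calling `names(scan_encodings("encoding.asc"))` would list the encodings that survive this filter. The surviving lines still need to be inspected by eye, since a wrong encoding can decode without complaint and produce garbage.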
