Getting rid of BOM between SAS and R
Question
I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:
read.table(myfile, header =TRUE, sep = "\t")
To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.
This is not a new issue of course; they address it briefly here, and recommend using
read.table(myfile, fileEncoding = "UTF-8", header =TRUE, sep = "\t")
However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:
read.table(myfile, fileEncoding = "UTF-8", header =FALSE, sep = "\t")
read.table(myfile, header =FALSE, sep = "\t")
In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
Isn't there a simple way to just remove the BOM and use read.table without any special arguments?
Update for @Joe: the SAS code I used:
FILENAME myfile 'C:\Documents ... file.txt' encoding="utf-8";
proc export data=lib.sastable
outfile=myfile
dbms=tab replace;
putnames=yes;
run;
Update on further weirdness: Using fileEncoding="UTF-8-BOM" as @Joe suggested in his solution below seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the first column of numbers get messed up. I'll give Joe credit for his answer; maybe my problem is not actually a BOM issue?
Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue; that could be the topic of a future question.
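One possible explanation, offered as an assumption rather than a diagnosis of the original file: if the first column holds long integer IDs (more than roughly 15 to 16 digits), read.table converts them to double precision, which cannot represent every such integer exactly, so the trailing digits can change. Reading the column as character skips that conversion. A minimal sketch with a made-up file (the column names and values are hypothetical):

```r
# Build a small tab-delimited file that starts with a UTF-8 BOM (EF BB BF).
# The column names and the 17-digit ID are made up for illustration.
path <- tempfile(fileext = ".txt")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("id\tvalue\n12345678901234567\t1.5\n")
writeBin(c(bom, body), path)

# Default column typing: the ID is converted to a double, which has only
# 53 bits of mantissa, so a 17-digit integer may not survive exactly.
as_number <- read.table(path, fileEncoding = "UTF-8-BOM",
                        header = TRUE, sep = "\t")

# Character typing: the digits are kept exactly as written in the file.
as_text <- read.table(path, fileEncoding = "UTF-8-BOM",
                      header = TRUE, sep = "\t", colClasses = "character")

print(format(as_number$id, scientific = FALSE))  # may differ in the last digits
print(as_text$id)                                # the original digit string
```

If this is the cause, it is independent of the BOM: the header is fixed by the "UTF-8-BOM" encoding, and the digits are fixed by keeping the ID column as text.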
Recommended answer
As per your link, it looks like it works for me with:
read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')
Note the -BOM in the file encoding.
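To try this without the original SAS export, here is a small self-contained sketch. It writes a throwaway file that begins with the UTF-8 BOM bytes and reads it back with fileEncoding = "UTF-8-BOM". The file contents and column names are hypothetical; only the BOM handling matters.

```r
# Write a tiny tab-delimited file that begins with the UTF-8 BOM bytes.
# File contents are hypothetical; only the BOM handling matters here.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("name\tscore\nalice\t10\nbob\t20\n")), path)

# Confirm the file really starts with EF BB BF.
first_bytes <- readBin(path, what = "raw", n = 3)
print(first_bytes)

# "UTF-8-BOM" tells R to decode as UTF-8 and drop a leading BOM if present.
df <- read.table(path, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t")
print(names(df))
```

With the BOM dropped, the first column name comes back clean, so header = TRUE can be used directly.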
This is in section 2.1, "Variations on read.table", of the R documentation (R Data Import/Export). Under item 12, "Encoding", see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
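If you would rather not rely on the encoding name, a different technique is to strip the three BOM bytes yourself before parsing. The sketch below is an alternative, not what the documentation describes; strip_utf8_bom is a hypothetical helper name, and it assumes the file is UTF-8 and small enough to read into memory.

```r
# Hypothetical helper: read a file's bytes, drop a leading UTF-8 BOM if
# there is one, and return the remaining content as a UTF-8 string.
strip_utf8_bom <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bom   <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  txt <- rawToChar(bytes)
  Encoding(txt) <- "UTF-8"
  txt
}

# Made-up test file with a BOM in front of the header row.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("name\tscore\nalice\t10\n")), path)

# Parse the cleaned text; no fileEncoding argument is needed here.
df <- read.table(text = strip_utf8_bom(path), header = TRUE, sep = "\t")
print(names(df))
```

The fileEncoding = "UTF-8-BOM" route is shorter; this version mainly makes explicit what is being removed.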