Getting rid of BOM between SAS and R

Question

I used SAS to save a tab-delimited text file with UTF-8 encoding on a Windows machine. Then I tried to open it in R:

read.table(myfile, header = TRUE, sep = "\t")

To my surprise, the data was totally messed up, but only in a sneaky way. Number values changed randomly, but the overall layout looked normal, so it took me a while to notice the problem, which I'm assuming now is the BOM.

This is not a new issue of course; they address it briefly here, and recommend using

read.table(myfile, fileEncoding = "UTF-8", header = TRUE, sep = "\t")

However, this made no improvement! My only solution was to suppress the header, with or without the fileEncoding argument:

read.table(myfile, fileEncoding = "UTF-8", header = FALSE, sep = "\t")
read.table(myfile, header = FALSE, sep = "\t")

In either case, I have to do some funny business to replace the column names with the first row, but only after I remove some version of the BOM that appears at the beginning of the first column name (<U+FEFF> if I use fileEncoding and ï»¿ if I don't use fileEncoding).
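
One way to confirm that a file really starts with a UTF-8 BOM, independent of how read.table decodes it, is to inspect its first three bytes (EF BB BF). The following is only a minimal sketch: the helper name has_utf8_bom and the temporary files are made up for illustration and are not part of the original question.

```r
# Hypothetical helper: TRUE if the file begins with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Illustrative files: one written with a BOM, one without.
with_bom <- tempfile(fileext = ".txt")
con <- file(with_bom, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)   # the BOM itself
writeBin(charToRaw("id\tvalue\n1\ta\n"), con) # tab-delimited content
close(con)

without_bom <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "1\ta"), without_bom)

bom_found <- has_utf8_bom(with_bom)
bom_absent <- !has_utf8_bom(without_bom)
```

If the check returns FALSE for the exported file, the garbled first column name would have to come from something other than a BOM.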

Isn't there a simple way to just remove the BOM and use read.table without any special arguments?

Update for @Joe: the SAS code I used:

FILENAME myfile 'C:\Documents ... file.txt'  encoding="utf-8";
proc export data=lib.sastable
  outfile=myfile
  dbms=tab  replace;
  putnames=yes;
run;

Update on further weirdness: Using fileEncoding="UTF-8-BOM", as @Joe suggested in his solution below, seems to remove the BOM. However, it did not fix my original motivating problem, which is corruption in the data; the header row is fine, but weirdly the last few digits of the numbers in the first column get messed up. I'll give Joe credit for his answer -- maybe my problem is not actually a BOM issue?

Hack solution: Use fileEncoding="UTF-8-BOM" AND also include the argument colClasses = "character". No idea why this works to fix the data corruption issue -- could be the topic of a future question.
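
The question does not establish why colClasses = "character" helps. One possible explanation, offered here only as an assumption, is that the first column holds digit strings too long to be represented exactly as doubles, so the default type conversion alters the trailing digits. The sketch below uses a made-up 20-digit value in a temporary file to show the difference; it is not the asker's data.

```r
# Illustrative file: one made-up 20-digit ID, too long for exact double storage.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "12345678901234567890\ta"), tmp)

# Default behaviour: the id column is converted to numeric (double).
as_num <- read.table(tmp, header = TRUE, sep = "\t")

# With colClasses = "character", every column is kept as text.
as_chr <- read.table(tmp, header = TRUE, sep = "\t",
                     colClasses = "character")

num_digits <- sprintf("%.0f", as_num$id)  # digits after the round trip via double
chr_digits <- as_chr$id                   # digits exactly as written in the file

unlink(tmp)
```

Comparing num_digits with chr_digits shows whether the numeric conversion changed the value. In the asker's setting the same call would also carry fileEncoding = "UTF-8-BOM".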

Recommended answer

As per your link, it looks like it works for me with:

read.table('c:\\temp\\testfile.txt',fileEncoding='UTF-8-BOM',header=TRUE,sep='\t')

Note the -BOM in the file encoding.
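
To try this without a SAS export, a small file with a leading BOM can be built by hand. This is just a sketch under that assumption; the temporary file and its two columns are invented for the example.

```r
# Build a small tab-delimited file that starts with the UTF-8 BOM (EF BB BF).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id\tvalue\n1\ta\n2\tb\n"), con)
close(con)

# "UTF-8-BOM" asks R to drop a leading BOM, if present, while decoding UTF-8.
df <- read.table(tmp, fileEncoding = "UTF-8-BOM", header = TRUE, sep = "\t")

col_names <- names(df)  # first column name should carry no BOM residue

unlink(tmp)
```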

This is covered in section 2.1, "Variations on read.table", of the R documentation. Under item 12, "Encoding", see "Under UNIX you might need...", which apparently applies even on Windows now (for me, at least).
