data.table::fread doesn't like missing values in first column

Question

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?

Consider this trivial example where I have a table of values, TAB separated, with possibly missing values. If a value is missing in the first column, fread gets upset, but if the missing values are elsewhere I get back the data.table I expect:

# Data with missing value in first column, third row and last column, second row:
12	876	19
23	39	
	15	20

fread("12	876	19
23	39	
	15	20")
#Error in fread("12	876	19
23	39	
	15	20") : 
#  Not positioned correctly after testing format of header row. ch='	'

# Data with missing values last column, rows two and three:
"12	876	19
23	39	
15	20	"

fread( "12	876	19
23	39	
15	20	" )
#   V1  V2 V3
#1: 12 876 19
#2: 23  39 NA
#3: 15  20 NA
# Returns as expected.

Is this a bug or is it not possible to have missing values in the first column (or do I have malformed data somehow?).

Recommended answer

I believe this is the same bug that I reported here.

The most recent version that I know will work with this type of input is Rev. 1180. You could check out and build that version by adding @1180 to the end of the svn checkout command.

svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/@1180

If you're not familiar with checking out and building packages, see here

But a lot of great features, bug fixes, and enhancements have been implemented since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So a better solution is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then rebuild the package.

You can find those files online without checking them out here (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):

fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable

fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
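If copy/pasting those links is awkward, the same two URLs can be assembled in R. This is only a sketch: `viewvc_url()` is a hypothetical helper (not part of data.table or R-Forge), the revision numbers are the ones quoted above, and `checkout_dir` is an assumed name for wherever the svn checkout landed. Whether R-Forge still serves these revisions is not something this snippet checks, so the download step is left commented out.

```r
# Hypothetical helper: build the ViewVC "*checkout*" URL for one file
# under pkg/ at a given revision of the datatable repository.
viewvc_url <- function(path, revision, root = "datatable") {
  sprintf(
    "http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/%s?revision=%d&root=%s",
    path, as.integer(revision), root
  )
}

# File paths and revisions as given in the answer above.
old_files <- c("R/fread.R" = 988, "src/fread.c" = 1159)
urls <- mapply(viewvc_url, names(old_files), old_files)
print(urls)

# Assumption: checkout_dir is the directory that contains the pkg/ folder.
# for (p in names(urls)) {
#   download.file(urls[[p]], destfile = file.path(checkout_dir, "pkg", p))
# }
```

After overwriting the two files, the package still has to be rebuilt and reinstalled as described above.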

Once you've rebuilt the package, you'll be able to read your tsv file.

> fread("12	876	19
23	39	
	15	20")
   V1  V2 V3
1: 12 876 19
2: 23  39 NA
3: NA  15 20

The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.

> fread('A,B,C
1.2,Foo"Bar,"a"b"c"d"
fo"o,bar,"b,az""
')
Error in fread("A,B,C
1.2,Foo"Bar,"a"b"c"d"
fo"o,bar,"b,az""
") : 
  Not positioned correctly after testing format of header row. ch=','

With newer versions of fread, you would get this:

> fread('A,B,C
1.2,Foo"Bar,"a"b"c"d"
fo"o,bar,"b,az""
')
      A       B       C
1:  1.2 Foo"Bar a"b"c"d
2: fo"o     bar   b,az"

So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.
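To see which case a given input falls into before picking a version, one rough check is to look for rows that begin with the separator, which means the first field is empty. The function below is a hypothetical helper written in base R, not something data.table provides, and it assumes the input is a single string with "\n" line endings. It does not look for embedded quotes.

```r
# Hypothetical helper (base R only): return the row numbers whose first
# field is empty, i.e. whose line starts with the separator.
rows_missing_first_field <- function(text, sep = "\t") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  which(startsWith(lines, sep))
}

# The two inputs from the question, written with explicit escapes.
bad  <- "12\t876\t19\n23\t39\t\n\t15\t20"   # empty first field in row 3
good <- "12\t876\t19\n23\t39\t\n15\t20\t"   # gaps only in the last column

rows_missing_first_field(bad)    # 3
rows_missing_first_field(good)   # integer(0)
```

A non-empty result suggests the input is of the kind that triggers the error shown in the question.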
