data.table::fread doesn't like missing values in first column
Problem description
Is this a bug in data.table::fread (version 1.9.2), or a misplaced user expectation/error?
Consider this trivial example: I have a TAB-separated table of values, some of which may be missing. If values are missing in the first column, fread gets upset, but if the missing values are elsewhere I get the data.table I expect:
# Data with a missing value in the first column (third row) and in the last column (second row):
12    876    19
23    39
      15    20
fread("12\t876\t19\n23\t39\t\n\t15\t20")
#Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") :
# Not positioned correctly after testing format of header row. ch=' '
# Data with missing values in the last column, rows two and three:
12    876    19
23    39
15    20
fread("12\t876\t19\n23\t39\t\n15\t20\t")
# V1 V2 V3
#1: 12 876 19
#2: 23 39 NA
#3: 15 20 NA
# Returns as expected.
Is this a bug, or is it not possible to have missing values in the first column (or do I have malformed data somehow)?
Solution
I believe this is the same bug that I reported here.
The most recent version that I know will work with this type of input is Rev. 1180. You could check out and build that version by adding @1180 to the end of the svn checkout command.
svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable/@1180
If you're not familiar with checking out and building packages, see here.
But a lot of great features, bug fixes, and enhancements have been implemented since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So a better solution is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then rebuild the package.
You can find those files online, without checking anything out, at the URLs below (sorry, I can't figure out how to post links that include '*', so you have to copy/paste):
fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable
fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
Once you've rebuilt the package, you'll be able to read your tsv file.
> fread("12\t876\t19\n23\t39\t\n\t15\t20")
V1 V2 V3
1: 12 876 19
2: 23 39 NA
3: NA 15 20
The downside to doing this is that the old version of fread() does not pass a newer test: you won't be able to read fields that have quotes in the middle.
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") :
Not positioned correctly after testing format of header row. ch=','
With newer versions of fread, you would get this:
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
A B C
1: 1.2 Foo"Bar a"b"c"d
2: fo"o bar b,az"
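To find out which of the two behaviours a particular build has, both inputs from this thread can be run through a small wrapper. This is only a sketch: can_read is a hypothetical helper, not part of data.table, and the check is skipped when the package is not installed. No result is shown because it depends on the version in use.

```r
# Hypothetical helper (not part of data.table): TRUE if `reader` parses
# `input` without raising an error, FALSE otherwise.
can_read <- function(reader, input) {
  tryCatch({
    reader(input)
    TRUE
  }, error = function(e) FALSE)
}

# The two inputs discussed in this thread.
first_col_missing <- "12\t876\t19\n23\t39\t\n\t15\t20"
quotes_in_fields  <- 'A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n'

# Only run the check when data.table is actually installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(packageVersion("data.table"))
  print(can_read(data.table::fread, first_col_missing))
  print(can_read(data.table::fread, quotes_in_fields))
}
```

Two TRUE values would suggest the installed build handles both cases; a FALSE shows which input still triggers an error.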
So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.
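If it is unclear which of the two cases a given file falls into, a rough scan of its lines can help decide. The sketch below uses base R only; scan_input is a hypothetical name, and the quote check is a coarse heuristic (it flags any double quote, not only quotes in the middle of a field).

```r
# Hypothetical helper: report which of the two problem patterns appear.
# `lines` is a character vector of raw lines, `sep` the field separator.
scan_input <- function(lines, sep = "\t") {
  lines <- lines[nzchar(lines)]  # ignore empty lines
  c(
    # A line that starts with the separator has an empty first field.
    empty_first_field = any(startsWith(lines, sep)),
    # Coarse check: any double quote anywhere in the data.
    has_double_quote  = any(grepl('"', lines, fixed = TRUE))
  )
}

# The tab-separated example from the question, split into lines.
tsv <- strsplit("12\t876\t19\n23\t39\t\n\t15\t20", "\n", fixed = TRUE)[[1]]
print(scan_input(tsv))
```

For a file on disk, the same function could be applied to the result of readLines(), with the path supplied by the reader.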