fread()失败,整数值列中缺少值 [英] fread() fails with missing values in integer64 columns
问题描述
在阅读下面的文本时, fread()
无法检测到第8列和第9列中的缺失值。这仅在默认选项中使用integer64 = integer64
。设置 integer64 = double
或字符
可以正确检测到 NA
s。请注意,该文件在V8和V9中具有三种可能的NA:,
; ,,
;和 NA
。附加 na.strings = c( NA, N / A,,),sep =,
作为选项无效。
When reading the text below, fread()
fails to detect the missing values in columns 8 and 9. This is only with the default option integer64="integer64"
. Setting integer64="double"
or "character"
correctly detects NA
s. Note that the file has three types of possible NAs in V8 and V9-- ,,
; , ,
; and NA
. Appending na.strings=c("NA","N/A",""," "), sep=","
as options has no effect.
使用 read.csv()
与 fread(integer = double )
。
要读取的文本(也可作为文件整数64_and_NA.csv ):
2012,276,,0,"S1","001",1,,724135215,1590915056,
2012,276,2,8,"S1","001",1, ,,154598,0
2012,276,2,12,"S1","001",1,NA,5118863,21819477,
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0
这是 fread()
的输出:
DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")
Input contains no \n. Taking this to be a filename to open
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
Found 11 columns
First row with 11 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 5
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
Type codes: 11114412221 (first 5 rows)
Type codes: 11114412221 (after applying colClasses and integer64)
Type codes: 11114412221 (after applying drop or select (if supplied)
Allocating 11 column slots (11 - 0 NULL)
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
0.000s ( 0%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.000s ( 0%) Allocation of 4x11 result (xMB) in RAM
0.000s ( 0%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
0.001s Total
结果数据表为:
DT
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0
3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA
4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0
NA
值可以在不是 integer64
。对于V8和V9,其 fread()
标记为integer64,而不是NA,我们使用的是 9218868437227407266。
有趣的是, str()
将V8和V9的相应值返回为 NA
:
NA
values are properly detected in columns which are not integer64
. For V8 and V9, which fread()
marks as integer64, instead of NAs we have "9218868437227407266".
Interestingly enough, str()
returns the respective values of V8 and V9 as NA
:
str(DT)
Classes ‘data.table’ and 'data.frame': 4 obs. of 11 variables:
$ V1 : int 2012 2012 2012 2012
$ V2 : int 276 276 276 276
$ V3 : int NA 2 2 2
$ V4 : int 0 8 12 0
$ V5 : chr "S1" "S1" "S1" "S1"
$ V6 : chr "001" "001" "001" "011"
$ V7 : int 1 1 1 8
$ V8 :Class 'integer64' num [1:4] NA NA NA 1.55e-314
$ V9 :Class 'integer64' num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
$ V10:Class 'integer64' num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
$ V11: int NA 0 NA 0
- attr(*, ".internal.selfref")=<externalptr>
...但是没有其他人将其视为 NA
:
... but nothing else sees them as NA
:
is.na(DT$V8)
[1] FALSE FALSE FALSE FALSE
max(DT$V8)
integer64
[1] 9218868437227407266
> max(DT$V8, na.rm=TRUE)
integer64
[1] 9218868437227407266
> class(DT$V8)
[1] "integer64"
> typeof(DT$V8)
[1] "double"
仅是打印/屏幕问题, data.table
会将它们视为巨大的整数:
It does not seem to be a print/screen issue only, data.table
sees them as huge integers:
DT[, V12:=as.numeric(V8)]
Warning message:
In as.double.integer64(V8) :
integer precision lost while converting to double
> DT
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA 9.218868e+18
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0 9.218868e+18
3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA 9.218868e+18
4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0 3.127134e+09
我缺少有关 integer64
的东西,还是这是一个错误?如上所述,我可以使用 integer64 = double
来解决问题,这可能会失去一些精度,如帮助文件中所述。但是意外行为是默认的 integer64
...
Am I missing something about integer64
, or is this a bug? As said above, I can get around using integer64="double"
, possibly losing some precision, as mentioned in the help file. But the unexpected behavior is with the default integer64
...
这是在Windows 8.1 64位上完成的运行Revolution R 3.0.2的计算机以及运行kubuntu 13.10,CRAN-R 3.0.2的虚拟机。已通过CRAN(截至2014年2月7日为1.8.10)和1.8.11(rev.1110,2014-02-04 02:43:19)的最新稳定数据表进行测试,并从zip中手动安装为r-forge在Windows上,版本已损坏),在Linux上,仅稳定版1.8.10。
This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running kubuntu 13.10, CRAN-R 3.0.2. Tested with the latest stable data.table from CRAN (1.8.10 as of 7 Feb 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, manually installed from the zip as the r-forge build is broken) on Windows, and only the stable 1.8.10 on linux. bit64 is installed and loaded on both machines.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] bit64_0.9-3 bit_1.1-11 gdata_2.13.2 xts_0.9-7 zoo_1.7-10 nlme_3.1-113 hexbin_1.26.3 lattice_0.20-24 ggplot2_0.9.3.1
[10] plyr_1.8 reshape2_1.2.2 data.table_1.8.11 Revobase_7.0.0 RevoMods_7.0.0 RevoScaleR_7.0.0
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 foreach_1.4.1 gtable_0.1.2 gtools_3.2.1 iterators_1.0.6
[9] labeling_0.2 MASS_7.3-29 munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5 reshape_0.8.4 scales_0.2.3 stringr_0.6.2
[17] tools_3.0.2
推荐答案
这显然是bit64软件包的问题,而不是 fread( )
或 data.table
。从 bit64
文档 http://cran.r-project.org/web/packages/bit64/bit64.pdf
This apparently is an issue with the bit64 package, not fread()
or data.table
. From the bit64
documentation http://cran.r-project.org/web/packages/bit64/bit64.pdf
对不存在的元素进行下标并使用NA进行下标目前不支持。这种下标当前返回9218868437227407266而不是NA(底层双码的NA值)。遵循完全R行为会破坏性能或需要大量的C编码。
"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the un-derlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."
我尝试将9218868437227407266的值重新分配给NA,认为它可以工作
I tried reassigning the 9218868437227407266 value to NA thinking it would work
Ex。
DT[V8==9218868437227407266, ]
#actually returns nothing, but
DT[V8==max(V8), ]
#returns the rows with 9218868437227407266 in V8
#but this does not reassign the value
DT[V8==max(V8), V8:=NA]
#not that this makes sense, but I tried just in case...
DT[V8==max(V8), V8:=NA_character_]
因此,正如文档中明确指出的那样,如果向量是integer64类,它将无法识别NA或缺少值。我将避免使用bit64只是不必处理此问题。
So as the documentation pretty clearly states, if a vector is class integer64 it won't recognize NA or missing values. I've going to avoid bit64 just to not have to deal with this...
这篇关于fread()失败,整数值列中缺少值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!