fread() fails with missing values in integer64 columns

Problem description

When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This happens only with the default option integer64="integer64"; setting integer64="double" or integer64="character" detects the NAs correctly. Note that V8 and V9 contain three kinds of missing value: an empty field (,,), a field holding a single space (, ,), and the literal NA. Adding na.strings=c("NA","N/A",""," ") and sep="," as options has no effect.

Reading the same file with read.csv() or fread(integer64="double") does not show the problem.

Text to be read (also available as the file integer64_and_NA.csv):

2012,276,,0,"S1","001",1,,724135215,1590915056,
2012,276,2,8,"S1","001",1, ,,154598,0
2012,276,2,12,"S1","001",1,NA,5118863,21819477,
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0

This is the output of fread():

DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",")

Input contains no \n. Taking this to be a filename to open
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok
Found 11 columns
First row with 11 fields occurs on line 1 (either column names or first row of data)
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 5
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows
Type codes: 11114412221 (first 5 rows)
Type codes: 11114412221 (after applying colClasses and integer64)
Type codes: 11114412221 (after applying drop or select (if supplied)
Allocating 11 column slots (11 - 0 NULL)
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 4x11 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total

The resulting data table is:

DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0

NA values are properly detected in columns which are not integer64. For V8 and V9, which fread() marks as integer64, instead of NAs we have "9218868437227407266". Interestingly enough, str() returns the respective values of V8 and V9 as NA:

str(DT)

Classes ‘data.table’ and 'data.frame':  4 obs. of  11 variables:
 $ V1 : int  2012 2012 2012 2012
 $ V2 : int  276 276 276 276
 $ V3 : int  NA 2 2 2
 $ V4 : int  0 8 12 0
 $ V5 : chr  "S1" "S1" "S1" "S1"
 $ V6 : chr  "001" "001" "001" "011"
 $ V7 : int  1 1 1 8
 $ V8 :Class 'integer64'  num [1:4] NA NA NA 1.55e-314
 $ V9 :Class 'integer64'  num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314
 $ V10:Class 'integer64'  num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314
 $ V11: int  NA 0 NA 0
 - attr(*, ".internal.selfref")=<externalptr> 

... but nothing else sees them as NA:

> is.na(DT$V8)
[1] FALSE FALSE FALSE FALSE
> max(DT$V8)
integer64
[1] 9218868437227407266
> max(DT$V8, na.rm=TRUE)
integer64
[1] 9218868437227407266
> class(DT$V8)
[1] "integer64"
> typeof(DT$V8)
[1] "double"

This does not seem to be only a print/screen issue; data.table sees them as huge integers:

> DT[, V12:=as.numeric(V8)]
Warning message:
In as.double.integer64(V8) :
  integer precision lost while converting to double
> DT
     V1  V2 V3 V4 V5  V6 V7                  V8                  V9        V10 V11          V12
1: 2012 276 NA  0 S1 001  1 9218868437227407266           724135215 1590915056  NA 9.218868e+18
2: 2012 276  2  8 S1 001  1 9218868437227407266 9218868437227407266     154598   0 9.218868e+18
3: 2012 276  2 12 S1 001  1 9218868437227407266             5118863   21819477  NA 9.218868e+18
4: 2012 276  2  0 S1 011  8          3127133583          3127133583 9003982501   0 3.127134e+09

Am I missing something about integer64, or is this a bug? As said above, I can work around it with integer64="double", possibly losing some precision, as mentioned in the help file. But the unexpected behavior occurs with the default, integer64="integer64".

This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running Kubuntu 13.10 with CRAN R 3.0.2. On Windows I tested both the latest stable data.table from CRAN (1.8.10 as of 7 Feb 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, installed manually from the zip because the R-Forge build is broken); on Linux I tested only the stable 1.8.10. bit64 is installed and loaded on both machines.

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bit64_0.9-3       bit_1.1-11        gdata_2.13.2      xts_0.9-7         zoo_1.7-10        nlme_3.1-113      hexbin_1.26.3     lattice_0.20-24   ggplot2_0.9.3.1  
[10] plyr_1.8          reshape2_1.2.2    data.table_1.8.11 Revobase_7.0.0    RevoMods_7.0.0    RevoScaleR_7.0.0 

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4       foreach_1.4.1      gtable_0.1.2       gtools_3.2.1       iterators_1.0.6   
 [9] labeling_0.2       MASS_7.3-29        munsell_0.4.2      proto_0.3-10       RColorBrewer_1.0-5 reshape_0.8.4      scales_0.2.3       stringr_0.6.2     
[17] tools_3.0.2      


Recommended answer

This apparently is an issue with the bit64 package, not with fread() or data.table. From the bit64 documentation (http://cran.r-project.org/web/packages/bit64/bit64.pdf):

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the underlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."

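The number in that quote does not look arbitrary. As far as I can tell it is the bit pattern of R's double NA read as a signed 64-bit integer. Here is a minimal base R sketch that prints that pattern. The variable names are mine, and it assumes a platform where NA_real_ keeps its usual payload (typical x86_64 builds):

```r
# Serialize R's double NA and list its bytes, most significant first.
na_bytes <- rev(writeBin(NA_real_, raw(), endian = "little"))
na_hex <- paste(as.character(na_bytes), collapse = "")
print(na_hex)

# The low 32 bits carry the payload that marks this NaN as R's NA.
payload <- strtoi(substr(na_hex, 9, 16), base = 16L)
print(payload)
```

Read as an integer, 0x7FF0000000000000 is 9218868437227405312, and adding the payload 1954 gives 9218868437227407266. That would explain why str(), which looks at the underlying doubles, reports NA while the integer64 methods report a huge integer.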

I tried reassigning the 9218868437227407266 value to NA, thinking it would work. For example:

DT[V8==9218868437227407266, ]
#actually returns nothing, but
DT[V8==max(V8), ]
#returns the rows with 9218868437227407266 in V8
#but this does not reassign the value 
DT[V8==max(V8), V8:=NA]
#not that this makes sense, but I tried just in case...
DT[V8==max(V8), V8:=NA_character_]

So, as the documentation pretty clearly states, if a vector is class integer64 it won't recognize NA or missing values. I'm going to avoid bit64 just to not have to deal with this...
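For anyone who still needs integer64, here is a hedged sketch of one possible repair, assuming the bad entries really hold the double NA bit pattern. It rebuilds the symptom on a plain vector rather than through fread(), and the names v8 and sentinel are mine. It relies on bit64's as.integer64() accepting a character string and on its exported NA_integer64_. Building the sentinel from a string matters because a numeric literal that large is a double and cannot hold the value exactly, which is probably why the first comparison above returns nothing.

```r
library(bit64)

# Rebuild the symptom: double NAs relabelled as integer64 (assumed to mimic V8).
v8 <- c(NA_real_, NA_real_, NA_real_, 0)
oldClass(v8) <- "integer64"
v8[4] <- as.integer64("3127133583")

# Build the sentinel from a string so no precision is lost.
sentinel <- as.integer64("9218868437227407266")

# Swap the sentinel for bit64's own missing value, then check the result.
v8[v8 == sentinel] <- NA_integer64_
print(is.na(v8))
```

The same comparison against sentinel should be usable on a data.table column, though I have not tested that against the data.table versions mentioned in the question.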
