R: Losing data during subsetting (data.table)


Problem description

I am losing data when I try to subset my data.table.

Here's the .csv file the data is read from:

Timestamp,Date,Time,SN,A.Ms.Amp,A.Ms.Vol,A.Ms.Watt,Pac
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000001,23.04,465.43,10723,13544.5
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000002,7.81,474.16,3704,6860
2013-10-01 12:00:00,2013-10-01,12:00:00,2110000003,6.97,484.19,3374,6661
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000001,23.19,467.05,10830,13576
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000002,8.4,462.52,3883.5,7366.5
2013-10-01 12:05:00,2013-10-01,12:05:00,2110000003,7.72,470.6,3631,7169
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000001,23.98,470.29,11278.5,14127.5
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000002,8.62,458.47,3952,7475.5
2013-10-01 12:10:00,2013-10-01,12:10:00,2110000003,7.9,462.62,3654,7182.33
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000001,24.27,467.37,11342,14193
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000002,8.61,458.96,3949,7502
2013-10-01 12:15:00,2013-10-01,12:15:00,2110000003,8.13,458.31,3725,7338
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000001,22.3,461.71,10279.5,12735.5
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000002,8.51,461.87,3929,7553.5
2013-10-01 12:20:00,2013-10-01,12:20:00,2110000003,7.83,462.19,3618.5,7331.5

Here's the code I ran:

library(data.table)
library(lubridate)  # provides ymd_hms() and ymd()
a <- fread("complete1.csv")
a[,`:=`(Timestamp=ymd_hms(Timestamp),
Date=ymd(Date),
SN=as.factor(SN))]
a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE]

I get this output:

   > a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE]
             Timestamp       Date     Time         SN A.Ms.Watt     Pac
1: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000001   10723.0 13544.5
2: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000002    3704.0  6860.0
3: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000001   11278.5 14127.5
4: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000002    3952.0  7475.5
5: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000001   10279.5 12735.5
6: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000002    3929.0  7553.5
Warning messages:
1: In is.na(e1) | is.na(e2) :
  longer object length is not a multiple of shorter object length
2: In `==.default`(SN, c("2110000001", "2110000002")) :
  longer object length is not a multiple of shorter object length

Unfortunately, I don't quite understand the warnings. But I am losing the rows at every 12:x5:00 timestamp (12:05:00 and 12:15:00 in the output above). What could I be doing wrong?

Solution

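This is not a data.table problem; the wrong operator is being used. The operator == is vectorized: it compares element by element and recycles the shorter vector. The two-element right-hand side is therefore repeated along the 15 rows, so row 1 is compared with "2110000001", row 2 with "2110000002", row 3 with "2110000001" again, and so on. Only rows where SN happens to line up with the recycled value come back TRUE, and because 15 is not a multiple of 2 you also get the warning. See what happens when you look at: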

a[,list(Timestamp,SN, SN == c("2110000001","2110000002"))]

              Timestamp         SN    V3
 1: 2013-10-01 12:00:00 2110000001  TRUE
 2: 2013-10-01 12:00:00 2110000002  TRUE
 3: 2013-10-01 12:00:00 2110000003 FALSE
 4: 2013-10-01 12:05:00 2110000001 FALSE
 5: 2013-10-01 12:05:00 2110000002 FALSE
 6: 2013-10-01 12:05:00 2110000003 FALSE
 7: 2013-10-01 12:10:00 2110000001  TRUE
 8: 2013-10-01 12:10:00 2110000002  TRUE
 9: 2013-10-01 12:10:00 2110000003 FALSE
10: 2013-10-01 12:15:00 2110000001 FALSE
11: 2013-10-01 12:15:00 2110000002 FALSE
12: 2013-10-01 12:15:00 2110000003 FALSE
13: 2013-10-01 12:20:00 2110000001  TRUE
14: 2013-10-01 12:20:00 2110000002  TRUE
15: 2013-10-01 12:20:00 2110000003 FALSE
Warning message:
In SN == c("2110000001", "2110000002") :
  longer object length is not a multiple of shorter object length

This is documented in the R language manual, in Operators:

R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures.
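
To make the recycling concrete, here is a minimal base R sketch. It rebuilds only the SN column from the question (three serial numbers over five timestamps); the names sn, wanted, recycled and by_eq are made up for illustration.

```r
# The SN column from the question: three serial numbers repeated for five timestamps.
sn <- rep(c("2110000001", "2110000002", "2110000003"), times = 5)
wanted <- c("2110000001", "2110000002")

# What `==` effectively compares against: `wanted` repeated out to length 15.
recycled <- rep_len(wanted, length(sn))

# The comparison from the question (warning suppressed here only to keep the output clean).
by_eq <- suppressWarnings(sn == wanted)

# Both comparisons agree, so `==` really is matching position by position.
identical(by_eq, sn == recycled)

# Positions that come back TRUE: rows 1, 2, 7, 8, 13, 14, matching the V3 column above.
which(by_eq)
```

The TRUE pattern repeats every six rows because the serial numbers cycle every three rows and the recycled vector every two, which is why the 12:05:00 and 12:15:00 rows drop out.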

If you want TRUE whenever SN matches either of the values c("2110000001","2110000002"), use %in% instead:

SN %in% c("2110000001","2110000002")
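
Applied to the subset from the question, that could look like the sketch below. To keep it self-contained, the table is built inline from the Time, SN and Pac values in the .csv instead of being read with fread(); wanted and res are illustrative names.

```r
library(data.table)

# A cut-down copy of the question's data (Time, SN and Pac columns only).
a <- data.table(
  Time = rep(c("12:00:00", "12:05:00", "12:10:00", "12:15:00", "12:20:00"), each = 3),
  SN   = factor(rep(c("2110000001", "2110000002", "2110000003"), times = 5)),
  Pac  = c(13544.5, 6860, 6661, 13576, 7366.5, 7169, 14127.5, 7475.5,
           7182.33, 14193, 7502, 7338, 12735.5, 7553.5, 7331.5)
)

wanted <- c("2110000001", "2110000002")

# %in% tests each SN for membership in `wanted`, so no recycling is involved.
res <- a[SN %in% wanted, c("Time", "SN", "Pac"), with = FALSE]

nrow(res)         # two rows per timestamp, including 12:05:00 and 12:15:00
unique(res$Time)
```

With %in% the length of the right-hand side no longer matters, so the same line keeps working if more serial numbers are added to wanted.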
