R: Losing data during subsetting (data.table)
Problem description
I am losing data when I try to subset my data.table.
Here's the .csv file the data is read from:

    Timestamp,Date,Time,SN,A.Ms.Amp,A.Ms.Vol,A.Ms.Watt,Pac
    2013-10-01 12:00:00,2013-10-01,12:00:00,2110000001,23.04,465.43,10723,13544.5
    2013-10-01 12:00:00,2013-10-01,12:00:00,2110000002,7.81,474.16,3704,6860
    2013-10-01 12:00:00,2013-10-01,12:00:00,2110000003,6.97,484.19,3374,6661
    2013-10-01 12:05:00,2013-10-01,12:05:00,2110000001,23.19,467.05,10830,13576
    2013-10-01 12:05:00,2013-10-01,12:05:00,2110000002,8.4,462.52,3883.5,7366.5
    2013-10-01 12:05:00,2013-10-01,12:05:00,2110000003,7.72,470.6,3631,7169
    2013-10-01 12:10:00,2013-10-01,12:10:00,2110000001,23.98,470.29,11278.5,14127.5
    2013-10-01 12:10:00,2013-10-01,12:10:00,2110000002,8.62,458.47,3952,7475.5
    2013-10-01 12:10:00,2013-10-01,12:10:00,2110000003,7.9,462.62,3654,7182.33
    2013-10-01 12:15:00,2013-10-01,12:15:00,2110000001,24.27,467.37,11342,14193
    2013-10-01 12:15:00,2013-10-01,12:15:00,2110000002,8.61,458.96,3949,7502
    2013-10-01 12:15:00,2013-10-01,12:15:00,2110000003,8.13,458.31,3725,7338
    2013-10-01 12:20:00,2013-10-01,12:20:00,2110000001,22.3,461.71,10279.5,12735.5
    2013-10-01 12:20:00,2013-10-01,12:20:00,2110000002,8.51,461.87,3929,7553.5
    2013-10-01 12:20:00,2013-10-01,12:20:00,2110000003,7.83,462.19,3618.5,7331.5
Here's the code I ran:

    library(data.table)
    library(lubridate)  # provides ymd_hms() and ymd()

    a <- fread("complete1.csv")
    a[, `:=`(Timestamp = ymd_hms(Timestamp),
             Date = ymd(Date),
             SN = as.factor(SN))]
    a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE]
I get this output:

    > a[SN==c("2110000001","2110000002"),c("Timestamp","Date","Time","SN","A.Ms.Watt","Pac"),with=FALSE]
                 Timestamp       Date     Time         SN A.Ms.Watt     Pac
    1: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000001   10723.0 13544.5
    2: 2013-10-01 12:00:00 2013-10-01 12:00:00 2110000002    3704.0  6860.0
    3: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000001   11278.5 14127.5
    4: 2013-10-01 12:10:00 2013-10-01 12:10:00 2110000002    3952.0  7475.5
    5: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000001   10279.5 12735.5
    6: 2013-10-01 12:20:00 2013-10-01 12:20:00 2110000002    3929.0  7553.5
    Warning messages:
    1: In is.na(e1) | is.na(e2) :
      longer object length is not a multiple of shorter object length
    2: In `==.default`(SN, c("2110000001", "2110000002")) :
      longer object length is not a multiple of shorter object length
Unfortunately, I don't quite understand the warnings. But I am losing the rows at every 12:x5:00 timestamp (12:05:00 and 12:15:00). What could I be doing wrong?
Solution

This is not a data.table problem, but an improper operator problem. The operator == is vectorized: it compares element by element and recycles the shorter vector, so odd-numbered rows are compared only with "2110000001" and even-numbered rows only with "2110000002". See what happens when you look at:

    a[,list(Timestamp,SN, SN == c("2110000001","2110000002"))]
                  Timestamp         SN    V3
     1: 2013-10-01 12:00:00 2110000001  TRUE
     2: 2013-10-01 12:00:00 2110000002  TRUE
     3: 2013-10-01 12:00:00 2110000003 FALSE
     4: 2013-10-01 12:05:00 2110000001 FALSE
     5: 2013-10-01 12:05:00 2110000002 FALSE
     6: 2013-10-01 12:05:00 2110000003 FALSE
     7: 2013-10-01 12:10:00 2110000001  TRUE
     8: 2013-10-01 12:10:00 2110000002  TRUE
     9: 2013-10-01 12:10:00 2110000003 FALSE
    10: 2013-10-01 12:15:00 2110000001 FALSE
    11: 2013-10-01 12:15:00 2110000002 FALSE
    12: 2013-10-01 12:15:00 2110000003 FALSE
    13: 2013-10-01 12:20:00 2110000001  TRUE
    14: 2013-10-01 12:20:00 2110000002  TRUE
    15: 2013-10-01 12:20:00 2110000003 FALSE
    Warning message:
    In SN == c("2110000001", "2110000002") :
      longer object length is not a multiple of shorter object length
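
The pattern in that output can be reproduced without the CSV file. The following is a minimal base-R sketch, not the asker's actual session: the vectors sn, wanted and recycled are illustrative names, and sn simply mimics the SN column of the 15 sample rows. It shows that == behaves as if the two-element vector were repeated along all 15 rows.

```r
# Mimic the SN column: three serial numbers repeated for five timestamps.
sn <- factor(rep(c("2110000001", "2110000002", "2110000003"), times = 5))
wanted <- c("2110000001", "2110000002")

# What `==` effectively compares against: `wanted` repeated to length 15.
recycled <- rep(wanted, length.out = length(sn))

# The comparison from the question (warnings silenced for the demo).
eq <- suppressWarnings(sn == wanted)

# Same result as comparing against the explicitly recycled vector.
same_as_recycled <- identical(eq, as.character(sn) == recycled)

# Row positions that come out TRUE.
hits <- which(eq)
print(same_as_recycled)
print(hits)
```

Rows 4 and 5 hold the wanted serial numbers but line up with the other element of the recycled vector, which is why the 12:05:00 and 12:15:00 rows disappear.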

This is documented in the R language manual, in Operators:

"R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures."

If you want TRUE when SN is either of the values c("2110000001","2110000002"), use %in%, like SN %in% c("2110000001","2110000002").
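
As a rough illustration of the membership test in a data.table subset, here is a small sketch. It builds a cut-down stand-in table rather than reading complete1.csv, so the table contents and the names wanted and kept are assumptions made for the example; only the Time and SN columns are included.

```r
library(data.table)

# Stand-in for the real table: 5 timestamps x 3 serial numbers.
a <- data.table(
  Time = rep(c("12:00:00", "12:05:00", "12:10:00", "12:15:00", "12:20:00"),
             each = 3),
  SN   = factor(rep(c("2110000001", "2110000002", "2110000003"), times = 5))
)

wanted <- c("2110000001", "2110000002")

# %in% tests each SN value for membership in `wanted`; nothing is recycled,
# so every timestamp keeps both matching rows and no warning is raised.
kept <- a[SN %in% wanted]

print(kept)
```

With the real data, the same condition would go in the first argument of the original call, leaving the column selection unchanged.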