Efficiently removing missing values from the start and end of multiple time series in 1 data frame
Problem description
Using R, I'm trying to trim NA values from the start and end of each time series in a data frame that contains multiple time series. I have achieved this with a for loop and the zoo package but, as expected, it is extremely inefficient on large data frames.
My data frame looks like this. It contains 3 columns, with each time series identified by its unique id: in this case AAA, B and CCC.
id date value
AAA 2010/01/01 NA
AAA 2010/02/01 34
AAA 2010/03/01 35
AAA 2010/04/01 30
AAA 2010/05/01 NA
AAA 2010/06/01 28
B 2010/01/01 NA
B 2010/02/01 0
B 2010/03/01 1
B 2010/04/01 2
B 2010/05/01 3
B 2010/06/01 NA
B 2010/07/01 NA
B 2010/07/01 NA
CCC 2010/01/01 0
CCC 2010/02/01 400
CCC 2010/03/01 300
CCC 2010/04/01 200
CCC 2010/05/01 NA
How can I efficiently remove the NA values from the start and end of each time series (here AAA, B and CCC), while keeping any NA values in the middle? The result should look like this:
id date value
AAA 2010/02/01 34
AAA 2010/03/01 35
AAA 2010/04/01 30
AAA 2010/05/01 NA
AAA 2010/06/01 28
B 2010/02/01 0
B 2010/03/01 1
B 2010/04/01 2
B 2010/05/01 3
CCC 2010/01/01 0
CCC 2010/02/01 400
CCC 2010/03/01 300
CCC 2010/04/01 200
Recommended answer
I would do it like this, which should be very fast:
require(data.table)
DT = as.data.table(your data) # please provide something pastable
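DT2 = DT[!is.na(value)]   # only the rows that have an observed value
setkey(DT,id,date)        # key (and sort) both tables by id, then date
setkey(DT2,id,date)
# Rolling join: carry each id's previous observation forward to every row of DT,
# but never past that id's last observation (rolltolast=TRUE). Rows before the
# first or after the last observation get no match, so value is NA for them.
tokeep = DT2[DT,!is.na(value),rolltolast=TRUE,mult="last"]
DT = DT[tokeep]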
The mult="last" is optional. It should speed things up if v1.8.0 (on CRAN) is used; I'd be interested in timings with and without it. By default, data.table joins to groups (mult="all"), but in this case we're joining to all columns of the key, and we know the key is unique, i.e., there are no duplicates in the key. In v1.8.1 (in development) you don't need to know about this; it looks after you more.
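For readers who want something pastable to compare against, here is a minimal base R sketch of the same trimming rule. It is not the rolling join from the answer above: it swaps in a cumulative-sum mask per id, and the helper name inside_span and the object names df and trimmed are made up for illustration. It assumes the rows are already sorted by date within each id, as in the question's data.

```r
# The question's sample data, rebuilt as a pastable data frame.
df <- data.frame(
  id = c(rep("AAA", 6), rep("B", 8), rep("CCC", 5)),
  date = as.Date(c(
    "2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01", "2010-05-01", "2010-06-01",
    "2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01", "2010-05-01", "2010-06-01",
    "2010-07-01", "2010-07-01",
    "2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01", "2010-05-01"
  )),
  value = c(NA, 34, 35, 30, NA, 28,
            NA, 0, 1, 2, 3, NA, NA, NA,
            0, 400, 300, 200, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for positions lying between the first and the
# last non-NA value of x (inclusive), FALSE for leading and trailing NAs.
inside_span <- function(x) {
  ok <- !is.na(x)
  seen_before <- cumsum(ok) > 0            # a non-NA exists at or before here
  seen_after  <- rev(cumsum(rev(ok))) > 0  # a non-NA exists at or after here
  seen_before & seen_after
}

# ave() applies the helper within each id and returns the result in the
# original row order; it comes back as 0/1 numeric, hence as.logical().
keep <- as.logical(ave(df$value, df$id, FUN = inside_span))
trimmed <- df[keep, ]
```

On the sample data this keeps the interior NA of AAA (2010-05-01) and drops only the NAs at the edges of each series. Because it is vectorised within each group, it avoids the row-by-row loop, though no timings against the data.table approach are claimed here.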