Efficiently removing missing values from the start and end of multiple time series in one data frame

Question

Using R, I'm trying to trim NA values from the start and end of each series in a data frame that contains multiple time series. I have achieved my goal using a for loop and the zoo package, but as expected it is extremely inefficient on large data frames.

My data frame looks like this. It contains 3 columns, with each time series identified by its unique id, in this case AAA, B and CCC.

id   date          value
AAA  2010/01/01    NA
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/01/01    NA
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
B    2010/06/01    NA
B    2010/07/01    NA
B    2010/07/01    NA
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200
CCC  2010/05/01    NA

How can I efficiently remove the NA values from the start and end of each time series (here AAA, B and CCC) while keeping any NA values in the middle of a series? The result should look like this.

id   date          value
AAA  2010/02/01    34
AAA  2010/03/01    35
AAA  2010/04/01    30
AAA  2010/05/01    NA
AAA  2010/06/01    28
B    2010/02/01    0
B    2010/03/01    1
B    2010/04/01    2
B    2010/05/01    3
CCC  2010/01/01    0
CCC  2010/02/01    400
CCC  2010/03/01    300
CCC  2010/04/01    200

Recommended answer

I would do it like this, which should be fast:

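library(data.table)
DT = as.data.table(your_data)   # your_data is a placeholder name; please provide something pastable

DT2 = DT[!is.na(value)]         # only the rows that have an observed value
setkey(DT, id, date)
setkey(DT2, id, date)
# For each row of DT, roll the prevailing non-NA value from DT2 forward within id,
# but not past the last one. Written for v1.8.0; later releases deprecated
# rolltolast=TRUE in favour of roll=TRUE, rollends=FALSE.
tokeep = DT2[DT, !is.na(value), rolltolast=TRUE, mult="last"]
DT = DT[tokeep]                 # keep only the rows flagged TRUE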

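This works by rolling forward the prevailing non-NA value within each group, but not past the last one. Rows before the first non-NA value have nothing to roll from, and rows after the last one are not rolled to, so both come back as NA from the join and are dropped. An NA in the middle of a series picks up the earlier value and is kept.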
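As a cross-check on small inputs, the same trimming rule can be sketched in base R without a join. This is not the rolling join from the answer but a different technique: a row is kept when at least one non-NA value occurs at or before it and at least one occurs at or after it within the same id. The function name trim_na_ends and the data frame df are hypothetical, and the rows are assumed to be sorted by date within each id already.

```r
# Example data copied from the question (including the repeated B 2010/07/01 row).
df <- data.frame(
  id    = rep(c("AAA", "B", "CCC"), times = c(6, 8, 5)),
  date  = c(sprintf("2010/%02d/01", 1:6),
            sprintf("2010/%02d/01", c(1:7, 7L)),
            sprintf("2010/%02d/01", 1:5)),
  value = c(NA, 34, 35, 30, NA, 28,
            NA, 0, 1, 2, 3, NA, NA, NA,
            0, 400, 300, 200, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop leading and trailing NA rows per id, keep interior NAs.
# Assumes columns named id and value, and rows ordered by date within each id.
trim_na_ends <- function(df) {
  keep <- ave(!is.na(df$value), df$id, FUN = function(ok) {
    seen_before <- cumsum(ok) > 0            # a non-NA exists at or before this row
    seen_after  <- rev(cumsum(rev(ok))) > 0  # a non-NA exists at or after this row
    seen_before & seen_after
  })
  df[as.logical(keep), , drop = FALSE]
}

out <- trim_na_ends(df)
print(out)
```

On the example data this keeps the NA at AAA 2010/05/01 and drops a series entirely if it contains no observed values. It makes a single vectorised pass per group, so it is mainly useful for checking the data.table result rather than as a replacement on very large data.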

The mult="last" argument is optional. It should speed things up if v1.8.0 (on CRAN) is used; I'd be interested in timings with and without it. By default data.table joins to groups (mult="all"), but in this case we're joining on all columns of the key and we know the key is unique, i.e. there are no duplicates in the key. In v1.8.1 (in development) you don't need to know about this, and it looks after you more.
