Fastest way to add rows for missing time steps?

Question

I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes time periods are missing for a given group. I'd like to fill in those rows with NA. Below is example data for one group (out of several thousand).

structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, 
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", 
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")


  Id Time      Value
1  1    1  0.5687805
2  1    2 -0.7207750
3  1    4  1.2425819
4  1    5  0.6821231

As you can see, Time 3 is missing. Often one or more time periods could be missing. I can solve this on my own, but I am afraid I wouldn't be doing it in the most efficient way. My approach would be to create a function that does the following:

Generate a sequence of time periods from min(Time) to max(Time).

Then do a setdiff to grab the missing Time values.

Convert that vector to a data.frame.

Pull the unique identifier variables (Id and others not listed above) and add them to this data.frame.

Merge the two.

Return the result from the function.
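
The steps listed above could be sketched in base R roughly as follows. This is only an illustration: my_pad_function is the hypothetical name used in the question, and the sketch assumes a single identifier column Id plus the Time and Value columns from the example data. Any additional identifier variables would need to be carried over in the same way.

```r
# Hypothetical padding helper following the steps listed above.
# Assumes columns named Id, Time and Value, as in the example data.
my_pad_function <- function(df) {
  # Full sequence of time periods for this group.
  full_times <- seq(min(df$Time), max(df$Time))
  # Time periods that are absent from the data.
  missing_times <- setdiff(full_times, df$Time)
  if (length(missing_times) == 0) {
    return(df)
  }
  # One NA row per missing time period, tagged with the group's Id.
  padding <- data.frame(Id = df$Id[1], Time = missing_times, Value = NA_real_)
  # Combine with the original rows and restore the time order.
  out <- rbind(df, padding)
  out <- out[order(out$Time), ]
  rownames(out) <- NULL
  out
}

example <- data.frame(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5),
                      Value = c(0.568780482159894, -0.7207749516298,
                                1.24258192959273, 0.682123081696789))
padded <- my_pad_function(example)
```

Here padded has five rows, with Value set to NA in the row for Time 3.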

So the entire process would then be executed as below:

    library(plyr)

    # Split the data into individual data.frames by Id.
    temp_list <- dlply(original_data, .(Id))
    # Pad each data.frame (my_pad_function is the function described above).
    tlist2 <- llply(temp_list, my_pad_function)
    # Collapse the list back into a single data.frame.
    filled_in_data <- ldply(tlist2)

Is there a better way to achieve this?

Recommended answer

Following up on comments with Ben Barnes and starting with his mydf3:

DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch is NA by default. Set nomatch to 0 instead to drop the rows that have no match. If, instead of padding with NAs, the prevailing row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling them in afterwards. See the description of roll in ?data.table.
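
Because mydf3 is not reproduced here, the following minimal sketch applies the same keyed join to the single group from the question, so the effect of nomatch and roll can be compared side by side. The names grid, padded, matched and rolled are illustrative only, and Id and Time are stored as integers here for simplicity.

```r
library(data.table)

# The question's example group; Time 3 is missing.
DT <- data.table(Id = rep(1L, 4), Time = c(1L, 2L, 4L, 5L),
                 Value = c(0.568780482159894, -0.7207749516298,
                           1.24258192959273, 0.682123081696789))
setkey(DT, Id, Time)

# Every Id x Time combination over the observed range of Time.
grid <- DT[, CJ(unique(Id), seq(min(Time), max(Time)))]

# Default nomatch = NA: the row for Time 3 is added with Value NA.
padded <- DT[grid]

# nomatch = 0: grid rows without a match are dropped again.
matched <- DT[grid, nomatch = 0]

# roll = TRUE: the row for Time 3 carries the Time 2 value forward.
rolled <- DT[grid, roll = TRUE]
```

With this data, padded and rolled each have five rows while matched has only the original four.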

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

Instead of setting keys, you may use on. CJ also takes a unique argument. A small example with two 'Id' values:

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
#     Id Time val
# 1:   1    1   1
# 2:   1    2   2
# 3:   1    3  NA
# 4:   1    4   3
# 5:   1    5   4
# 6:   2    1  NA
# 7:   2    2   5
# 8:   2    3   6
# 9:   2    4   7
# 10:  2    5  NA

In this particular case, where one of the vectors in CJ is generated with seq, the result needs to be named explicitly in order to match the names specified in on. When bare variables are used in CJ, though (like 'Id' here), they are auto-named, as in data.table() (as of data.table 1.12.2).
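
One way to check the naming is to build the cross join on its own and inspect its column names before joining. The sketch below reuses d from the example above and assumes data.table 1.12.2 or later; grid and filled are illustrative names.

```r
library(data.table)

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

# Build the grid separately: Id is auto-named from the bare variable,
# while the seq() result is named explicitly as Time.
grid <- d[, CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE)]
names(grid)

# Join the original data onto the grid using the matching names.
filled <- d[grid, on = .(Id, Time)]
```

Here names(grid) gives "Id" and "Time", which is what on = .(Id, Time) expects.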
