Fastest way to add rows for missing time steps?


Question

I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes time periods are missing for a given group. I'd like to fill in those rows with NA. Below is example data for one group (of several thousand).

structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, 
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", 
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")


  Id Time      Value
1  1    1  0.5687805
2  1    2 -0.7207750
3  1    4  1.2425819
4  1    5  0.6821231

As you can see, Time 3 is missing. Often one or more time periods are missing. I can solve this on my own, but I am afraid I wouldn't be doing it in the most efficient way. My approach would be to create a function that:

Generates a sequence of time periods from min(Time) to max(Time).

Does a setdiff to grab the missing Time values.

Converts that vector to a data.frame.

Pulls the unique identifier variables (Id and others not listed above) and adds them to this data.frame.

Merges the two.

Returns from the function.
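The steps above could be sketched roughly as follows. This is only an assumed base R implementation of my_pad_function (the question names the function but does not show its body), and the example data frame is a shortened, hypothetical version of the sample data.

```r
# Hypothetical implementation of my_pad_function (base R only).
# Assumes `df` holds a single group with columns Id, Time and Value.
my_pad_function <- function(df) {
  # 1. Full sequence of time periods for this group
  all_times <- seq(min(df$Time), max(df$Time))
  # 2. Time values that are absent from the data
  missing_times <- setdiff(all_times, df$Time)
  if (length(missing_times) == 0) return(df)
  # 3-4. Data frame of the missing rows, carrying the group identifier
  pad <- data.frame(Id = df$Id[1], Time = missing_times)
  # 5. Merge; all = TRUE keeps the padded rows, whose Value becomes NA
  out <- merge(df, pad, by = c("Id", "Time"), all = TRUE)
  # 6. Return the rows ordered by Time
  out[order(out$Time), ]
}

# Illustrative data for one group, with Time 3 missing
example <- data.frame(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5),
                      Value = c(0.57, -0.72, 1.24, 0.68))
padded <- my_pad_function(example)
print(padded)
```

A group with more identifier columns would need them added to `pad` and to the `by` argument in the same way.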

So the entire process would then be executed as below:

    library(plyr)  # provides dlply, llply and ldply

    # Split the data into individual data.frames by Id.
    temp_list <- dlply(original_data, .(Id))
    # Pad each data.frame.
    tlist2 <- llply(temp_list, my_pad_function)
    # Collapse the list back to a data.frame.
    filled_in_data <- ldply(tlist2)

Is there a better way to achieve this?

Recommended answer

Following up on comments with Ben Barnes and starting with his mydf3:

library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 

CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch is NA by default. Set nomatch to 0 instead to drop the rows with no match. If, instead of padding with NAs, the prevailing row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling them in afterwards. See the description of roll in ?data.table.

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed. 
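Since mydf3 is not reproduced here, the three join behaviours described above (default NA padding, nomatch = 0, and roll = TRUE) can be compared on a small self-contained table. This is a minimal sketch; the toy data and the names grid, padded, matched and rolled are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical): one Id with Time 3 missing
DT <- data.table(Id = 1L, Time = c(1L, 2L, 4L, 5L), Value = c(10, 20, 40, 50))
setkey(DT, Id, Time)

# Full grid of Id x Time combinations
grid <- CJ(Id = unique(DT$Id), Time = seq(min(DT$Time), max(DT$Time)))

padded  <- DT[grid]               # default nomatch = NA: Time 3 appears with Value NA
matched <- DT[grid, nomatch = 0]  # grid rows without a match are dropped
rolled  <- DT[grid, roll = TRUE]  # Time 3 takes the prevailing Value from Time 2

print(padded)
print(matched)
print(rolled)
```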

Instead of setting keys, you may use on. CJ also takes a unique argument. A small example with two 'Id' values:

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
#     Id Time val
# 1:   1    1   1
# 2:   1    2   2
# 3:   1    3  NA
# 4:   1    4   3
# 5:   1    5   4
# 6:   2    1  NA
# 7:   2    2   5
# 8:   2    3   6
# 9:   2    4   7
# 10:  2    5  NA

In this particular case, where one of the vectors in CJ was generated with seq, the result needs to be named explicitly in order to match the names specified in on. Bare variables used in CJ (like 'Id' here) are auto-named, as in data.table() (from data.table 1.12.2 onwards).
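The naming behaviour can be checked directly by inspecting the column names that CJ produces. The sketch below assumes data.table 1.12.2 or later; the standalone vectors Id and Time and the variable names unnamed_cols and named_cols are illustrative only.

```r
library(data.table)

# Standalone vectors standing in for the columns of the example table
Id <- c(1L, 2L)
Time <- c(1, 2, 4, 5)

# Without an explicit name, the seq() result gets a default column name,
# so on = .(Id, Time) would have no "Time" column to join on
unnamed_cols <- names(CJ(Id, seq(min(Time), max(Time))))

# Naming the argument gives the column the name that on expects
named_cols <- names(CJ(Id, Time = seq(min(Time), max(Time))))

print(unnamed_cols)
print(named_cols)
```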
