Fastest way to add rows for missing time steps?
Question
I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes a given group is missing some time periods. I'd like to fill in those rows with NA. Below is example data for 1 (of several thousand) groups.
structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894,
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id",
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")
Id Time Value
1 1 1 0.5687805
2 1 2 -0.7207750
3 1 4 1.2425819
4 1 5 0.6821231
As you can see, Time 3 is missing. Often one or more could be missing. I can solve this on my own but am afraid I wouldn't be doing this the most efficient way. My approach would be to create a function that:
1. Generates a sequence of time periods from min(Time) to max(Time).
2. Then does a setdiff to grab the missing Time values.
3. Converts that vector to a data.frame.
4. Pulls the unique identifier variables (Id and others not listed above) and adds them to this data.frame.
5. Merges the two.
6. Returns the result from the function.
So the entire process would then get executed as below:
library(plyr)  # provides dlply, llply and ldply
# Split the data into individual data.frames by Id.
temp_list <- dlply(original_data, .(Id))
# Pad each data.frame (my_pad_function is the function described above).
tlist2 <- llply(temp_list, my_pad_function)
# Collapse the list back to a data.frame.
filled_in_data <- ldply(tlist2)
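The question does not show my_pad_function itself, so the following is only a minimal sketch of the steps listed above, written in base R so it runs without plyr. The function body, the shortened example values, and the use of split/lapply/rbind in place of dlply/llply/ldply are assumptions for illustration, not the asker's actual code.

```r
# Hypothetical implementation of the padding function described in the question.
# Assumes each group has a single Id and columns named Id, Time and Value.
my_pad_function <- function(df) {
  full_times <- seq(min(df$Time), max(df$Time))   # every time step from min to max
  missing_times <- setdiff(full_times, df$Time)   # time steps absent from this group
  if (length(missing_times) == 0) return(df)      # nothing to pad
  pad <- data.frame(Id = df$Id[1], Time = missing_times)
  out <- merge(df, pad, by = c("Id", "Time"), all = TRUE)  # Value is NA on padded rows
  out[order(out$Time), ]
}

# Example data modelled on the question (values rounded).
original_data <- data.frame(
  Id = c(1, 1, 1, 1),
  Time = c(1, 2, 4, 5),
  Value = c(0.5687805, -0.7207750, 1.2425819, 0.6821231)
)

# Base R equivalent of the split / pad / combine pipeline.
padded_list <- lapply(split(original_data, original_data$Id), my_pad_function)
filled_in_data <- do.call(rbind, padded_list)
rownames(filled_in_data) <- NULL
filled_in_data
```

This works, but it runs one merge per group, which is the overhead the question is worried about when there are thousands of groups.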
Is there a better way to achieve this?
Recommended answer
Following up on comments with Ben Barnes and starting with his mydf3:
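library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
# Join DT to every Id/Time combination; combinations with no match are padded with NA
DT[CJ(unique(Id),seq(min(Time),max(Time)))]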
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 NA NA
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 NA
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 NA
[8,] 1 1 8 NA
[9,] 1 1 9 NA
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch is NA by default. Set nomatch to 0 instead to drop the rows that have no match. If, instead of padding with NAs, the prevailing row is required, just add roll=TRUE. This can be more efficient than padding with NAs and then filling them afterwards. See the description of roll in ?data.table.
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 -0.691736451 1
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 -1.9126874
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 -0.6917365
[8,] 1 1 8 -0.6917365
[9,] 1 1 9 -0.6917365
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
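Because mydf3 is not reproduced here, the three behaviours described above (default NA padding, nomatch = 0, and roll = TRUE) can be compared on a small table. This is a minimal sketch; the data and the names grid, padded, matched and rolled are made up for illustration and do not come from the original answer.

```r
library(data.table)

# Illustrative data (not from the original post): Id 1 has no row for Time 3.
DT <- data.table(Id = c(1L, 1L, 1L, 1L),
                 Time = c(1L, 2L, 4L, 5L),
                 Value = c(10, 20, 40, 50))
setkey(DT, Id, Time)

# All Id/Time combinations over the observed time range.
grid <- CJ(unique(DT$Id), seq(min(DT$Time), max(DT$Time)))

padded  <- DT[grid]               # default nomatch = NA: Time 3 appears with Value NA
matched <- DT[grid, nomatch = 0]  # grid rows without a match are dropped
rolled  <- DT[grid, roll = TRUE]  # Time 3 takes the Value of the prevailing row (Time 2)
```

Here padded has five rows with one NA, matched has the original four rows, and rolled has five rows with the Time 2 value carried forward to Time 3.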
Instead of setting keys, you may use on. CJ also takes a unique argument. A small example with two 'Id' values:
d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)
d[CJ(Id, Time = seq(min(Time), max(Time)), unique = TRUE), on = .(Id, Time)]
# Id Time val
# 1: 1 1 1
# 2: 1 2 2
# 3: 1 3 NA
# 4: 1 4 3
# 5: 1 5 4
# 6: 2 1 NA
# 7: 2 2 5
# 8: 2 3 6
# 9: 2 4 7
# 10: 2 5 NA
In this particular case, where one of the vectors in CJ was generated with seq, the result needs to be named explicitly in order to match the names specified in on. Bare variables in CJ (like 'Id' here), however, are auto-named, as in data.table() (from data.table 1.12.2).
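If the installed data.table version is unknown, one way to sidestep the auto-naming question is to name every CJ argument explicitly. The sketch below reuses the example table d from above; the intermediate variable grid is introduced here only for illustration.

```r
library(data.table)

d <- data.table(Id = rep(1:2, 4:3), Time = c(1, 2, 4, 5, 2, 3, 4), val = 1:7)

# Name both arguments so the grid's columns match `on` regardless of auto-naming.
grid <- CJ(Id = d$Id, Time = seq(min(d$Time), max(d$Time)), unique = TRUE)
names(grid)

# Same join as above, with the grid built separately.
res <- d[grid, on = .(Id, Time)]
res
```

Building the grid as a separate object also makes it easy to inspect its column names before joining.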