Fastest way to add rows for missing values in a data.frame?
Question
I have a column in my datasets where time periods (Time) are integers ranging from a to b. Sometimes a given group is missing some time periods, and I'd like to fill in those rows with NA. Below is example data for one group (of several thousand).
structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894,
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id",
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")
Id Time Value
1 1 1 0.5687805
2 1 2 -0.7207750
3 1 4 1.2425819
4 1 5 0.6821231
As you can see, Time 3 is missing. Often one or more periods are missing. I can solve this on my own, but I'm afraid I wouldn't be doing it in the most efficient way. My approach would be to create a function that:
1. Generates a sequence from min(Time) to max(Time).
2. Does a setdiff to grab the missing Time values.
3. Converts that vector to a data.frame.
4. Pulls the unique identifier variables (Id and others not listed above) and adds them to this data.frame.
5. Merges the two.
6. Returns the result from the function.
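The steps above can be sketched in base R as follows. The name my_pad_function comes from the question, but its body here is only an assumption about what the author had in mind, and split/lapply/rbind stand in for the plyr calls so the sketch has no package dependencies. The data is the question's example group with Value rounded as printed.

```r
# Hypothetical body for my_pad_function: pads one group's data.frame
# with NA rows for every Time value missing between min(Time) and max(Time).
my_pad_function <- function(df) {
  full_time    <- seq(min(df$Time), max(df$Time))  # every period from a to b
  missing_time <- setdiff(full_time, df$Time)      # periods absent from this group
  if (length(missing_time) == 0) return(df)        # nothing to pad
  # Identifier column plus the missing periods; Value is deliberately absent
  pad <- data.frame(Id = df$Id[1], Time = missing_time)
  out <- merge(df, pad, all = TRUE)                # outer join: Value becomes NA
  out[order(out$Time), ]
}

original_data <- data.frame(
  Id    = c(1, 1, 1, 1),
  Time  = c(1, 2, 4, 5),
  Value = c(0.5687805, -0.7207750, 1.2425819, 0.6821231)
)

# Pad a single group
padded <- my_pad_function(original_data)

# Split by Id, pad each piece, and bind the pieces back together
filled_in_data <- do.call(
  rbind,
  lapply(split(original_data, original_data$Id), my_pad_function)
)
print(filled_in_data)
```

This is the split-apply-combine pattern the question describes; it does one merge per group, which is the cost the question is worried about when there are several thousand groups.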
So the entire process would then be executed as below:
library(plyr)  # provides dlply(), llply() and ldply()
# Split the data into individual data.frames by Id.
temp_list <- dlply(original_data, .(Id))
# pad each data.frame
tlist2 <- llply(temp_list, my_pad_function)
# collapse the list back to a data.frame
filled_in_data <- ldply(tlist2)
Is there a better way than this?
Recommended answer
Following up on comments with Ben Barnes and starting with his mydf3:
library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 NA NA
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 NA
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 NA
[8,] 1 1 8 NA
[9,] 1 1 9 NA
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
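Since mydf3 is not reproduced here, the same cross-join idea can be tried on the question's four-row example. This is a minimal sketch rather than the answer's exact code: the columns are stored as integers (the question describes Time as integer periods), Value is rounded as printed, and grid and padded are illustrative names.

```r
library(data.table)

# The question's example group, with integer Id and Time
DT <- data.table(
  Id    = c(1L, 1L, 1L, 1L),
  Time  = c(1L, 2L, 4L, 5L),
  Value = c(0.5687805, -0.7207750, 1.2425819, 0.6821231)
)
setkey(DT, Id, Time)

# Every Id/Time combination over the observed range of Time
grid <- CJ(Id = unique(DT$Id), Time = seq(min(DT$Time), max(DT$Time)))

# Join DT to the grid on the key; Time 3 has no match, so its Value is NA
padded <- DT[grid]
print(padded)
```

Building the grid as a separate object makes it easy to inspect which Id/Time combinations will be requested before the join runs.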
CJ stands for Cross Join; see ?CJ. The padding with NAs happens because nomatch is NA by default. Set nomatch to 0 instead to drop the rows with no match. If, instead of padding with NAs, the prevailing row is required (the last observation carried forward), just add roll=TRUE. This can be more efficient than padding with NAs and then filling the NAs afterwards. See the description of roll in ?data.table.
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
Id Time Value Id2
[1,] 1 1 -0.262482283 2
[2,] 1 2 -1.423935165 2
[3,] 1 3 0.500523295 1
[4,] 1 4 -1.912687398 1
[5,] 1 5 -1.459766444 2
[6,] 1 6 -0.691736451 1
[7,] 1 7 -0.691736451 1
[8,] 1 8 0.001041489 2
[9,] 1 9 0.495820559 2
[10,] 1 10 -0.673167744 1
First 10 rows of 12800 printed.
setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
Id Id2 Time Value
[1,] 1 1 1 NA
[2,] 1 1 2 NA
[3,] 1 1 3 0.5005233
[4,] 1 1 4 -1.9126874
[5,] 1 1 5 -1.9126874
[6,] 1 1 6 -0.6917365
[7,] 1 1 7 -0.6917365
[8,] 1 1 8 -0.6917365
[9,] 1 1 9 -0.6917365
[10,] 1 1 10 -0.6731677
First 10 rows of 25600 printed.
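To compare the three behaviours side by side, here is a small sketch on the question's example group rather than on mydf3. The object names (grid, padded, matched, rolled) are illustrative, the columns are stored as integers, and Value is rounded as printed in the question.

```r
library(data.table)

DT <- data.table(
  Id    = c(1L, 1L, 1L, 1L),
  Time  = c(1L, 2L, 4L, 5L),
  Value = c(0.5687805, -0.7207750, 1.2425819, 0.6821231)
)
setkey(DT, Id, Time)
grid <- CJ(Id = unique(DT$Id), Time = seq(min(DT$Time), max(DT$Time)))

padded  <- DT[grid]               # default nomatch = NA: Time 3 kept, Value is NA
matched <- DT[grid, nomatch = 0]  # grid rows with no match are dropped
rolled  <- DT[grid, roll = TRUE]  # Time 3 takes the Value observed at Time 2

print(padded)
print(matched)
print(rolled)
```

With roll = TRUE the roll applies to the last key column (Time here), so a missing period inherits the most recent earlier observation within the same Id.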