Fastest way to add rows for missing values in a data.frame?

Views: 145; Posted: 2017/3/12 9:59:18; Tags: r, dataframe, plyr, data.table


Question

I have a column in my datasets where the time periods (Time) are integers ranging from a to b. Sometimes time periods are missing for a given group. I'd like to fill in those rows with NA. Below is example data for one group (of several thousand).

structure(list(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5), Value = c(0.568780482159894, 
-0.7207749516298, 1.24258192959273, 0.682123081696789)), .Names = c("Id", 
"Time", "Value"), row.names = c(NA, 4L), class = "data.frame")


  Id Time      Value
1  1    1  0.5687805
2  1    2 -0.7207750
3  1    4  1.2425819
4  1    5  0.6821231

As you can see, Time 3 is missing. Often one or more time periods could be missing. I can solve this on my own, but I am afraid I would not be doing it in the most efficient way. My approach would be to create a function that does the following:

Generate a sequence of time periods from min(Time) to max(Time).

Then do a setdiff to grab the missing Time values.

Convert that vector to a data.frame.

Pull the unique identifier variables (Id and others not listed above), and add them to this data.frame.

Merge the two.

Return the result from the function.
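The steps above could be sketched in base R roughly as follows. This is only an illustration, not the asker's actual function: the body of my_pad_function is an assumption, and it assumes Id is the only identifier column, whereas the real data has others.

    # Hypothetical helper: pad one group's data.frame so Time covers min..max.
    my_pad_function <- function(df) {
      full_time <- seq(min(df$Time), max(df$Time))   # every expected period
      missing_time <- setdiff(full_time, df$Time)    # periods absent from df
      if (length(missing_time) == 0) return(df)      # nothing to pad
      # Assumption: Id is the only identifier that has to be carried over.
      pad <- data.frame(Id = df$Id[1], Time = missing_time)
      out <- merge(df, pad, all = TRUE)              # outer join on Id and Time
      out[order(out$Time), ]                         # padded rows have Value = NA
    }

    # The single group shown in the question.
    one_group <- data.frame(Id = c(1, 1, 1, 1), Time = c(1, 2, 4, 5),
                            Value = c(0.568780482159894, -0.7207749516298,
                                      1.24258192959273, 0.682123081696789))
    padded <- my_pad_function(one_group)

Here padded has one extra row for Time 3 with Value set to NA.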

So the entire process would then be executed as below:

    library(plyr)
    # Split the data into individual data.frames by Id.
    temp_list <- dlply(original_data, .(Id))
    # Pad each data.frame.
    tlist2 <- llply(temp_list, my_pad_function)
    # Collapse the list back to a data.frame.
    filled_in_data <- ldply(tlist2)

Is there a faster way to do this?

Recommended answer

Following up on the comments with Ben Barnes, and starting with his mydf3:

library(data.table)
DT = as.data.table(mydf3)
setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time)))]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7           NA  NA
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time)))]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5         NA
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7         NA
 [8,]  1   1    8         NA
 [9,]  1   1    9         NA
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed.


CJ stands for Cross Join; see ?CJ. The rows are padded with NA because nomatch is NA by default. Set nomatch to 0 instead to drop the non-matching rows. If, instead of padding with NA, you want each missing row to take the values of the prevailing (previous) row, just add roll=TRUE. This can be more efficient than padding with NA and then filling in the NAs afterwards. See the description of roll in ?data.table.

setkey(DT,Id,Time)
DT[CJ(unique(Id),seq(min(Time),max(Time))),roll=TRUE]
      Id Time        Value Id2
 [1,]  1    1 -0.262482283   2
 [2,]  1    2 -1.423935165   2
 [3,]  1    3  0.500523295   1
 [4,]  1    4 -1.912687398   1
 [5,]  1    5 -1.459766444   2
 [6,]  1    6 -0.691736451   1
 [7,]  1    7 -0.691736451   1
 [8,]  1    8  0.001041489   2
 [9,]  1    9  0.495820559   2
[10,]  1   10 -0.673167744   1
First 10 rows of 12800 printed. 

setkey(DT,Id,Id2,Time)
DT[CJ(unique(Id),unique(Id2),seq(min(Time),max(Time))),roll=TRUE]
      Id Id2 Time      Value
 [1,]  1   1    1         NA
 [2,]  1   1    2         NA
 [3,]  1   1    3  0.5005233
 [4,]  1   1    4 -1.9126874
 [5,]  1   1    5 -1.9126874
 [6,]  1   1    6 -0.6917365
 [7,]  1   1    7 -0.6917365
 [8,]  1   1    8 -0.6917365
 [9,]  1   1    9 -0.6917365
[10,]  1   1   10 -0.6731677
First 10 rows of 25600 printed.
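Since mydf3 is not reproduced here, the same three joins can be tried on the single group from the question. This is a minimal sketch under that assumption: the object names grid, padded, matched and rolled are illustrative, and Time is stored as an integer so that both sides of the join have the same type. As in the answer, the grid spans the overall minimum and maximum Time across all Ids.

    library(data.table)

    # Sample group from the question (Time 3 is missing).
    DT <- data.table(Id = c(1, 1, 1, 1), Time = c(1L, 2L, 4L, 5L),
                     Value = c(0.568780482159894, -0.7207749516298,
                               1.24258192959273, 0.682123081696789))
    setkey(DT, Id, Time)

    # Every Id crossed with every Time in the overall range.
    grid <- CJ(Id = unique(DT$Id), Time = seq(min(DT$Time), max(DT$Time)))

    padded  <- DT[grid]                # unmatched rows are kept, Value is NA
    matched <- DT[grid, nomatch = 0]   # unmatched rows are dropped
    rolled  <- DT[grid, roll = TRUE]   # unmatched rows take the prevailing Value

With this data, padded and rolled each have five rows and matched has four; the Time 3 row is NA in padded and carries the Time 2 value in rolled.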
