data.table equivalent of tidyr::complete()

Question

tidyr::complete() adds rows to a data.frame for combinations of column values that are missing from the data. Example:

library(dplyr)
library(tidyr)

df <- data.frame(person = c(1,2,2),
                 observation_id = c(1,1,2),
                 value = c(1,1,1))
df %>%
  tidyr::complete(person,
                  observation_id,
                  fill = list(value=0))

This yields:

# A tibble: 4 × 3
  person observation_id value
   <dbl>          <dbl> <dbl>
1      1              1     1
2      1              2     0
3      2              1     1
4      2              2     1

where the value of the combination person == 1 and observation_id == 2 that is missing in df has been filled in with a value of 0.

What would be the equivalent of this in data.table?

Recommended answer

I reckon that the philosophy of data.table entails fewer specially-named functions for tasks than you'll find in the tidyverse, so some extra coding is required, like:

library(data.table)

res = setDT(df)[
  CJ(person = person, observation_id = observation_id, unique=TRUE), 
  on=.(person, observation_id)
]
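
To make the intermediate state concrete, here is a small self-contained sketch of the same join on the question's toy data. It builds the data.table directly instead of calling setDT, and the name `grid` is only an illustrative choice. The combination that is absent from the data comes back with NA in value.

```r
library(data.table)

# Same toy data as in the question, created as a data.table directly
df <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1))

# CJ() builds every combination of the unique values of both columns
grid <- CJ(person = df$person, observation_id = df$observation_id, unique = TRUE)

# Joining df onto the grid keeps all four combinations;
# the one missing from df (person 1, observation 2) has value NA
res <- df[grid, on = .(person, observation_id)]
print(res)
```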

After this, you still have to fill in the values for the missing combinations manually. In recent versions of data.table, setnafill handles this efficiently and by reference:

setnafill(res, fill = 0, cols = 'value')
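
setnafill is aimed at numeric columns, and support for other types may depend on your data.table version. As a hedged alternative for a non-numeric column, a subset assignment with := also fills by reference. The `label` column and the "unknown" default below are hypothetical and not part of the question's data.

```r
library(data.table)

# Result of the join, with a hypothetical character column added for illustration
res <- data.table(person = c(1, 1, 2, 2),
                  observation_id = c(1, 2, 1, 2),
                  value = c(1, NA, 1, 1),
                  label = c("a", NA, "b", "c"))

# Numeric column: fill NA with a constant, by reference
setnafill(res, fill = 0, cols = "value")

# Character column: update only the rows where label is NA, also by reference
res[is.na(label), label := "unknown"]
print(res)
```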

See @Jealie's answer regarding a feature that will sidestep this.

Certainly, it's crazy that the column names have to be entered three times here. But on the other hand, one can write a wrapper:

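completeDT <- function(DT, cols, defs = NULL){
  # Cross join of the unique values of each column named in cols
  mDT = do.call(CJ, c(DT[, ..cols], list(unique=TRUE)))
  # Join DT onto every combination; combinations missing from DT get NA
  res = DT[mDT, on=names(mDT)]
  # Replace NA with the defaults in defs (named by column)
  if (length(defs)) 
    res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
  res[]
}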

completeDT(setDT(df), cols = c("person", "observation_id"), defs = c(value = 0))

   person observation_id value
1:      1              1     1
2:      1              2     0
3:      2              1     1
4:      2              2     1
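
Because defs is walked with Map, it should also accept a named list, which allows defaults of different types in one call. The sketch below repeats the wrapper so it runs on its own; the `label` column and its "none" default are hypothetical additions to the question's data.

```r
library(data.table)

# Wrapper from the answer, repeated so this block is self-contained
completeDT <- function(DT, cols, defs = NULL){
  mDT = do.call(CJ, c(DT[, ..cols], list(unique=TRUE)))
  res = DT[mDT, on=names(mDT)]
  if (length(defs))
    res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
  res[]
}

# Question's data plus a hypothetical character column
DT <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1),
                 label = c("a", "b", "c"))

# A named list lets each column have a default of its own type
out <- completeDT(DT, cols = c("person", "observation_id"),
                  defs = list(value = 0, label = "none"))
print(out)
```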

As a quick way of avoiding typing the names three times for the first step, here's @thelatemail's idea:

vars <- c("person","observation_id")
df[do.call(CJ, c(mget(vars), unique=TRUE)), on=vars]

# or with magrittr...
c("person","observation_id") %>% df[do.call(CJ, c(mget(.), unique=TRUE)), on=.]

Update: now you don't need to enter names twice in CJ thanks to @MichaelChirico & @MattDowle for the improvement.
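
A sketch of what the shorter call might look like, assuming a data.table version in which CJ names its output columns after the symbols passed to it (1.12.2 or later, as far as I know). The fill step is the same setnafill call as before.

```r
library(data.table)

df <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1))

# With auto-naming, CJ(person, observation_id) produces columns named
# person and observation_id, so the name = value form is not needed
res <- df[CJ(person, observation_id, unique = TRUE), on = .(person, observation_id)]

# Fill the missing combination with 0, by reference
setnafill(res, fill = 0, cols = "value")
print(res)
```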
