data.table by = xx: How do I keep the groups of length 0 when i returns no match


Question

I'm trying to work around a problem that has arisen due to the size of my data and that I haven't been able to find an answer to (i.e. Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table).

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I have tried different methods for achieving the result: a cartesian join gives me more than 2^31 rows and won't work. I rewrote everything in data.table, which reduced the run time by days. I've found that if I can get the line below to return the desired result, I can write a for loop that goes through "2013-02-01" and 500 other time points and achieve my dream (of moving on to the next issue). One subset like the one below takes only 15 s on my data (so I could run everything in a few hours), but my problem is that it returns only the groups with a non-empty subset. ID 2 is not returned, I think, because the group has no match in i. The goal throughout is to reduce the time spent on the operation.

lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]


    ID V1
1:  1 64
2:  3 63

Under most circumstances that is clever, but I need the information about the groups with length = 0 (or whatever value; I just need not to drop the ID information). Something like this:

   ID V1
1:  1 64
2:  2  0
3:  3 63

I tried using the tidyr::complete function (as explained here: dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output), but dplyr is way too slow. It takes 7 hours on 0.2% of my data. I'm sure this can be achieved somehow. Any suggestions are welcome and appreciated.

Recommended answer

The OP has asked how to fill in the missing IDs which were dropped during the previous aggregation.
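To see why the IDs go missing in the first place, the following minimal sketch may help. It rebuilds the toy lsr table from the question (constructed here with data.table() directly, which is an assumption made for brevity) and compares filtering in i with counting matches in j. The names d, in_i and in_j are illustrative only.

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

d <- as.Date("2013-02-01")

# Filter in i: rows are removed before grouping, so an ID with no
# matching row never forms a group and cannot appear in the result
in_i <- lsr[eksd <= d & ENDDATE > d, .N, keyby = ID]

# No filter in i: every ID forms a group; matching rows are counted in j
in_j <- lsr[, .(n_match = sum(eksd <= d & ENDDATE > d)), keyby = ID]

print(in_i)
print(in_j)
```

In in_i only IDs 1 and 3 appear, while in_j lists all three IDs with a match count of 0 for ID 2.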

Without considering performance issues associated with the OP's aggregation code, one way to complete the IDs is to join with the unique IDs, chained directly to the previous operation:

uid <- sort(unique(lsr$ID))
# OP's code
lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), 
    sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID][
      # chained with join to complete IDs
      .(ID = uid), on = "ID"][is.na(V1), V1 := 0][]




   ID V1
1:  1 64
2:  2  0
3:  3 63
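
As a possible alternative to the join (not part of the original answer), the condition can be moved from i into j, so that every ID forms a group and an empty selection sums to 0. This is only a sketch on the toy data: remaining_days() is a hypothetical helper name, and it has not been timed on data of the OP's size, where evaluating the condition on all rows may well be slower than filtering in i.

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

# Hypothetical helper: days remaining at date d, one row per ID.
# The logical subset inside j may be empty; sum() of an empty
# numeric vector is 0, so such IDs are kept with V1 = 0.
remaining_days <- function(dt, d) {
  dt[, .(V1 = sum(as.numeric(ENDDATE - d)[eksd <= d & ENDDATE > d])),
     keyby = ID]
}

res <- remaining_days(lsr, as.Date("2013-02-01"))
print(res)

# Usage over several time points (two shown here for illustration)
timepoints <- as.Date(c("2013-02-01", "2013-03-01"))
all_res <- rbindlist(lapply(seq_along(timepoints), function(k) {
  out <- remaining_days(lsr, timepoints[k])
  out[, timepoint := timepoints[k]]  # record which date the row refers to
  out
}))
print(all_res)
```

For "2013-02-01" this gives the same three rows as the join above (64, 0, 63). Indexing timepoints by position keeps the Date class intact, which a plain for loop over the vector itself would drop.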

