data.table by = xx: How do I keep the groups of length 0 when i returns no match
Problem description
I'm trying to work around a problem that has arisen from the size of my data and that I haven't been able to find an answer to (see Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table).
library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)
adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID"
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)
lsr <- cbind.data.frame(
c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"
lsr$eksd <- as.Date(lsr$eksd)
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)
adherence <- as.data.table(adherence)
I have tried different methods for achieving the result: a cartesian join gives me more than 2^31 rows and won't work. I rewrote everything in data.table and it literally reduced the run time by days. I've found that if I can get this line to return the desired result, I can create a for loop that looks at "2013-02-01" and 500 other timepoints and achieve my dream (of continuing to another issue). A single subset like the one below takes only 15 s on my data (so I could run it all in a few hours), but my problem is that it returns only the groups with a non-empty subset. ID 2 is not returned, I think because the group has no match in i, which reduces the time spent on the operation.
lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]
ID V1
1: 1 64
2: 3 63
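To see why ID 2 disappears, it can help to inspect the rows that the filter in i keeps before any grouping happens. The following is a minimal sketch that rebuilds the example table compactly; the names `cutoff`, `kept` and `dropped` are introduced here for illustration and are not part of the original code.

```r
library(data.table)

# Compact rebuild of the example table from the question
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

cutoff <- as.Date("2013-02-01")  # illustrative name, not in the original

# Rows that survive the filter in i; only these are passed on to by
kept <- lsr[eksd <= cutoff & ENDDATE > cutoff]
print(kept)

# IDs without any surviving row never form a group in the aggregation
dropped <- setdiff(unique(lsr$ID), unique(kept$ID))
print(dropped)
```

Grouping only sees the rows in `kept`, so an ID listed in `dropped` cannot appear in the aggregated result.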
Under most circumstances that is clever, but I need the information about the groups with length 0 (or whatever value; I just need the ID information not to be dropped). Something like this:
ID V1
1: 1 64
2: 2 0
3: 3 63
I tried using the tidyr::complete function (as explained here: dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output), but dplyr is way too slow. It takes 7 hours on 0.2% of my data. I'm sure this can be achieved somehow. Any suggestions are welcome and appreciated.
Recommended answer
The OP has asked how to fill in the missing IDs which were dropped during the previous aggregation.
Without considering performance issues associated with the OP's aggregation code, one method to complete the IDs is to join with the unique IDs, directly chained with the previous operation:
uid <- sort(unique(lsr$ID))
# OP's code
lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"),
sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID][
# chained with join to complete IDs
.(ID = uid), on = "ID"][is.na(V1), V1 := 0][]
ID V1
1: 1 64
2: 2 0
3: 3 63
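Since the OP intends to repeat this for many timepoints, the same aggregate-then-join idea could be wrapped in a small function and applied to each timepoint in turn. This is only a sketch under assumptions: the helper name `remaining_days`, the result column `timepoint` and the second date "2013-04-01" are hypothetical choices made for illustration, and performance on the full data has not been assessed.

```r
library(data.table)

# Compact rebuild of the example table from the question
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

# Hypothetical helper: remaining days per ID at one cutoff date,
# keeping IDs that have no matching row (filled with 0)
remaining_days <- function(dt, cutoff, ids = sort(unique(dt$ID))) {
  # Aggregate only the rows that match the filter (the OP's code)
  agg <- dt[eksd <= cutoff & ENDDATE > cutoff,
            .(V1 = sum(as.numeric(ENDDATE - cutoff))), keyby = ID]
  # Join with the full set of IDs so that dropped groups reappear as NA
  out <- agg[.(ID = ids), on = "ID"]
  out[is.na(V1), V1 := 0]
  out[, timepoint := cutoff]
  out[]
}

# Two example timepoints; the second one is an assumption for illustration
timepoints <- as.Date(c("2013-02-01", "2013-04-01"))
res <- rbindlist(lapply(seq_along(timepoints),
                        function(k) remaining_days(lsr, timepoints[k])))
print(res)
```

For "2013-02-01" this reproduces the values 64, 0 and 63 shown above. Computing the vector of unique IDs once and passing it in through `ids` would avoid repeating that step for every timepoint.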