data.table by = xx: How do I keep the groups of length 0 when i returns no match

Views: 62. Published: 2020/10/15 20:57:10. Tags: r, data.table, grouping, subset

Problem description

I'm trying to work around a problem that has arisen from the size of my data, and I haven't been able to find an answer to it (see "Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table").

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I have tried different methods for achieving the result. A cartesian join gives me more than 2^31 rows and won't work. I rewrote everything in data.table, and it literally cut the run time by days. I've found that if I can get the line below to return the desired result, I can write a for loop that looks at "2013-02-01" and 500 other time points and achieve my dream (of continuing to another issue). One subset like the one below takes only 15 s on my data (so I could run it all in a few hours), but my problem is that it returns only the groups that have a non-empty subset. ID 2 is not returned, I think because the group has no match in i, which is how data.table reduces the time spent on the operation.

lsr[eksd <= as.Date("2013-02-01") & ENDDATE > as.Date("2013-02-01"), sum(as.numeric(ENDDATE - as.Date("2013-02-01"))), keyby = ID]


    ID V1
1:  1 64
2:  3 63

Under most circumstances that is clever, but I need the information about the groups with length = 0 (or whatever value; I just need the ID information not to be dropped). The result should still contain a row for ID 2.
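Two ways of keeping the empty group can be sketched as follows. This is a minimal sketch, not a benchmarked answer: the names `cutoff`, `res_zero`, `agg`, and `res_na` are introduced here for illustration, and `lsr` is rebuilt compactly from the data in the question. The first variant moves the filter from i into j, so every ID stays a group and an empty selection sums to 0. The second keeps the subset in i and then joins the result back onto the full list of IDs, which leaves NA for IDs without a match.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

cutoff <- as.Date("2013-02-01")  # hypothetical name for the time point

# Variant 1: filter inside j, so no group is dropped; sum() of an
# empty vector is 0, which gives ID 2 the value 0
res_zero <- lsr[, .(V1 = sum(as.numeric(ENDDATE - cutoff)[eksd <= cutoff & ENDDATE > cutoff])),
                keyby = ID]
print(res_zero)  # V1 is 64, 0, 63 for IDs 1, 2, 3

# Variant 2: keep the subset in i, then join the aggregate onto all IDs;
# IDs with no matching rows get NA
agg <- lsr[eksd <= cutoff & ENDDATE > cutoff,
           .(V1 = sum(as.numeric(ENDDATE - cutoff))), keyby = ID]
res_na <- agg[.(ID = unique(lsr$ID)), on = "ID"]
print(res_na)    # V1 is 64, NA, 63 for IDs 1, 2, 3
```

Variant 1 evaluates the condition for every row of every group, so it may be slower than the subset in i on large data; variant 2 keeps the original subset and only adds a small join on the aggregated result.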

I tried using the tidyr::complete function (as explained in "dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output"), but dplyr is way too slow: it takes 7 hours on 0.2% of my data. I'm sure this can be achieved somehow. Any suggestions are welcome and appreciated.
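As an alternative to looping over each time point, the `adherence` table from the question could drive a non-equi join with `by = .EACHI`, which returns one row per row of `adherence`, including ID and time point combinations with no match. The following is a sketch under assumptions: `res_all` is a name introduced here, both tables are rebuilt compactly from the question's data, and how it performs on the full data has not been measured.

```r
library(data.table)

# Same data as in the question, built directly as data.tables
lsr <- data.table(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                   "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr[, ENDDATE := eksd + DDD]

adherence <- data.table(
  ID   = c("1", "2", "3", "1", "2", "3"),
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01",
                   "2013-02-01", "2013-02-01", "2013-02-01"))
)

# For each row of adherence, find the lsr rows of the same ID that started
# on or before the time point and end after it, and sum the remaining days.
# Unmatched rows yield NA for x.ENDDATE, which na.rm = TRUE turns into 0.
res_all <- lsr[adherence,
               on = .(ID, eksd <= year, ENDDATE > year),
               .(V1 = sum(as.numeric(x.ENDDATE - i.year), na.rm = TRUE)),
               by = .EACHI]
print(res_all)
```

In the result, the `eksd` and `ENDDATE` columns both hold the time point taken from `adherence$year`, which is how data.table reports the join columns of a non-equi join; rows come back in the order of `adherence`.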
