更有效的方法计算每个案件的创建时间的开箱 [英] More efficient method for counting open cases as of creation time of each case

查看:125
本文介绍了更有效的方法计算每个案件的创建时间的开箱的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想找一种更有效的方法来计算每个案例的创建时间开放的案例数。案例在其创建日期/时间戳与其检查器日期/时间戳之间是打开的。您可以复制并粘贴以下代码,查看一个简单的功能示例:

I am trying find a more efficient way to count the number of cases that are open as of the creation time of each case. A case is "open" between its creation date/time stamp and its censor date/time stamp. You can copy-paste the code below to view a simple functional example:

# Create a bunch of date/time stamps for our example
two_thousand                <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_one            <- as.POSIXct("2001-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_two            <- as.POSIXct("2002-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_three          <- as.POSIXct("2003-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_four           <- as.POSIXct("2004-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_five           <- as.POSIXct("2005-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_six            <- as.POSIXct("2006-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_seven          <- as.POSIXct("2007-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eight          <- as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_nine           <- as.POSIXct("2009-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_ten            <- as.POSIXct("2010-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eleven         <- as.POSIXct("2011-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

mid_two_thousand            <- as.POSIXct("2000-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_one        <- as.POSIXct("2001-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_mid_two    <- as.POSIXct("2002-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_three      <- as.POSIXct("2003-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_four       <- as.POSIXct("2004-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_five       <- as.POSIXct("2005-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_six        <- as.POSIXct("2006-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_seven      <- as.POSIXct("2007-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eight      <- as.POSIXct("2008-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_nine       <- as.POSIXct("2009-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_ten        <- as.POSIXct("2010-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eleven     <- as.POSIXct("2011-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

# Create a table that has pairs of created & censored date/time stamps for cases, indicating the range during which each case is "open"
comparison_table    <- data.table(id        = 1:10,
                                  created   = c(two_thousand, two_thousand_two, two_thousand_four, two_thousand_six, two_thousand_eight, two_thousand_ten, two_thousand, two_thousand_six, two_thousand_three, two_thousand_one),
                                  censored  = c(two_thousand_one, two_thousand_three, two_thousand_five, two_thousand_seven, two_thousand_nine, two_thousand_eleven, two_thousand_five, two_thousand_ten, two_thousand_eight, two_thousand_four));

# Create a table that has the creation date/time stamps at which we want to count all the open cases
check_table         <- data.table(id        = 1:12,
                                  creation  = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_mid_two, mid_two_thousand_three, mid_two_thousand_four, mid_two_thousand_five, mid_two_thousand_six, mid_two_thousand_seven, mid_two_thousand_eight, mid_two_thousand_nine, mid_two_thousand_ten, mid_two_thousand_eleven)); 

# I use the DPLYR library as the group_by() + summarize() functions make this operation simple
library(dplyr);

# Group by id to set parameter for summarize() function 
check_table_grouped <- group_by(check_table, id);

# For each id in the table, sum the number of times that its creation date/time stamp is greater than the creation date/time and less than the censor date/time of all cases in the comparison table
# EDIT: Also added timing to compare with method below
system.time(check_table_summary <- summarize(check_table_grouped, other_open_values_at_creation_count = sum((comparison_table$created < creation & comparison_table$censored > creation))));

# Result is as desired
check_table_summary;              

# EDIT: Added @David-arenburg's solution with timing
library(data.table);
setDT(check_table)[, creation2 := creation];
setkey(comparison_table, created, censored);
system.time(foverlaps_table <- foverlaps(check_table, comparison_table, by.x = c("creation", "creation2"))[, sum(!is.na(id)), by = i.id]);

# Same results as above
foverlaps_table;

此方法适用于小数据集,如本示例中的数据集。然而,即使我使用向量化操作,计算时间也呈指数增长,因为操作计数是:(3 * nrow比较)*(nrow sum(nrow)计算)。在nrow = 10,000时,时间为约14s,在nrow = 100,000时,时间> 20分钟。我的实际nrow是〜1,000,000。

This approach works fine for small data sets like the one in this example. However, even though I'm using vectorized operations, the computation time grows exponentially because the operation count is: (3 * nrow comparisons) * (nrow sum(nrow) calculations). At nrow=10,000, time is around 14s, at nrow=100,000, time is > 20 minutes. My actual nrow is ~ 1,000,000.

有更有效的方法来做这个计算吗?我目前正在研究多核选项,但即使这些只会线性减少执行时间。您的帮助是赞赏。谢谢!

Is there a more efficient way to do this calculation? I'm currently looking into multicore options, but even those will only linearly reduce the execution time. Your help is appreciated. Thanks!

编辑:添加@ David-arenburg的 data.table :: foverlaps 解决方案,更快, 1000.但是,对于更大数量的行,它比 summarize 解决方案慢。在10,000行,它是两倍长。在50,000行,我放弃了等待10倍之后。有趣的是, foverlaps 解决方案似乎不会触发自动垃圾收集,所以不断坐在最大RAM(64GB在我的系统),而总结解决方案定期触发自动垃圾收集,所以从不超过〜40GB的RAM。我不确定这是否与速度差异有关。

Added @David-arenburg's data.table::foverlaps solution, which also works and is faster for nrow < 1000. However, it is slower than the summarize solution for larger numbers of rows. At 10,000 rows it was twice as long. At 50,000 rows, I gave up waiting after 10x as long. Interestingly, the foverlaps solution doesn't seem to trigger automatic garbage collection, so constantly sits at max RAM (64GB on my system) whereas the summarize solution periodically triggers the automatic garbage collection, so never exceeds ~ 40GB of RAM. I'm unsure if this is related to the speed differences.

FINAL EDIT:我已经以一种方式重写了问题,使受访者更容易生成具有适当创建/删除日期时间的大表。我还简化和解释了这个问题更清楚,清楚表明查找表非常大(违反 data.table :: foverlaps 假设)。我甚至内置时序比较,使它超级简单的测试与大案例!详情如下:高效在大数据集中提交每个案例时计算打开案例的方法

FINAL I have re-written the question in a way that makes it much easier for respondents to generate large tables with suitable created/censored dateTimes. I've also simplified and explained the problem more clearly, making it clear that the lookup table is very large (violating data.table::foverlaps assumptions). I've even built in timing comparison to make it super simple to test with large cases! Details here: Efficient method for counting open cases at time of each case's submission in large data set

再次感谢您的帮助大家! :)

Thanks again for your help everyone! :)

推荐答案

另一个 foverlaps 解决方案。假设 comparison_table 不太大

Yet another foverlaps solution. Assuming that comparison_table is not too large

library(data.table);
setkey(comparison_table, created, censored);    
times <- sort(unique(c(comparison_table$created, comparison_table$censored)))
dt <- data.table(creation=times+1)[, creation2 := creation];
setkey(dt, creation, creation2)
x <- foverlaps(comparison_table, dt, by.x = c("created", "censored"))[,.N,creation]$N
check_table$newcol <- x[findInterval(check_table$creation, times)]

这篇关于更有效的方法计算每个案件的创建时间的开箱的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆