Efficient method for counting open cases at time of each case's submission in large data set


Problem description

In a large data set (~1M cases), each case has a "created" and a "censored" dateTime. I want to count the number of other cases that were open at the time each case was created. Cases are open between their "created" and "censored" dateTimes.
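
To make the expected semantics concrete, here is a minimal sketch on a hypothetical three-case table (toy data, not part of the benchmark below); the brute-force count it produces is exactly what the solutions below compute more efficiently:

library(data.table);

# Hypothetical three-case table: case 2 is created while case 1 is still open;
# case 3 is created after case 1 is censored but while case 2 is still open
example <- data.table(id       = 1:3,
                      created  = as.POSIXct(c("2000-01-01", "2000-06-01", "2001-06-01"), tz="UTC"),
                      censored = as.POSIXct(c("2001-01-01", "2002-01-01", "2003-01-01"), tz="UTC"));

# Brute-force definition of the desired count: cases created before,
# and censored after, each case's creation time
example[, open_cases_at_creation := sapply(created, function(ct)
    sum(example$created < ct & example$censored > ct))];

example$open_cases_at_creation;   # expected: 0 1 1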

Several solutions work well on small data sets (<100,000 cases), but computation time grows roughly quadratically; my estimate is that it scales as about 3n^2. By n=100,000 cases, computation time is >20 mins on my server with 6 * 4GHz cores & 64GB RAM. Even with multi-core libraries, at best, I could reduce the time by a factor of 8 or 10. Not enough to handle ~1M cases.

I'm looking for a more efficient method to do this calculation. Below I have provided a function that lets you easily create large numbers of "created" and "censored" dateTime pairs, along with the solutions tried so far, using the dplyr and data.table libraries. Timings are reported to the user for simplicity. You can simply change the "CASE_COUNT" variable at the top and re-execute to view the timings again, making it easy to compare the timing of any other solutions you might suggest.

I will update the original post with other solutions to give proper credit to their authors. Thanks in advance for your help with this!

# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);

# Set seed for reproducibility 
set.seed(123)

# Set number of cases & date range variables
CASE_COUNT  <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");
RANGE_END   <- as.POSIXct("2012-01-01 00:00:00", 
                          format="%Y-%m-%d %H:%M:%S", 
                          tz="UTC", origin="1970-01-01");

# Select which solutions you want to run in this test           
RUN_SOLUTION_1 <- TRUE;     # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE;     # data.table::foverlaps()
RUN_SOLUTION_3 <- TRUE;     # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE;     # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE;     # data.table::frank()

# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)

generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
    # Measure duration between start_val & end_val
    duration <- as.numeric(difftime(end_val, start_val, units="secs"));

    # Select random values in duration to create start_offset
    start_offset   <- runif(n, 0, duration);

    # Calculate the creation time list
    created_list  <- start_offset + start_val;

    # Calculate acceptable time range for censored values
    # since they must always be after their respective creation value
    censored_range <- as.numeric(difftime(end_val, created_list, units="secs"));

    # Select random values in duration to create end_offset
    creation_to_censored_times <- runif(n, 0, censored_range);

    censored_list <- created_list + creation_to_censored_times;

    # Create and return a data.table with creation & censor values
    # calculated from start or end with random offsets
    return_table  <- data.table(id       = 1:n,
                                created  = created_list,
                                censored = censored_list);

    return(return_table);
}

# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();

solution_1_function <- function (cases_table) { 
    # SOLUTION 1: Using dplyr::summarize:

    # Group by id to set parameters for summarize() function 
    cases_table_grouped <- group_by(cases_table, id);

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() within summarize()

    cases_table_summary <- summarize(cases_table_grouped, 
                           open_cases_at_creation = sum((cases_table$created  < created & 
                                                         cases_table$censored > created)));
    solution_1_table <<- as.data.table(cases_table_summary, key="id");        
} # End solution_1_function

solution_2_function <- function (cases_table) {
    # SOLUTION 2: Using data.table::foverlaps:

    # Adapted from solution provided by @Davidarenburg
    # (https://stackoverflow.com/users/3001626/david-arenburg)

    # The foverlaps() solution tends to crash R with large case counts
    # I suspect it has to do with memory assignment of the very large objects
    # It maxes RAM on my system (64GB) before crashing, possibly attempting
    # to write beyond its assigned memory limits.
    # I'll submit a reproducible bug report to the data.table team since
    # foverlaps() is pretty new and known to be occasionally unstable

    if (CASE_COUNT > 50000) {
        stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
    }

    setDT(cases_table)[, created_dupe := created];
    setkey(cases_table, created, censored);

    foverlaps_table  <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
                                  cases_table[,c("id","created","censored"),    with=FALSE], 
                                  by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];

    foverlaps_table  <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);

    solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function

solution_3_function <- function (cases_table) {    
    # SOLUTION 3: Using data.table aggregation instead of dplyr::summarize

    # Idea suggested by @jangorecki
    # (https://stackoverflow.com/users/2490497/jangorecki)

    # Count the instances where other cases were created before
    # and censored after each case using vectorized sum() with data.table aggregation

    cases_table_aggregated <- cases_table[order(id), sum((cases_table$created  < created &
                                                          cases_table$censored > created)), by=id];

    solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");

} # End solution_3_function

solution_4_function <- function (cases_table) { 
    # SOLUTION 4: Using IRanges package

    # Adapted from solution suggested by @alexis_laz
    # (https://stackoverflow.com/users/2414948/alexis-laz)

    # The IRanges package generates ranges efficiently, intended for genome sequencing
    # but working perfectly well on this data, since POSIXct values are numeric-representable
    solution_4_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = countOverlaps(IRanges(cases_table$created, 
                                                                    cases_table$created), 
                                                            IRanges(cases_table$created, 
                                                                    cases_table$censored))-1, key="id");

} # End solution_4_function

solution_5_function <- function (cases_table) {
    # SOLUTION 5: Using data.table::frank()

    # Adapted from solution suggested by @danas.zuokas
    # (https://stackoverflow.com/users/1249481/danas-zuokas)

    n <- CASE_COUNT;

    # For every case compute the number of other cases
    # with `created` less than `created` of other cases
    r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];

    # For every case compute the number of other cases
    # with `censored` less than `created`
    r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];

    solution_5_table <<- data.table(id      = cases_table$id,
                     open_cases_at_creation = r1 - r2, key="id");

} # End solution_5_function;

# Execute user specified functions;
if (RUN_SOLUTION_1)
    solution_1_timing <- system.time(solution_1_function(cases_table)); 
if (RUN_SOLUTION_2) {
    solution_2_timing <- try(system.time(solution_2_function(cases_table)));
    # Remove the helper column only if solution 2 got far enough to create it
    if ("created_dupe" %in% names(cases_table))
        cases_table[, created_dupe := NULL];
}
if (RUN_SOLUTION_3)
    solution_3_timing <- system.time(solution_3_function(cases_table)); 
if (RUN_SOLUTION_4)
    solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
    solution_5_timing <- system.time(solution_5_function(cases_table));         

# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
    same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
    same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
    same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
    same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
    same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
    same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
    same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
    same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
    same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
    same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}


same_check    <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
                     same_check2_3, same_check2_4, same_check2_5, same_check3_4,
                     same_check3_5, same_check4_5);

# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
    "It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
    cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
    cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
    cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
    cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
    cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");

The data.table::foverlaps() solution is faster for fewer cases (<5,000 or so; the exact crossover depends on the randomness of the data in addition to n, since foverlaps() optimizes with a binary search). The dplyr::summarize() solution is faster for more cases (>5,000 or so). Much beyond 100,000 cases, neither solution is viable, as both are too slow.

EDIT: Added a third solution, based on the idea suggested by @jangorecki, that uses data.table aggregation instead of dplyr::summarize() and is otherwise similar to the dplyr solution. For up to around 50,000 cases it is the fastest solution. Beyond 50,000 cases, the dplyr::summarize() solution is slightly faster, but not by much. Sadly, for 1M cases it is still not practical.

EDIT2: Added a fourth solution, adapted from the solution suggested by @alexis_laz, that uses the IRanges package and its countOverlaps function. It is significantly faster than the other three solutions; with 50,000 cases it was almost 400% faster than solutions 1 & 3.

EDIT3: Modified the case-generating function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.

EDIT4: Rewrote to allow the user to select which solutions to execute, and to use system.time() with garbage collection before each execution for more accurate timing comparisons (per @jangorecki's astute observation). Also added some condition checks for crash-prone cases.
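
For reference, a minimal sketch of that timing pattern; gcFirst = TRUE is actually system.time()'s default, so garbage collection runs before each timed expression:

# Time one solution with garbage collection beforehand (gcFirst defaults to TRUE)
solution_1_timing <- system.time(solution_1_function(cases_table), gcFirst=TRUE);
solution_1_timing[3];   # elapsed time in seconds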

EDIT5: Added a fifth solution, adapted from the solution suggested by @danas.zuokas, that uses rank(). My experimentation suggests that it is always at least an order of magnitude slower than the other solutions: at 10,000 cases it takes 44 seconds, vs. 3.5 seconds for the dplyr::summarize() solution and 0.36 seconds for the IRanges solution.

FINAL EDIT: I've made slight modifications to solution 5, which was suggested by @danas.zuokas, incorporating the observation by @Khashaa about types. I set the type with as.numeric in the dateTime generation function, which drastically speeds up rank() because it then operates on integers or doubles instead of dateTime objects (it increases the speed of the other functions too, but not as drastically). With some testing, setting ties.method='first' yields results consistent with the intent. data.table::frank() is faster than both base::rank() and IRanges::rank(). bit64::rank() is fastest of all, but it seems to handle ties differently than data.table::frank() and I can't get it to handle them as desired. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank() along the way. The specific reasons why are beyond the scope of this question.
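
As a small illustration (a toy vector, not data from the benchmark above) of why ties.method='first' matters here, it breaks ties by position rather than averaging the tied ranks:

x <- c(3, 1, 3, 2);
data.table::frank(x, ties.method="first");     # 3 1 4 2: the first 3 outranks the second
data.table::frank(x, ties.method="average");   # 3.5 1.0 3.5 2.0: tied values share a rank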

POST END NOTE: It turns out that data.table::frank() handles POSIXct dateTimes efficiently, whereas neither base::rank() nor IRanges::rank() seems to. As such, the as.numeric (or as.integer) type setting isn't even necessary with data.table::frank(), there is no loss of precision from the conversion, and there are fewer ties.method discrepancies. Thank you to everyone who contributed! I learned a lot! Much appreciated! :) Credit will be included in my source code.
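
A minimal way to convince yourself of that claim, assuming the cases_table generated above: because POSIXct is stored as a double under the hood, ranking the raw column and ranking its numeric conversion should agree exactly.

x <- cases_table$created;
identical(data.table::frank(x, ties.method="first"),
          data.table::frank(as.numeric(x), ties.method="first"));   # expected TRUE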

ENDNOTE: This question is a refined and clarified version, with easier-to-use and more readable example code, of "More efficient method for counting open cases as of creation time of each case". I've separated it out here so as not to overwhelm the original post with too many edits, and to simplify the creation of large numbers of dateTime pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!

Answer

This answer has been updated with regard to a comment by the author of the question.

I would suggest a solution using ranks. The tables are created as in the follow-up to this question, or using the dateTime pair generation function from the present question; both should work.

n <- cases_table[, .N]

# For every case compute the number of other cases
# with `created` less than `created` of that case
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
           ties.method = 'first')[1:n]

# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
           ties.method = 'first')[1:n]

Taking the difference r1 - r2 gives the result, eliminating the ranks contributed by the created values themselves (the -1 correction is not required with ties.method='first'). In terms of efficiency, it requires only ranking vectors whose length is twice the number of rows in cases_table. data.table::frank() handles POSIXct dateTime objects as quickly as numeric objects (unlike base::rank), so no type conversion is required.
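
A quick sanity check of the rank trick against the brute-force definition, on a hypothetical four-case table (toy data, not from the generator above):

library(data.table);

tiny <- data.table(id       = 1:4,
                   created  = as.POSIXct(c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01"), tz="UTC"),
                   censored = as.POSIXct(c("2005-01-01", "2002-06-01", "2003-06-01", "2006-01-01"), tz="UTC"));

n  <- tiny[, .N];
r1 <- data.table::frank(c(tiny[, created], tiny[, created]),  ties.method='first')[1:n];
r2 <- data.table::frank(c(tiny[, created], tiny[, censored]), ties.method='first')[1:n];

# Brute-force count of open cases at each creation time, as defined in the question
brute <- sapply(1:n, function(i)
    sum(tiny$created < tiny$created[i] & tiny$censored > tiny$created[i]));

identical(as.integer(r1 - r2), as.integer(brute));   # expected TRUE (both are 0 1 1 1)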
