Data.table:如何获得它承诺的惊人的快速子集并将其应用于第二个data.table [英] Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table

查看:50
本文介绍了Data.table:如何获得它承诺的惊人的快速子集并将其应用于第二个data.table的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试根据另一个数据集(lsr)的子集来丰富一个数据集(依从性).对于依从性的每一行,我想计算(作为第三列)可用于实施处方方案的药物.我有一个返回相关结果的函数,但它只在我必须对其运行的全部数据的一部分上运行了几天.

I'm trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medication available for implementing the prescribed regimen. I have a function that returns the relevant result, but it runs for days on just a subset of the total data I have to run it on.

数据集是:

 library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

我曾经与dplyr一起工作,但是速度慢得多,我将其重写为data.table进行尝试.我与SAS一起工作的同事声称这对于他们来说不会花很长时间,这使我发疯,因为我只花了几个小时就将数据本身加载到RAM中. (fread在我的一些数据集上使R崩溃). 坚持是1,5 mio行,而lsr是几百mio.行.

I'm used to working with dplyr, but it was much slower and I rewrote things for data.table to try it out. It is driving me crazy that my colleagues working with SAS claims that this wouldn't take long for them, when it takes me hours just to load the data itself into RAM. (fread crashes R on several of my datasets). Adherence is 1,5 mio rows, and lsr is a few hundred mio. rows.

我的工作职能是

function.AH <- function(x) {
  lsr[ID == x[1] & eksd <= x[2] & ENDDATE > x[2], ifelse(.N == 0, 0, sum(as.numeric(ENDDATE - as.Date(x[2]))))]
}
setkey(lsr, ID, eksd, ENDDATE)
adherence$AH <-apply (adherence, 1,  FUN = function.AH) #DESIRED OUTPUT

我不知道最好的方法:我已经研究过使用SQL数据库,但是据我了解,当我的数据适合RAM(我有256GB)时,这应该不会更快.由于遵守data.table实际上是每个单独的ID重复500个时间段(即ID 1:在时间1,时间2,时间3 ...时间500,ID 2:在时间1,时间2 ...等)我还考虑过在lsr的ID上使用by函数,以及在j中的函数中如何嵌入此时间间隔(1:500).

I don't know the best approach: I've looked into using a SQL database, but as I understand it this shouldn't be faster when my data fits into RAM (I have 256GB). Since the adherence data.table is actually each individual ID repeated for 500 timeperiods (i.e. ID 1: at time 1, time 2, time 3...time 500, ID 2: at time 1, time 2... etc.)I also considered using the by function on ID on lsr and some how imbedding this time interval (1:500) in the function in j.

我希望有人能指出我如何有效地使用apply函数,而不是以某种方式将其应用到data.table-framework中,从而失去了构建效率.但是,由于我将要使用这些数据和类似大小的数据,因此,我希望您能对解决此问题的任何具体建议或使用其他方法获得更快运行时间的一般建议感到满意.

I hope that some-one can point out how I'm using the apply function inefficiently by not somehow applying it inside the data.table-framework and thus loosing the build in efficiency. But as I'm going to be working with this data and similar sizes of data, I'd appreciate any specific suggestions for solving this faster or general suggestions for getting faster running times using other methods.

推荐答案

这可以通过通过非等额联接进行更新来解决.

这避免了由笛卡尔联接或调用apply()导致的内存问题,该调用将data.frame或data.table强制转换为涉及复制数据的矩阵.

This avoids the memory issues caused by a cartesian join or by calling apply() which coerces a data.frame or data.table to a matrix which involves copying the data.

此外,OP还提到lsr几百个mio.行adherence具有1.5个mio行(500个时间段乘以3000个ID的时间).因此,有效地存储数据项不仅会减少内存占用,而且会减少加载数据所需的处理时间份额.

In addition, the OP has mentioned that lsr has a few hundred mio. rows and adherence has 1.5 mio rows (500 timeperiods times 3000 ID's). Therefore, efficient storage of data items will not only reduce the memory footprint but may also reduce the share of processing time which is required for loading data.

library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE), 
                      AH := as.integer(ENDDATE - x.year)][]

   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63

请注意,NA表示未找到匹配项.如果需要,可以在adherence[, AH := 0L]进行非等额联接之前初始化AH列.

Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].

可以简化用于创建样本数据集的代码:

The code to create the sample datasets can be streamlined:

adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"), 
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
  stringsAsFactors = FALSE)

lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD

请注意,DDD是整数类型,对于数字/双精度类型,通常需要4个字节而不是8个字节.

Note that DDD is of type integer which usually requires 4 bytes instead of 8 bytes for type numeric/double.

还请注意,最后一条语句 可能导致整个数据对象lsr被复制.可以通过使用data.table语法(通过引用更新)来避免这种情况.

Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax which updates by reference.

library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]

这篇关于Data.table:如何获得它承诺的惊人的快速子集并将其应用于第二个data.table的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆