Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table


Question


I'm trying to enrich one dataset (adherence) based on subsets from another (lsr). For each individual row in adherence, I want to calculate (as a third column) the medication available for implementing the prescribed regimen. I have a function that returns the relevant result, but it runs for days on just a subset of the total data I have to run it on.

The datasets are:

library(dplyr)
library(tidyr)
library(lubridate)
library(data.table)

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01"))
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- ymd(adherence$year)

lsr <- cbind.data.frame(
  c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
  c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05"), #eksd
  c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.Date((lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD
lsr <- as.data.table(lsr)

adherence <- as.data.table(adherence)

I'm used to working with dplyr, but it was much slower, so I rewrote things for data.table to try it out. It drives me crazy that my colleagues working with SAS claim that this wouldn't take long for them, when it takes me hours just to load the data itself into RAM (fread crashes R on several of my datasets). adherence is 1.5 million rows, and lsr is a few hundred million rows.

My working function is

function.AH <- function(x) {
  lsr[ID == x[1] & eksd <= x[2] & ENDDATE > x[2], ifelse(.N == 0, 0, sum(as.numeric(ENDDATE - as.Date(x[2]))))]
}
setkey(lsr, ID, eksd, ENDDATE)
adherence$AH <- apply(adherence, 1, FUN = function.AH) # DESIRED OUTPUT

I don't know the best approach: I've looked into using a SQL database, but as I understand it this shouldn't be faster when my data fits into RAM (I have 256 GB). Since the adherence data.table is actually each individual ID repeated for 500 time periods (i.e. ID 1: at time 1, time 2, time 3 ... time 500, ID 2: at time 1, time 2 ... etc.), I also considered using the by argument on ID on lsr and somehow embedding this time interval (1:500) in the function in j.

I hope that someone can point out how I'm using the apply function inefficiently by not somehow applying it inside the data.table framework, and thus losing the built-in efficiency. But as I'm going to be working with this data and data of similar sizes, I'd appreciate any specific suggestions for solving this faster, or general suggestions for getting faster running times using other methods.

Solution

This can be solved by updating in a non-equi join.

This avoids the memory issues caused by a Cartesian join, or by calling apply(), which coerces the data.frame or data.table to a matrix and thereby copies the data.
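The coercion performed by apply() can be seen directly: row-wise apply() on a data.frame with mixed column types first converts everything to a character matrix, copying all the data and discarding column types. A minimal sketch (not the OP's data, just two rows shaped like adherence):

```r
# a small data.frame with mixed types, similar in shape to adherence
df <- data.frame(ID = c(1L, 2L),
                 year = as.Date(c("2013-01-01", "2013-02-01")))

# apply() runs as.matrix() on its input first; with mixed types
# every value is converted to character, copying all the data
m <- apply(df, 1, identity)

typeof(m)        # "character" -- the IDs and dates are now strings
class(df$year)   # "Date" -- the original type is invisible inside apply()
```

On a table with hundreds of millions of rows, this hidden copy alone can exhaust memory before any real work is done.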

In addition, the OP has mentioned that lsr has a few hundred million rows and adherence has 1.5 million rows (500 time periods times 3000 IDs). Therefore, efficient storage of data items will not only reduce the memory footprint but may also reduce the share of processing time required for loading the data.
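The storage gain from IDate can be checked directly: Date is backed by double (8 bytes per value) while data.table's IDate is backed by integer (4 bytes). A quick sketch, assuming only base R and data.table:

```r
library(data.table)

d  <- as.Date("2013-01-01")   # double-backed, 8 bytes per element
id <- as.IDate("2013-01-01")  # integer-backed, 4 bytes per element

typeof(d)   # "double"
typeof(id)  # "integer"

# on a vector of 1e6 dates the payload is roughly 8 MB vs 4 MB
object.size(rep(d, 1e6))
object.size(rep(id, 1e6))
```

Halving the width of every date column matters when the table has a few hundred million rows.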

library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE), 
                      AH := as.integer(ENDDATE - x.year)][]

   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63

Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].
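Putting that initialisation together with the non-equi join gives zeros instead of NA for the unmatched rows. A self-contained sketch using the sample data from the Data section below:

```r
library(data.table)

adherence <- data.table(
  ID = c("1", "2", "3", "1", "2", "3"),
  year = as.IDate(c("2013-01-01", "2013-01-01", "2013-01-01",
                    "2013-02-01", "2013-02-01", "2013-02-01")))

lsr <- data.table(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.IDate(c("2012-03-01", "2012-08-02", "2013-01-06", "2012-08-25",
                    "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = c(60L, 90L, 90L, 60L, 120L, 60L, 30L, 90L))
lsr[, ENDDATE := as.IDate(eksd + DDD)]

# initialise first, then update only the matching rows in the non-equi join
adherence[, AH := 0L]
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)]

adherence$AH   # 0 0 0 64 0 63
```

The x. prefix disambiguates adherence's year column from the join range; rows with no matching prescription interval simply keep their initial 0L.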

Data

The code to create the sample datasets can be streamlined:

adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"), 
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
  stringsAsFactors = FALSE)

lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD

Note that DDD is of type integer which usually requires 4 bytes instead of 8 bytes for type numeric/double.

Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax which updates by reference.

library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]
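That the := update really happens in place can be checked with data.table::address(), which reports the memory address of an object. A small sketch on a cut-down version of lsr (two rows, same columns):

```r
library(data.table)

lsr <- data.frame(
  ID = c("1", "2"),
  eksd = as.Date(c("2012-03-01", "2012-08-02")),
  DDD = c(60L, 90L))

setDT(lsr)                      # convert in place, no copy
before <- address(lsr)          # memory address before the update
lsr[, ENDDATE := eksd + DDD]    # add the column by reference
after <- address(lsr)

identical(before, after)  # TRUE -- no copy of lsr was made
```

By contrast, the base-R assignment lsr$ENDDATE <- lsr$eksd + lsr$DDD may allocate a new object, which is exactly the copy the answer warns about.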
