Aggregate variables in a list of data frames into a single data frame
Question
I am performing a per-policy life insurance valuation in R. Monthly cash flow projections are run per policy, and each projection returns a data frame in the following format (for example):
Policy1 = data.frame(ProjM = 1:200,
Cashflow1 = rep(5,200),
Cashflow2 = rep(10,200))
My model returns a list (using lapply and a function that performs the per-policy cash flow projection, based on various per-policy details, escalation assumptions and life contingencies). I want to aggregate the cash flows across all policies by ProjM. The following code does what I want, but I am looking for a more memory-efficient way (i.e., one that does not use the rbindlist function). Example data:
Policy1 = data.frame(ProjM = 1:5,
Cashflow1 = rep(5,5),
Cashflow2 = rep(10,5))
Policy2 = data.frame(ProjM = 1:3,
Cashflow1 = rep(50,3),
Cashflow2 = rep(-45,3))
# in the real model, this list contains 35000 data frames:
ListOfDataFrames = list(Policy1 = Policy1, Policy2 = Policy2)
My code:
library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
MyOutput <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
Desired output:
ProjM Cashflow1 Cashflow2
1 55 -35
2 55 -35
3 55 -35
4 5 10
5 5 10
I have looked for examples. The question "R aggregate list of dataframe" performs the aggregation within each data frame, but does not combine the results into one data frame.
Recommended answer
With data.table syntax, the one-step approach is to create the big data.table first and then do the aggregation:
library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
OneBigDataFrame[, lapply(.SD, sum), by = ProjM]
or, more concisely:
rbindlist(ListOfDataFrames)[, lapply(.SD, sum), by = ProjM]
ProjM Cashflow1 Cashflow2
1: 1 55 -35
2: 2 55 -35
3: 3 55 -35
4: 4 5 10
5: 5 5 10
Now, the OP has asked to avoid creating the big data.table first in order to save memory. This requires a two-step approach: the aggregates are computed for each data.table separately, and those per-table results are then aggregated to a grand total in the final step:
rbindlist(
lapply(ListOfDataFrames,
function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM]
ProjM Cashflow1 Cashflow2
1: 1 55 -35
2: 2 55 -35
3: 3 55 -35
4: 4 5 10
5: 5 5 10
Note that setDT() is used here to coerce the data.frames to data.table by reference, i.e., without creating an additional copy, which saves time and memory.
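The difference between converting by reference and converting to a new object can be seen in a small sketch. The data frames df and df2 below are made up for illustration only and are not part of the question's data:

```r
library(data.table)

# Illustrative data only (not from the question)
df  <- data.frame(ProjM = 1:3, Cashflow1 = c(5, 5, 5))
df2 <- data.frame(ProjM = 1:3, Cashflow1 = c(5, 5, 5))

# setDT() converts df itself; no assignment is needed
setDT(df)
is.data.table(df)       # TRUE

# as.data.table() returns a new data.table and leaves df2 untouched
dt_copy <- as.data.table(df2)
is.data.table(df2)      # FALSE
is.data.table(dt_copy)  # TRUE
```

Because setDT() changes its argument in place, any code that still expects a plain data.frame afterwards will see a data.table instead.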
Benchmark of all answers provided so far, using the benchmark data of d.b (a list of 10000 data.frames with 100 rows each, 28.5 Mb in total):
mb <- microbenchmark::microbenchmark(
malan = {
OneBigDataFrame <- rbindlist(test)
malan <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
},
d.b = d.b <- with(data = data.frame(do.call(dplyr::bind_rows, test)),
expr = aggregate(x = list(Cashflow1 = Cashflow1, Cashflow2 = Cashflow2),
by = list(ProjM = ProjM),
FUN = sum)),
a.gore = {
newagg <- function(dataset) {
dataset <- data.table(dataset)
dataset <- dataset[,lapply(.SD,sum),by=ProjM,.SDcols=c("Cashflow1","Cashflow2")]
return(dataset)
}
a.gore <- newagg(rbindlist(lapply(test,newagg)))
},
dt1 = dt1 <- rbindlist(test)[, lapply(.SD, sum), by = ProjM],
dt2 = dt2 <- rbindlist(
lapply(test,
function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
)[, lapply(.SD, sum), by = ProjM],
times = 5L
)
mb
Unit: milliseconds
expr min lq mean median uq max neval cld
malan 565.43967 583.08300 631.15898 600.45790 605.60237 801.2120 5 b
d.b 707.50261 710.31127 719.25591 713.54526 721.26691 743.6535 5 b
a.gore 14706.40442 14747.76305 14861.61641 14778.88547 14805.29412 15269.7350 5 d
dt1 40.10061 40.92474 42.27034 41.55434 42.07951 46.6925 5 a
dt2 8806.85039 8846.47519 9144.00399 9295.29432 9319.17251 9452.2275 5 c
The fastest solution is the one-step approach using data.table, which is about 15 times faster than the second fastest. Surprisingly, the two-step data.table approaches are orders of magnitude slower than the one-step approach.
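The benchmark list test comes from d.b's answer and is not reproduced here. As a rough stand-in, a list of the same shape could be generated as sketched below. The random values, the seed and the reduced number of policies are assumptions for illustration, so timings on this data will not match the figures above:

```r
# Hypothetical stand-in for d.b's benchmark data (not the original data)
set.seed(42)
n_policies <- 100   # the benchmark above used 10000
n_months   <- 100   # rows per data frame, as in the benchmark

test <- lapply(seq_len(n_policies), function(i) {
  data.frame(ProjM     = seq_len(n_months),
             Cashflow1 = runif(n_months),
             Cashflow2 = runif(n_months))
})

# Approximate size of the list in memory
format(object.size(test), units = "Mb")
```

Setting n_policies to 10000 gives a list with the same number of data frames and rows as the one described above.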
To make sure that all solutions return the same result, this can be checked using
all.equal(malan, d.b)
all.equal(malan, as.data.frame(a.gore))
all.equal(malan, as.data.frame(dt1))
all.equal(malan, as.data.frame(dt2))
which return TRUE in all cases.
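If memory rather than speed is the main constraint, one further option is to skip binding altogether and add each policy's cash flows into a running total. This is not one of the benchmarked answers, only a minimal base R sketch. It assumes that ProjM holds positive integers that are unique within each policy, and that the cash flow columns are named as in the example data:

```r
# Example data from the question
Policy1 <- data.frame(ProjM = 1:5, Cashflow1 = rep(5, 5), Cashflow2 = rep(10, 5))
Policy2 <- data.frame(ProjM = 1:3, Cashflow1 = rep(50, 3), Cashflow2 = rep(-45, 3))
ListOfDataFrames <- list(Policy1 = Policy1, Policy2 = Policy2)

cash_cols <- c("Cashflow1", "Cashflow2")

# Longest projection across all policies
max_m <- max(sapply(ListOfDataFrames, function(d) max(d$ProjM)))

# Running totals: one row per projection month, one column per cash flow
totals <- matrix(0, nrow = max_m, ncol = length(cash_cols),
                 dimnames = list(NULL, cash_cols))

# Add each policy into the rows given by its ProjM values
# (assumes ProjM is a positive integer, unique within a policy)
for (d in ListOfDataFrames) {
  idx <- d$ProjM
  totals[idx, ] <- totals[idx, , drop = FALSE] + as.matrix(d[, cash_cols])
}

MyOutput <- data.frame(ProjM = seq_len(max_m), totals)
MyOutput
```

Only the totals matrix and one policy at a time are touched, so the combined table is never built. Whether this beats the one-step data.table approach in practice would need to be measured on the real data.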