Aggregate variables in list of data frames into single data frame

Problem description

I am performing a per-policy life insurance valuation in R. Monthly cash flow projections are performed per policy, and each projection returns a data frame in the following format (for example):

Policy1 = data.frame(ProjM = 1:200,
                     Cashflow1 = rep(5,200),
                     Cashflow2 = rep(10,200))

My model returns a list (using lapply and a function which performs the per-policy cash flow projection, based on various per-policy details, escalation assumptions and life contingencies). I want to aggregate the cash flows across all policies by ProjM. The following code does what I want, but I am looking for a more memory-efficient way (i.e. one that does not use the rbindlist function). Example data:

Policy1 = data.frame(ProjM = 1:5,
                     Cashflow1 = rep(5,5),
                     Cashflow2 = rep(10,5))

Policy2 = data.frame(ProjM = 1:3,
                     Cashflow1 = rep(50,3),
                     Cashflow2 = rep(-45,3))

# this is the output containing 35000 data frames:
ListOfDataFrames = list(Policy1 = Policy1, Policy2 = Policy2)

My code:

library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
MyOutput <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)

Output required:

ProjM Cashflow1 Cashflow2
 1        55       -35
 2        55       -35
 3        55       -35
 4         5        10
 5         5        10

I have looked for examples: "R aggregate list of dataframe" performs the aggregation for each data frame, but does not combine them into one data frame.

Solution

With data.table syntax, the one-step approach is to create the big data.table first and then do the aggregation:

library(data.table)
OneBigDataFrame <- rbindlist(ListOfDataFrames)
OneBigDataFrame[, lapply(.SD, sum), by = ProjM]

or, more concisely:

rbindlist(ListOfDataFrames)[, lapply(.SD, sum), by = ProjM]

   ProjM Cashflow1 Cashflow2
1:     1        55       -35
2:     2        55       -35
3:     3        55       -35
4:     4         5        10
5:     5         5        10

Now, the OP has asked to avoid creating the big data.table first in order to save memory. This requires a two-step approach: the aggregates are computed for each data.table separately, and these partial results are then aggregated to a grand total in the final step:

rbindlist(
  lapply(ListOfDataFrames, 
         function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
  )[, lapply(.SD, sum), by = ProjM]

   ProjM Cashflow1 Cashflow2
1:     1        55       -35
2:     2        55       -35
3:     3        55       -35
4:     4         5        10
5:     5         5        10

Note that setDT() is used here to coerce the data.frames to data.tables by reference, i.e., without creating an additional copy, which saves time and memory.
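
To illustrate what "by reference" means, here is a minimal sketch on a small made-up data frame (the names policy and policy_copy are illustrative only). It contrasts setDT() with as.data.table(), which returns a new object and leaves its input a data.frame:

```r
library(data.table)

# Small illustrative data frame (not part of the original example data).
policy <- data.frame(ProjM = 1:3, Cashflow1 = rep(5, 3), Cashflow2 = rep(10, 3))

# as.data.table() returns a new data.table; the input is left unchanged.
policy_copy <- as.data.table(policy)
class(policy)  # still a plain data.frame

# setDT() converts the existing object in place; no assignment is needed.
setDT(policy)
class(policy)  # now includes "data.table"
```

Because setDT() modifies its argument, the data frames passed to it are data.tables afterwards, which is worth keeping in mind if later code expects plain data.frame behaviour.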

Benchmark

Benchmarking all answers provided so far on d.b's benchmark data (a list of 10000 data.frames with 100 rows each, 28.5 Mb in total):

mb <- microbenchmark::microbenchmark(
  malan = {
    OneBigDataFrame <- rbindlist(test)
    malan <- aggregate(. ~ ProjM, data = OneBigDataFrame, FUN = sum)
  },
  d.b = d.b <- with(data = data.frame(do.call(dplyr::bind_rows, test)),
             expr = aggregate(x = list(Cashflow1 = Cashflow1, Cashflow2 = Cashflow2),
                              by = list(ProjM = ProjM),
                              FUN = sum)),
  a.gore = {
    newagg <- function(dataset) { 
      dataset <- data.table(dataset)
      dataset <- dataset[,lapply(.SD,sum),by=ProjM,.SDcols=c("Cashflow1","Cashflow2")]
      return(dataset)
    }
    a.gore <- newagg(rbindlist(lapply(test,newagg)))
  },
  dt1 = dt1 <- rbindlist(test)[, lapply(.SD, sum), by = ProjM],
  dt2 = dt2 <- rbindlist(
    lapply(test, 
           function(x) setDT(x)[, lapply(.SD, sum), by = ProjM])
  )[, lapply(.SD, sum), by = ProjM],
  times = 5L
)
mb

Unit: milliseconds
   expr         min          lq        mean      median          uq        max neval  cld
  malan   565.43967   583.08300   631.15898   600.45790   605.60237   801.2120     5  b  
    d.b   707.50261   710.31127   719.25591   713.54526   721.26691   743.6535     5  b  
 a.gore 14706.40442 14747.76305 14861.61641 14778.88547 14805.29412 15269.7350     5    d
    dt1    40.10061    40.92474    42.27034    41.55434    42.07951    46.6925     5 a   
    dt2  8806.85039  8846.47519  9144.00399  9295.29432  9319.17251  9452.2275     5   c
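
The relative speeds can be read off the median column. As a small worked calculation, with the medians copied from the table above (the variable names are illustrative only):

```r
# Median timings in milliseconds, copied from the benchmark table.
median_ms <- c(malan = 600.45790, d.b = 713.54526, a.gore = 14778.88547,
               dt1 = 41.55434, dt2 = 9295.29432)

# How many times slower each approach is than the one-step data.table call.
slowdown <- median_ms / median_ms[["dt1"]]
round(slowdown, 1)
```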

The fastest solution is the one-step approach using data.table, which is about 15 times faster than the second fastest. Surprisingly, the two-step data.table approaches (a.gore and dt2) are more than two orders of magnitude slower than the one-step approach.
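
If memory really is the constraint, another way to avoid building one big table is to add each policy into a small running total. The following base R sketch is not one of the original answers and was not part of the benchmark; aggregate_by_month is a hypothetical helper, and it assumes that ProjM holds positive integer month indices and that the list elements are data frames with the named cash flow columns:

```r
# Hypothetical helper (not from the original answers): sum cash flows by ProjM
# by accumulating into one small matrix instead of binding all policies.
aggregate_by_month <- function(policies,
                               value_cols = c("Cashflow1", "Cashflow2")) {
  # Assumption: ProjM contains positive integer month indices (1, 2, 3, ...).
  max_month <- max(vapply(policies,
                          function(p) as.numeric(max(p$ProjM)),
                          numeric(1)))
  totals <- matrix(0, nrow = max_month, ncol = length(value_cols),
                   dimnames = list(NULL, value_cols))
  for (p in policies) {
    # rowsum() sums rows sharing a ProjM value within this one policy;
    # its row names are the ProjM values.
    part <- rowsum(as.matrix(as.data.frame(p)[value_cols]), group = p$ProjM)
    idx <- as.integer(rownames(part))
    totals[idx, ] <- totals[idx, , drop = FALSE] + part
  }
  data.frame(ProjM = seq_len(max_month), totals)
}

# Usage with the example data from the question.
Policy1 <- data.frame(ProjM = 1:5, Cashflow1 = rep(5, 5), Cashflow2 = rep(10, 5))
Policy2 <- data.frame(ProjM = 1:3, Cashflow1 = rep(50, 3), Cashflow2 = rep(-45, 3))
result <- aggregate_by_month(list(Policy1 = Policy1, Policy2 = Policy2))
result
```

The extra memory is one matrix with as many rows as the longest projection, regardless of the number of policies. Unlike aggregate(), this sketch returns a row of zeros for any month that appears in no policy, and its speed relative to the solutions above is untested.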

To make sure that all solutions return the same result, this can be checked using

all.equal(malan, d.b)
all.equal(malan, as.data.frame(a.gore))
all.equal(malan, as.data.frame(dt1))
all.equal(malan, as.data.frame(dt2))

which returns TRUE in all cases.
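
The test list used in the benchmark is not shown here. To rerun the comparison, a stand-in of the same shape (10000 data frames with 100 rows each) could be generated as in the sketch below. The random values, the seed and the Policy names are assumptions, not d.b's actual data, so timings and object size will differ from those reported above:

```r
set.seed(42)          # arbitrary seed, only for reproducibility
n_policies <- 10000L  # number of data frames, as stated for the benchmark
n_months   <- 100L    # rows per data frame, as stated for the benchmark

# Assumed structure: same columns as the example policies, random cash flows.
test <- lapply(seq_len(n_policies), function(i) {
  data.frame(ProjM = seq_len(n_months),
             Cashflow1 = rnorm(n_months),
             Cashflow2 = rnorm(n_months))
})
names(test) <- paste0("Policy", seq_len(n_policies))

# Report the approximate memory footprint of the list.
print(object.size(test), units = "MB")
```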
