根据键在数据框中汇总值 [英] Aggregating values in a data frame based on key

查看:39
本文介绍了根据键在数据框中汇总值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一段聚合代码,可以很好地运行,但是对具有10e6行的数据帧运行速度有点慢.我对R不那么有经验,所以我为我值得的代码深表歉意!

I've got a piece of aggregation code that works well enough but runs a bit slow against a data frame with 10e6 rows. I'm not that experienced in R so apologies for my cringe worthy code!

我只想对公共密钥进行基本汇总和取值之和...

I just want to do a basic roll up and sum of values for a common key...

例如去...

  key val
1   a   5
2   b   7
3   a   6

到...

  key val
1   a   11
2   b   7

我能管理的最好的是...

the best i can manage is...

keys = unique(inp$key)
vals = sapply(keys, function(x) { sum(inp[inp$key==x,]$val) })
out = data.frame(key=keys, val=vals)

我有种直觉,认为 inp [inp $ key == x,] 不是最好的方法.我缺少明显的速度吗?我可以在Hadoop中做到这一点(因为10e6数据集实际上已经是2e9行数据集的汇总),但是我正在尝试提高R.

I have this gut feel that the inp[inp$key==x,] is not the best way. Is there an obvious speed up i'm missing? I can do it in Hadoop (since the 10e6 dataset is actually already a rollup from a 2e9 row dataset) but I'm trying to improve my R.

干杯,垫子

推荐答案

使用 tapply 的另一个选项:

dat <- data.frame(key = c('a', 'b', 'a'), val = c(5,7,6))

> with(dat, tapply(val, key, FUN = sum))
 a  b 
11  7

我的测试表明这是进行此特定运动最快的方法,显然您的里程可能会有所不同:

My tests indicate this is the fastest method for this particular exercise, obviously your miles may vary:

fn.tapply <- function(daters) with(daters, tapply(val, key, FUN = sum))
fn.aggregate <- function(daters) aggregate(val~key, sum, data = daters)
fn.ddply <- function(daters) ddply(daters, .(key), summarize, val = sum(val))


library(rbenchmark)

benchmark(fn.tapply(dat), fn.aggregate(dat), fn.ddply(dat)
          , columns = c("test", "elapsed", "relative")
          , order = "relative"
          , replications = 100
          )


               test elapsed  relative
1    fn.tapply(dat)    0.03  1.000000
2 fn.aggregate(dat)    0.20  6.666667
3     fn.ddply(dat)    0.30 10.000000

请注意,将真正的苹果与前两个苹果相比,将 tapply 解决方案转换为data.frame可以将这种差异减少约40%.

Note that converting the tapply solution into a data.frame cut this difference in half by ~40% for a true apples to apples comparison to the first two.

使用注释中指示的1M行数据集似乎确实会有所改变:

Using a 1M row dataset as indicated in the comments does seem to change things a bit:

 dat2 <- data.frame(key = rep(letters[1:5], each = 200000), val = runif(1e6))
> benchmark(fn.tapply(dat2), fn.aggregate(dat2), fn.ddply(dat2)
+           , columns = c("test", "elapsed", "relative")
+           , order = "relative"
+           , replications = 100
+           )
               test elapsed relative
1   fn.tapply(dat2)  39.114 1.000000
3     fn.ddply(dat2)  62.178 1.589661
2 fn.aggregate(dat2) 157.463 4.025745

这篇关于根据键在数据框中汇总值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆