Aggregating duplicate rows by taking sum


Problem Description



Following on from my questions:
1. Identifying whether a set of variables uniquely identifies each row of the data or not;
2. Tagging all rows that are duplicates in terms of a given set of variables,
I would now like to aggregate/consolidate all the duplicate rows in terms of a given set of variables, by taking their sum.
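
In other words, every set of rows sharing the same values of the index variables should collapse into a single row whose data value is the group sum. As a sketch of the intended semantics only (not a performance contender), base R's aggregate() expresses the same operation on a toy data.frame:

# toy data: rows 1 and 2 are duplicates in terms of (f1, f2)
toy = data.frame(f1 = c(1, 1, 2), f2 = c('a', 'a', 'b'), data = c(0.5, 1.5, 2.0))
aggregate(data ~ f1 + f2, data = toy, FUN = sum)
# one row per distinct (f1, f2) pair, with data summed within each group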

Solution 1:

There is some guidance on how to do this here, but when the variables that form the index have a large number of levels, the ddply method recommended there is slow, just as it was when I was trying to tag all the duplicates in terms of a given set of variables.

# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))

# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])

# dummy data 
dfDup$data = rnorm(nrow(dfDup))

# aggregate the duplicate rows by taking the sum
library(plyr)
dfDupAgg = ddply(dfDup, .(f1, f2, f3, f4), summarise, data = sum(data))

Solution 2:

The second solution is to use data.table; following the advice here, I could do:

# data.table solution
library(data.table)
indexVars = paste0('f', 1:4)  # "f1" "f2" "f3" "f4"; paste0() needs no sep argument
dtDup = data.table(dfDup, key = indexVars)
dtDupAgg = dtDup[, list(data = sum(data)), by = key(dtDup)]

I have a couple of questions:
1. Is there a way to make the ddply version faster?
2. Is the data.table solution correct? I want to check, since I am new to data.table.

Answer

Regarding your data.table solution, you don't need to set the key for aggregation operations. You can do it directly:

indexVars = paste0('f', 1:4)
dtDup <- as.data.table(dfDup) ## faster than data.table(.)
dtDupAgg = dtDup[, list(data = sum(data)), by = c(indexVars)]

data.table version 1.9.2+ also implements a function setDT that converts a data.frame to a data.table by reference (meaning no copy is made, so the conversion takes essentially no time, which is especially useful on large data.frames).

So, instead of doing:

dtDup <- as.data.table(dfDup)
dtDup[...]

You could do:

## data.table v1.9.2+
setDT(dfDup) ## faster than as.data.table(.)
dfDup[...]   ## dfDup is now a data.table, converted by reference
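
As an aside, a quick way to see that setDT() really converts by reference is to compare the object's memory address before and after, using data.table's address() helper (a minimal sketch):

library(data.table)
df = data.frame(x = 1:5)
before = address(df)            # memory address of the underlying object
setDT(df)                       # convert to data.table in place
identical(before, address(df))  # TRUE: same object, no copy was made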


On your first question, plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info.

Perhaps you may be interested in dplyr, which is orders of magnitude faster than plyr but still slower than data.table, IMHO. Here's the equivalent dplyr version:

library(dplyr)
## note: %.% was dplyr's pipe operator at the time; later versions use %>%
dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))


Here's a benchmark between data.table and dplyr on the data (all timings are the minimum of three consecutive runs):

## data.table v1.9.2+
system.time(ans1 <- dtDup[, list(data=sum(data)), by=c(indexVars)])
#  user  system elapsed 
# 0.049   0.009   0.057 

## dplyr (commit ~1360 from github)
system.time(ans2 <- dfDup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
#  user  system elapsed 
# 0.374   0.013   0.389 
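
In case you want to reproduce the protocol, here's a sketch of taking the minimum of three consecutive runs (min_elapsed is just an illustrative helper, not from any package):

## time a zero-argument function `reps` times and keep the fastest elapsed time
min_elapsed = function(f, reps = 3L) {
  min(vapply(seq_len(reps),
             function(i) system.time(f())[['elapsed']],
             numeric(1)))
}
min_elapsed(function() dtDup[, list(data = sum(data)), by = c(indexVars)])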

I really don't have the patience to run the plyr version (I stopped it after 93 seconds on the first run). As you can see, dplyr is much faster than plyr, but still ~7x slower than data.table here.


Check that the results are equal, to be sure:

all.equal(as.data.frame(ans1[order(f1,f2,f3,f4)]), 
          as.data.frame(ans2))
# [1] TRUE

HTH
