Sum by distinct column value in R

Question

I have a very large data frame in R and would like to sum two columns for every distinct value of another column. For example, say we have a data frame of transactions in various shops over a day, as follows:

shop <- data.frame('shop_id' = c(1, 1, 1, 2, 3, 3), 
  'shop_name' = c('Shop A', 'Shop A', 'Shop A', 'Shop B', 'Shop C', 'Shop C'), 
  'city' = c('London', 'London', 'London', 'Cardiff', 'Dublin', 'Dublin'), 
  'sale' = c(12, 5, 9, 15, 10, 18), 
  'profit' = c(3, 1, 3, 6, 5, 9))

which is:

shop_id  shop_name    city      sale profit
   1     Shop A       London    12   3
   1     Shop A       London    5    1
   1     Shop A       London    9    3
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    10   5
   3     Shop C       Dublin    18   9

I want to sum the sale and profit for each shop to give:

shop_id  shop_name    city      sale profit
   1     Shop A       London    26   7
   2     Shop B       Cardiff   15   6
   3     Shop C       Dublin    28   14

I am currently using the following code to do this:

 library(plyr)  # provides ddply()
 shop_day <- ddply(shop, "shop_id", transform, sale=sum(sale), profit=sum(profit))
 shop_day <- subset(shop_day, !duplicated(shop_id))

This works absolutely fine, but as I said, my data frame is large (140,000 rows, 37 columns, and nearly 100,000 unique rows that I want to sum), so the code takes ages to run and eventually reports that it has run out of memory.

Does anyone know the most efficient way to do this?

Thanks in advance!

Solution

**Obligatory data.table answer**

> library(data.table)
data.table 1.8.0  For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
     shop_id sale profit
[1,]       1   26      7
[2,]       2   15      6
[3,]       3   28     14
> 
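The result above keeps only `shop_id`, while the desired output in the question also shows `shop_name` and `city`. One way to carry them through is to group by all three columns. This is a minimal sketch, assuming each `shop_id` maps to exactly one name and one city (true for the sample data, but an assumption for the real data set):

```r
library(data.table)

# Sample data from the question
shop <- data.frame(
  shop_id   = c(1, 1, 1, 2, 3, 3),
  shop_name = c("Shop A", "Shop A", "Shop A", "Shop B", "Shop C", "Shop C"),
  city      = c("London", "London", "London", "Cardiff", "Dublin", "Dublin"),
  sale      = c(12, 5, 9, 15, 10, 18),
  profit    = c(3, 1, 3, 6, 5, 9)
)
shop.dt <- data.table(shop)

# Group by every identifying column so that all of them appear in the result
shop_day <- shop.dt[, list(sale = sum(sale), profit = sum(profit)),
                    by = list(shop_id, shop_name, city)]
print(shop_day)
```

If a shop could appear under more than one name or city, this grouping would return one row per distinct combination rather than one row per shop.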

On six rows the choice of method hardly matters, but the difference shows once things get bigger:

library(plyr)  # provides ddply(), .() and summarise() for the comparison
shop <- data.frame(shop_id = letters[1:10], profit=rnorm(1e7), sale=rnorm(1e7))
shop.dt <- data.table(shop)

> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
   user  system elapsed 
  4.156   1.324   5.514 
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.728   0.108   0.840 
> 

You get additional speed increases if you create the data.table with a key:

shop.dt <- data.table(shop, key='shop_id')

> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
   user  system elapsed 
  0.252   0.084   0.336 
> 
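When the data.table already exists, `setkey()` can set the key in place instead of rebuilding the table. The sketch below uses a reduced size (1e5 rows, picked arbitrarily so that it runs quickly) and an arbitrary seed, and checks that keyed and unkeyed grouping return the same sums. It does not reproduce the timings above, which depend on the machine and the data.table version.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 1e5     # much smaller than the 1e7 rows used in the timings above
shop.dt <- data.table(shop_id = rep(letters[1:10], length.out = n),
                      profit  = rnorm(n),
                      sale    = rnorm(n))

# Grouped sums before any key is set
unkeyed <- shop.dt[, list(sale = sum(sale), profit = sum(profit)), by = "shop_id"]

# setkey() sorts the table by shop_id by reference and marks it as the key
setkey(shop.dt, shop_id)
keyed <- shop.dt[, list(sale = sum(sale), profit = sum(profit)), by = "shop_id"]

# Both approaches should agree group by group
stopifnot(isTRUE(all.equal(unkeyed[order(shop_id)]$sale, keyed$sale)))
stopifnot(isTRUE(all.equal(unkeyed[order(shop_id)]$profit, keyed$profit)))
```

Wrapping either grouped call in `system.time()` gives a local comparison of the two approaches.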
