R aggregate with multiple arguments in function

Problem description

I'm trying to avoid a time-consuming for loop by using aggregate on a data.frame. But I need the values of one of the columns to enter into the final computation.

dat <- data.frame(key = c('a', 'b', 'a','b'), 
rate = c(0.5,0.4,1,0.6), 
v1 = c(4,0,3,1), 
v2 = c(2,0,9,4))

> dat
  key rate v1 v2
1   a  0.5  4  2
2   b  0.4  0  0
3   a  1.0  3  9
4   b  0.6  1  4

aggregate(dat[,-1], list(key=dat$key),  
    function(x, y=dat$rate){
        rates <- as.numeric(y)
        values <- as.numeric(x)
        return(sum(values*rates)/sum(rates))
    })

Note: The function is just an example!
The problem with this implementation is that y = dat$rate passes all 4 rates in dat, when what I want is just the 2 rates for the group being aggregated! Any suggestions on how I could do this? Thanks!
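To make the mismatch concrete, here is a minimal sketch in base R using the same dat. The variable names (group_sizes, x_a, recycled, intended) are made up for illustration. It shows that aggregate passes the function 2 values per group, while dat$rate still has 4 elements, so the values are silently recycled against all 4 rates.

```r
dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))

# aggregate() hands the function one column of one group at a time
group_sizes <- aggregate(dat["v1"], list(key = dat$key), length)
group_sizes$v1    # 2 values per group
length(dat$rate)  # but 4 rates

# For group "a": the 2 values are recycled against all 4 rates
x_a <- dat$v1[dat$key == "a"]
recycled <- sum(x_a * dat$rate) / sum(dat$rate)

# What is actually wanted: only the rates belonging to group "a"
rate_a <- dat$rate[dat$key == "a"]
intended <- sum(x_a * rate_a) / sum(rate_a)

recycled  # 3.6
intended  # 3.333333
```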

Solution

Here's what I managed to achieve, using the "data.table" package:

library(data.table)
DT <- data.table(dat, key = "key")
# Rate-weighted mean of v1 and v2 within each key
DT[, list(v1 = sum(rate * v1)/sum(rate), v2 = sum(rate * v2)/sum(rate)), by = "key"]
#    key       v1       v2
# 1:   a 3.333333 6.666667
# 2:   b 0.600000 2.400000
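As a sanity check on those numbers, the example formula sum(rate * v)/sum(rate) is a rate-weighted mean, so base R's weighted.mean should give the same result. The sketch below checks group "a" by hand; the names a, manual and builtin are only for illustration, and this applies to the example function only, not to an arbitrary function of two columns.

```r
dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))

a <- dat[dat$key == "a", ]

# (0.5 * 4 + 1 * 3) / (0.5 + 1) = 5 / 1.5
manual <- sum(a$rate * a$v1) / sum(a$rate)

# Same computation via the built-in weighted mean
builtin <- weighted.mean(a$v1, w = a$rate)

manual   # 3.333333
builtin  # 3.333333
```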

OK. So that's easy to write out for just two variables, but what about when we have a lot more columns? Use lapply(.SD, ...) in conjunction with your function:

First, some data:

set.seed(1)
dat <- data.frame(key = rep(c("a", "b"), times = 10),
                  rate = runif(20, min = 0, max = 1),
                  v1 = sample(10, 20, replace = TRUE),
                  v2 = sample(20, 20, replace = TRUE),
                  v3 = sample(30, 20, replace = TRUE),
                  x1 = sample(5, 20, replace = TRUE),
                  x2 = sample(6:10, 20, replace = TRUE),
                  x3 = sample(11:15, 20, replace = TRUE))
library(data.table)
datDT <- data.table(dat, key = "key")
datDT
#     key       rate v1 v2 v3 x1 x2 x3
#  1:   a 0.26550866 10 17 28  3  9 15
#  2:   a 0.57285336  7 16 14  2  7 13
#  3:   a 0.20168193  3 11 20  4  9 14
#  4:   a 0.94467527  1  1 15  4  6 13
#  5:   a 0.62911404  9 15  3  2 10 12
#  6:   a 0.20597457  5 10 11  2 10 13
#  7:   a 0.68702285  5  9 11  4  7 11
#  8:   a 0.76984142  9  2 15  4  6 15
#  9:   a 0.71761851  8  7 26  3  9 13
# 10:   a 0.38003518  8 14 24  5  8 15
# 11:   b 0.37212390  3 13  9  4  7 13
# 12:   b 0.90820779  2 12 10  2 10 11
# 13:   b 0.89838968  4 16  8  2  7 13
# 14:   b 0.66079779  4 10 23  1  8 12
# 15:   b 0.06178627  4 14 27  1  8 13
# 16:   b 0.17655675  6 18 26  1  9 11
# 17:   b 0.38410372  2  5 11  5  8 14
# 18:   b 0.49769924  7  2 27  4  6 13
# 19:   b 0.99190609  2 11 12  3  6 13
# 20:   b 0.77744522  5  9 29  4  9 13

Second, aggregate:

datDT[, lapply(.SD, function(x, y = rate) sum(y * x)/sum(y)), by = "key"]
#    key      rate       v1        v2       v3       x1       x2       x3
# 1:   a 0.6501303 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# 2:   b 0.7375793 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
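Note that the result above also contains a rate column (rate weighted by itself), because rate is part of .SD. A possible variation, sketched here on the small 4-row dat from the question, is to restrict .SD with .SDcols so that only the value columns are aggregated. The names value_cols and res are made up for this sketch, and weighted.mean stands in for the example function.

```r
library(data.table)

dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))
DT <- as.data.table(dat)

# Every column except the grouping column and the weights
value_cols <- setdiff(names(DT), c("key", "rate"))

# rate is still visible inside j, subset to the current group,
# even though it is not part of .SD
res <- DT[, lapply(.SD, weighted.mean, w = rate),
          by = "key", .SDcols = value_cols]
res
#    key       v1       v2
# 1:   a 3.333333 6.666667
# 2:   b 0.600000 2.400000
```

The exact print layout may differ between data.table versions; the values match the two-variable result shown earlier.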

If you have a really large dataset, you might want to explore data.table in general.


For what it is worth, I was also successful in base R, but I'm not sure how efficient this would be, particularly because of the transposing and so on.

t(sapply(split(dat, dat[1]), 
         function(x, y = 3:ncol(dat)) {
           # x is the sub-data.frame for one key; column 2 holds the rates
           V1 <- numeric(length(y))
           for (i in seq_along(y)) {
             V1[i] <- sum(x[[2]] * x[[y[i]]])/sum(x[[2]])
           }
           V1
         }))
#       [,1]      [,2]     [,3]     [,4]     [,5]     [,6]
# a 6.335976  8.634691 15.75915 3.363832 7.658762 13.19152
# b 3.595585 10.749705 16.26582 2.792390 7.741787 12.57301
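If the explicit loop and the transpose are a concern, one possible base R variation is to split by key, apply the function to each value column of the group, and row-bind the results. This is only a sketch on the small 4-row dat from the question; value_cols and res are hypothetical names, and weighted.mean again stands in for the example function.

```r
dat <- data.frame(key = c("a", "b", "a", "b"),
                  rate = c(0.5, 0.4, 1, 0.6),
                  v1 = c(4, 0, 3, 1),
                  v2 = c(2, 0, 9, 4))

value_cols <- setdiff(names(dat), c("key", "rate"))

# One named vector per group, stacked into a matrix (rows = keys)
res <- do.call(rbind, lapply(split(dat, dat$key), function(g) {
  # g$rate holds only the rates of the current group
  sapply(g[value_cols], weighted.mean, w = g$rate)
}))
res
#         v1       v2
# a 3.333333 6.666667
# b 0.600000 2.400000
```

Unlike the version above, this keeps the column names in the result, since sapply over a data.frame returns a named vector.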
