Calculate multiple aggregations on several variables using lapply(.SD, ...)

Problem description

I'd like to perform multiple aggregations, using data.table's lapply(.SD, ...) approach, i.e. calculate several different summary statistics on several variables. But my guesses as to how to do this end in either errors or the equivalent of rbind rather than cbind.

For example, to get the mean and median mpg in mtcars by cyl, one could do the following:

library(data.table)
mtcars.dt <- data.table(mtcars)
mtcars.dt[, list(mpg.mean = mean(mpg), mpg.median = median(mpg)), by = "cyl"]
# Result:
   cyl mpg.mean mpg.median
1:   6    19.74       19.7
2:   4    26.66       26.0
3:   8    15.10       15.2

But applying the .SD approach either stacks the results of the two functions as rows (as rbind would) instead of placing them side by side as columns:

mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x))),
          by = "cyl", .SDcols = c("mpg")]
# Result:
   cyl              mpg
1:   6 19.7428571428571
2:   6             19.7
3:   4 26.6636363636364
4:   4               26
5:   8             15.1
6:   8             15.2

Or breaks altogether:

mtcars.dt[, lapply(.SD, list(mean, median)),
          by = "cyl", .SDcols = c("mpg")]
# Result:
# Error in `[.data.table`(mtcars.dt, , lapply(.SD, list(mean, median)),  :
#  attempt to apply non-function

EDIT: As Senor O noted, some of the answers work for my example, but only because there is a single aggregation column. An ideal solution would work for multiple columns, for example replacing the following:

mtcars.dt[, list(mpg.mean = mean(mpg), mpg.median = median(mpg), 
                 hp.mean = mean(hp), hp.median = median(hp)), by = "cyl"]
# Result:
   cyl mpg.mean mpg.median hp.mean hp.median
1:   6    19.74       19.7  122.29     110.0
2:   4    26.66       26.0   82.64      91.0
3:   8    15.10       15.2  209.21     192.5

However, even if it works for a single column, it can still be useful. For example, my immediate use case is a function which takes a column name as a string and calculates multiple grouped-by metrics for it, something which is not possible without .SDcols AFAIK.

Solution

You're missing a [[1]] or $mpg:

mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))[[1]],
            by="cyl", .SDcols=c("mpg")]
#or
mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))$mpg,
            by="cyl", .SDcols=c("mpg")]
#   cyl       V1   V2
#1:   6 19.74286 19.7
#2:   4 26.66364 26.0
#3:   8 15.10000 15.2
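
To see why the extra [[1]] (or $mpg) matters, it may help to look at the shape of what lapply returns for a single group. The sketch below uses base R only; sd_like is a hypothetical stand-in for what .SD holds for the 6-cylinder group, not a real data.table object.

```r
# Base R only: mimic .SD for one group (the mpg values of 6-cylinder cars).
# `sd_like` is a made-up name used for illustration.
sd_like <- list(mpg = mtcars$mpg[mtcars$cyl == 6])

# lapply returns one element per input column, and here each element is
# itself a list of two values.
nested <- lapply(sd_like, function(x) list(mean(x), median(x)))
str(nested)

# Extracting the inner list gives one element per statistic.
flat <- nested[["mpg"]]   # same as nested[[1]] or nested$mpg here
str(flat)
```

Without the extraction, j returns a single column named mpg that holds two values per group, which is why the results are stacked as rows. With it, j returns a flat list of two elements, and each element becomes its own column (V1 and V2).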

For the more general case, try:

mtcars.dt[, as.list(unlist(lapply(.SD, function(x) list(mean=mean(x),
                                                        median=median(x))))),
            by="cyl", .SDcols=c("mpg", "hp")]
#    cyl mpg.mean mpg.median hp.mean hp.median
# 1:   6    19.74       19.7  122.29     110.0
# 2:   4    26.66       26.0   82.64      91.0
# 3:   8    15.10       15.2  209.21     192.5

(or as.list(sapply(.SD, ...)))
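
For the use case in the question (a function that takes column names as strings), the general form can be wrapped as follows. This is a minimal sketch, not part of the original answer: grouped_stats and its arguments cols and by_cols are hypothetical names. The column names in the result come from unlist, which joins the outer name (the column) and the inner name (the statistic) with a dot.

```r
library(data.table)

# Hypothetical helper: for each column named in `cols`, compute the mean and
# median within the groups defined by the columns named in `by_cols`.
grouped_stats <- function(dt, cols, by_cols) {
  dt[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x),
                                                   median = median(x))))),
     by = by_cols, .SDcols = cols]
}

mtcars.dt <- data.table(mtcars)
res <- grouped_stats(mtcars.dt, cols = c("mpg", "hp"), by_cols = "cyl")
print(names(res))
print(res)
```

One caveat: unlist coerces all values to a common type, so this pattern fits best when every statistic is numeric. If one statistic returned a character value, all result columns would become character.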
