Calculate multiple aggregations on several variables using lapply(.SD, ...)

Question

I'd like to perform multiple aggregations using data.table's lapply(.SD, ...) approach, i.e. calculate several different summary statistics on several variables. But my attempts end either in errors or in the equivalent of rbind rather than cbind.

For example, to get the mean and median mpg in mtcars by cyl, one could do the following:

library(data.table)
mtcars.dt <- data.table(mtcars)
mtcars.dt[, list(mpg.mean = mean(mpg), mpg.median = median(mpg)), by = "cyl"]
# Result:
   cyl mpg.mean mpg.median
1:   6    19.74       19.7
2:   4    26.66       26.0
3:   8    15.10       15.2

But applying the .SD approach either rbinds the results of the functions:

mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x))),
          by = "cyl", .SDcols = c("mpg")]
# Result:
   cyl              mpg
1:   6 19.7428571428571
2:   6             19.7
3:   4 26.6636363636364
4:   4               26
5:   8             15.1
6:   8             15.2

Or breaks entirely:

mtcars.dt[, lapply(.SD, list(mean, median)),
          by = "cyl", .SDcols = c("mpg")]
# Result:
# Error in `[.data.table`(mtcars.dt, , lapply(.SD, list(mean, median)),  :
#  attempt to apply non-function

As Senor O noted, some answers work for my example, but only because there is a single aggregation column. An ideal solution would work for multiple columns, for example replacing the following:

mtcars.dt[, list(mpg.mean = mean(mpg), mpg.median = median(mpg), 
                 hp.mean = mean(hp), hp.median = median(hp)), by = "cyl"]
# Result:
   cyl mpg.mean mpg.median hp.mean hp.median
1:   6    19.74       19.7  122.29     110.0
2:   4    26.66       26.0   82.64      91.0
3:   8    15.10       15.2  209.21     192.5

However, a solution that works only for a single column would still be useful. For example, my immediate use case is a function which takes a column name as a string and calculates multiple grouped-by metrics for it, something which, AFAIK, is not possible without .SDcols.

Answer

You're just missing a [[1]] or $mpg:

mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))[[1]],
            by="cyl", .SDcols=c("mpg")]
#or
mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))$mpg,
            by="cyl", .SDcols=c("mpg")]
#   cyl       V1   V2
#1:   6 19.74286 19.7
#2:   4 26.66364 26.0
#3:   8 15.10000 15.2
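
Why the extra [[1]] matters can be seen outside data.table with plain lists. The sketch below is an illustration only: it mimics what j returns for the cyl == 6 group, and the names x and nested are made up for the example. The outer list has a single element named mpg holding two values, which data.table stacks as two rows of one column; extracting that element leaves a two-element list, which becomes two columns.

```r
# Values of mpg for one group (cyl == 6), taken from the built-in mtcars data
x <- mtcars$mpg[mtcars$cyl == 6]

# What lapply(.SD, function(x) list(mean(x), median(x))) returns per group:
# a list with ONE element ("mpg") that itself holds two values
nested <- lapply(list(mpg = x), function(v) list(mean(v), median(v)))
length(nested)       # 1: a single column, so the two values are stacked as rows

# Extracting the inner list gives TWO elements, i.e. two columns
inner <- nested[[1]]
length(inner)        # 2
```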

For the more general case, try:

mtcars.dt[, as.list(unlist(lapply(.SD, function(x) list(mean=mean(x),
                                                        median=median(x))))),
            by="cyl", .SDcols=c("mpg", "hp")]
#    cyl mpg.mean mpg.median hp.mean hp.median
# 1:   6    19.74       19.7  122.29     110.0
# 2:   4    26.66       26.0   82.64      91.0
# 3:   8    15.10       15.2  209.21     192.5

(or as.list(sapply(.SD, ...)))
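
The use case from the question, a function that takes column names as strings, could be built on the general form above. This is a minimal sketch under stated assumptions: grouped_stats and its arguments value_cols, group_cols and funs are hypothetical names invented for the example, not part of data.table. It also assumes the table has no columns with those names, and that every summary function returns a single numeric value.

```r
library(data.table)

# Hypothetical helper (not part of data.table).
# value_cols, group_cols: column names as strings; funs: named list of functions.
grouped_stats <- function(dt, value_cols, group_cols,
                          funs = list(mean = mean, median = median)) {
  dt[, as.list(unlist(
         # For each column in .SD, apply every function in funs;
         # unlist() then builds names such as "mpg.mean"
         lapply(.SD, function(x) lapply(funs, function(f) f(x)))
       )),
     by = group_cols, .SDcols = value_cols]
}

mtcars.dt <- data.table(mtcars)
res <- grouped_stats(mtcars.dt, c("mpg", "hp"), "cyl")
print(res)
```

Because unlist() coerces all values to a common type, this pattern is best kept to summaries that share one type, such as the numeric mean and median used here.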
