Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)


Question

This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, which was answered thoroughly here.

The difference is that I would like to simultaneously apply a different function on another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:

library(data.table)
dt = data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                v1 = rnorm(100),
                v2 = rnorm(100),
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD, mean)), by = grp, .SDcols = sd.cols]

This produces the following error:

Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,  
: object 'v1' not found

Now this makes sense because the v1 column is not included in the subset of columns which must be evaluated first. So I explored further by including it in my subset of columns:

sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

Now this does not cause an error, but it returns 9 rows (for 3 groups), with the sum repeated thrice in column V1 and the means of all 3 columns (as expected, but not wanted) placed in V2, as shown below:

> dt.out 
   grp        V1                  V2
1:   c -1.070608 -0.0486639841313638
2:   c -1.070608  -0.178154270921521
3:   c -1.070608  -0.137625003604012
4:   b -2.782252 -0.0794929150464099
5:   b -2.782252  -0.149529237116445
6:   b -2.782252   0.199925178109264
7:   a  6.091355   0.141659419355985
8:   a  6.091355 -0.0272192037753071
9:   a  6.091355 0.00815760216214876

Workaround using 2 steps

Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column, as follows:

dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")

> dt.out
   grp        V1         v2           v3
1:   a  6.091355 -0.0272192  0.008157602
2:   b -2.782252 -0.1495292  0.199925178
3:   c -1.070608 -0.1781543 -0.137625004

I'm sure it's a fairly simple thing I am missing. Thanks in advance for any guidance.

Recommended answer

Update: Issue #495 is now solved with this recent commit, so we can do this just fine:

require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
                v1 = rnorm(100), 
                v2 = rnorm(100), 
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1),  lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

However, note that in this case the second column (auto-named V2, since it is not given a name) would be returned as a list. That's because you're effectively doing list(val, list()). What you intend to do perhaps is:

dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
#    grp        v1          v2         v3
# 1:   a -6.440273  0.16993940  0.2173324
# 2:   b  4.304350 -0.02553813  0.3381612
# 3:   c  0.377974 -0.03828672 -0.2489067
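
To make the list(val, list()) point concrete, here is a minimal base R sketch of the two j-expressions as they would look for a single group. It is only an illustration: grp_rows and sd_like are hypothetical stand-ins for one group's rows and for .SD, and no data.table call is involved.

```r
# Hypothetical stand-in for the rows of one group (not taken from the question's data).
grp_rows <- data.frame(v1 = c(1, 2, 3), v2 = c(2, 4, 6), v3 = c(10, 20, 30))

# Plays the role of .SD when .SDcols = c("v2", "v3").
sd_like <- grp_rows[c("v2", "v3")]

# list(val, list()): two elements, and the second is itself a list of two means.
nested <- list(v1 = sum(grp_rows$v1), lapply(sd_like, mean))

# c(list(val), list()): one flat list with one element per output column.
flat <- c(list(v1 = sum(grp_rows$v1)), lapply(sd_like, mean))

length(nested)            # 2
unname(lengths(nested))   # 1 2
names(flat)               # "v1" "v2" "v3"
unname(lengths(flat))     # 1 1 1
```

Each element of the list returned by j becomes one column, so the flat version yields three atomic columns with one row per group, whereas the nested version yields a sum column plus a single list column.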


See history for older answer.
