Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)
Question
This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, which has been answered thoroughly elsewhere.
The difference is that I would like to simultaneously apply a different function to another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:
library(data.table)
dt = data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                v1 = rnorm(100),
                v2 = rnorm(100),
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD, mean)), by = grp, .SDcols = sd.cols]
This produces the following error:
Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,
: object 'v1' not found
Now this makes sense, because the v1 column is not included in the subset of columns, which must be evaluated first. So I explored further by including it in my subset of columns:
sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Now this does not cause an error, but it returns 9 rows (for 3 groups), with the sum repeated three times in column V1 and the means for all 3 columns (as expected but not wanted) placed in V2, as shown below:
> dt.out
grp V1 V2
1: c -1.070608 -0.0486639841313638
2: c -1.070608 -0.178154270921521
3: c -1.070608 -0.137625003604012
4: b -2.782252 -0.0794929150464099
5: b -2.782252 -0.149529237116445
6: b -2.782252 0.199925178109264
7: a 6.091355 0.141659419355985
8: a 6.091355 -0.0272192037753071
9: a 6.091355 0.00815760216214876
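The 9 rows arise because the second element of j is itself a list with one element per .SD column. data.table treats it as a single list column of length 3 and recycles the length-1 sum against it within each group. A minimal base R sketch of the shapes involved, using made-up stand-in values rather than the data above:

```r
# Hypothetical stand-ins for one group's results (values are made up)
s <- 1.5                                  # like sum(v1): a length-1 vector
m <- list(v1 = 0.1, v2 = 0.2, v3 = 0.3)   # like lapply(.SD, mean): a list of 3

j <- list(s, m)   # the shape of list(sum(v1), lapply(.SD, mean))
length(j)         # 2, so two output columns
lengths(j)        # 1 and 3, so the first column is recycled to 3 rows per group
```

With 3 groups and 3 rows per group, this accounts for the 9 rows shown.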
Workaround using 2 steps
Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column, as follows:
dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")
> dt.out
grp V1 v2 v3
1: a 6.091355 -0.0272192 0.008157602
2: b -2.782252 -0.1495292 0.199925178
3: c -1.070608 -0.1781543 -0.137625004
I'm sure it's a fairly simple thing I am missing. Thanks in advance for any guidance.
Recommended answer
Update: Issue #495 has now been solved by a recent commit, so the original attempt runs without error:
require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
v1 = rnorm(100),
v2 = rnorm(100),
v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Note, however, that in this case the second column (V2) would be returned as a list column. That's because you're effectively doing list(val, list()). What you intended is perhaps:
dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
# grp v1 v2 v3
# 1: a -6.440273 0.16993940 0.2173324
# 2: b 4.304350 -0.02553813 0.3381612
# 3: c 0.377974 -0.03828672 -0.2489067
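To make the effect of c() easier to verify by eye, here is a small sketch on a hand-made table. The table toy and its values are hypothetical and not part of the original example, and it assumes a data.table version in which columns outside .SDcols are visible in j (v1.9.7+ according to the update above):

```r
library(data.table)

# Small hand-made table so the results can be checked by hand (hypothetical data)
toy <- data.table(grp = c("a", "a", "b", "b"),
                  v1  = c(1, 2, 3, 4),
                  v2  = c(10, 20, 30, 40),
                  v3  = c(0, 2, 4, 6))

# c() splices the named sum and the per-column means into one flat list,
# so every element becomes its own output column
res <- toy[, c(list(v1 = sum(v1)), lapply(.SD, mean)),
           by = grp, .SDcols = c("v2", "v3")]

# Group a: v1 = 1 + 2 = 3, v2 = mean(10, 20) = 15, v3 = mean(0, 2) = 1
# Group b: v1 = 3 + 4 = 7, v2 = mean(30, 40) = 35, v3 = mean(4, 6) = 5
res
```

The result has one row per group and plain numeric columns v1, v2 and v3, rather than a list column.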
See history for older answer.