Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)


Problem description


This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, which has already been answered thoroughly.

The difference is that I would like to simultaneously apply a different function on another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:

dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
                v1 = rnorm(100), 
                v2 = rnorm(100), 
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1),  lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

Yields the following error:

Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,  
: object 'v1' not found

Now this makes sense because the v1 column is not included in the subset of columns which must be evaluated first. So I explored further by including it in my subset of columns:

sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

Now this does not cause an error but it provides an answer containing 9 rows (for 3 groups), with the sum repeated thrice in column V1 and the means for all 3 columns (as expected but not wanted) placed in V2 as shown below:

> dt.out 
   grp        V1                  V2
1:   c -1.070608 -0.0486639841313638
2:   c -1.070608  -0.178154270921521
3:   c -1.070608  -0.137625003604012
4:   b -2.782252 -0.0794929150464099
5:   b -2.782252  -0.149529237116445
6:   b -2.782252   0.199925178109264
7:   a  6.091355   0.141659419355985
8:   a  6.091355 -0.0272192037753071
9:   a  6.091355 0.00815760216214876
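
A quick way to see what this result contains is to inspect the column types. The sketch below is a reproduction under stated assumptions rather than part of the original post: it assumes a recent data.table release, the seed is chosen only so the run is repeatable, and the name `res` is purely illustrative.

```r
library(data.table)

set.seed(1L)  # seed assumed here only so the run is repeatable
dt <- data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                 v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))

# Same call as above: a scalar placed next to a list of three means
res <- dt[, list(sum(v1), lapply(.SD, mean)),
          by = grp, .SDcols = c("v1", "v2", "v3")]

sapply(res, class)  # V2 comes back as a list column, not a numeric one
nrow(res)           # one row per group per element of the inner list
```

The sum is recycled to match the length of the inner list, which is why it appears three times per group.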

Workaround Solution using 2 steps

Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column as follows:

dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")

> dt.out
   grp        V1         v2           v3
1:   a  6.091355 -0.0272192  0.008157602
2:   b -2.782252 -0.1495292  0.199925178
3:   c -1.070608 -0.1781543 -0.137625004
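
Note that the sum column in the merged result carries the default name V1. A minimal sketch of the same two-step workaround with the sum named explicitly is given below; the intermediate names `sums` and `means` and the seed are assumptions made for illustration, not part of the original post.

```r
library(data.table)

set.seed(1L)  # seed assumed here only so the run is repeatable
dt <- data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                 v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100))
sd.cols <- c("v2", "v3")

# Step 1: grouped sum, naming the output column instead of accepting V1
sums <- dt[, list(v1 = sum(v1)), by = grp]

# Step 2: grouped means over the .SDcols subset only
means <- dt[, lapply(.SD, mean), by = grp, .SDcols = sd.cols]

# Step 3: join the two per-group tables on the grouping column
dt.out <- merge(sums, means, by = "grp")
names(dt.out)
```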

I'm sure it's a fairly simple thing I am missing. Thanks in advance for any guidance.

Solution

Update: Issue #495 has now been resolved by a recent commit, so the original call runs without error:

require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
                v1 = rnorm(100), 
                v2 = rnorm(100), 
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1),  lapply(.SD,mean)), by = grp, .SDcols = sd.cols]

However, note that in this case the second column (V2) would be returned as a list column. That's because you are effectively doing list(val, list()). What you perhaps intend to do is:

dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
#    grp        v1          v2         v3
# 1:   a -6.440273  0.16993940  0.2173324
# 2:   b  4.304350 -0.02553813  0.3381612
# 3:   c  0.377974 -0.03828672 -0.2489067
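
The difference between the two calls can be seen in base R without data.table at all. The following is a minimal sketch with made-up placeholder values; `val`, `means`, `nested`, and `flat` are illustrative names, not anything from the original answer.

```r
val   <- 10                    # stands in for sum(v1)
means <- list(v2 = 1, v3 = 2)  # stands in for lapply(.SD, mean)

nested <- list(v1 = val, means)     # two elements; the second is itself a list
flat   <- c(list(v1 = val), means)  # three elements, one per output column

length(nested)  # 2
length(flat)    # 3
names(flat)     # "v1" "v2" "v3"
```

Since data.table treats each top-level element of the list returned in j as one column, the flat form yields the three separate columns v1, v2, and v3.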


See history for older answer.
