Apply function on a subset of columns (.SDcols) whilst applying a different function on another column (within groups)
Problem description
This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, answered thoroughly here.
The difference is that I would like to simultaneously apply a different function to another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
v1 = rnorm(100),
v2 = rnorm(100),
v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
This yields the following error:
Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,
: object 'v1' not found
Now this makes sense because the v1 column is not included in the subset of columns, which must be evaluated first. So I explored further by including it in my subset of columns:
sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
This does not cause an error, but it returns 9 rows (for 3 groups), with the sum repeated three times in column V1 and the means for all 3 columns (as expected but not wanted) placed in V2, as shown below:
> dt.out
grp V1 V2
1: c -1.070608 -0.0486639841313638
2: c -1.070608 -0.178154270921521
3: c -1.070608 -0.137625003604012
4: b -2.782252 -0.0794929150464099
5: b -2.782252 -0.149529237116445
6: b -2.782252 0.199925178109264
7: a 6.091355 0.141659419355985
8: a 6.091355 -0.0272192037753071
9: a 6.091355 0.00815760216214876
Workaround solution using 2 steps
Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column, as follows:
dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")
> dt.out
grp V1 v2 v3
1: a 6.091355 -0.0272192 0.008157602
2: b -2.782252 -0.1495292 0.199925178
3: c -1.070608 -0.1781543 -0.137625004
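As a small variation on the two-step workaround, the sum column can be named inside j so that the merged result carries v1 rather than the auto-generated V1. This is only a sketch: it rebuilds dt and sd.cols with a seed so it runs on its own, the names sums, means and out are illustrative, and .() is data.table's alias for list().

```r
library(data.table)

# Rebuild the example data with a fixed seed so the sketch is reproducible.
set.seed(1L)
dt <- data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                 v1 = rnorm(100),
                 v2 = rnorm(100),
                 v3 = rnorm(100))
sd.cols <- c("v2", "v3")

# Step 1: sum of v1 per group, naming the result column explicitly.
sums <- dt[, .(v1 = sum(v1)), by = grp]

# Step 2: mean of each .SDcols column per group.
means <- dt[, lapply(.SD, mean), by = grp, .SDcols = sd.cols]

# Step 3: join the two per-group results on the grouping column.
out <- merge(sums, means, by = "grp")
names(out)
```

The result has one row per group and the columns grp, v1, v2 and v3.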
I'm sure it's a fairly simple thing I am missing; thanks in advance for any guidance.
Update: Issue #495 is now solved by this recent commit, so we can do this just fine:
require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
v1 = rnorm(100),
v2 = rnorm(100),
v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
However, note that in this case v2 would be returned as a list. That's because you're effectively doing list(val, list()). What you perhaps intend to do is:
dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
# grp v1 v2 v3
# 1: a -6.440273 0.16993940 0.2173324
# 2: b 4.304350 -0.02553813 0.3381612
# 3: c 0.377974 -0.03828672 -0.2489067
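The difference between the two forms can be seen in base R alone, without data.table. The sketch below is illustrative only: means is a hypothetical stand-in for what lapply(.SD, mean) returns for a single group, and the values are made up.

```r
# Stand-in for the per-group result of lapply(.SD, mean): a named list.
means <- list(v2 = 0.5, v3 = 1.5)

# list(value, list): the second element is itself a list (nested).
nested <- list(v1 = 10, means)

# c(list(value), list): concatenation yields one flat list of three elements.
flat <- c(list(v1 = 10), means)

length(nested)        # 2
is.list(nested[[2]])  # TRUE
length(flat)          # 3
names(flat)           # "v1" "v2" "v3"
```

Because j is expected to return a flat list with one element per output column, the flat form gives separate v1, v2 and v3 columns, whereas the nested form packs the means into a single list column.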
See history for older answer.