How to iterate through all combinations of columns and apply a function by group in R?
Question
I have the following data.table named dt:
library(data.table)
set.seed(1)
dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
dt$perf <- rnorm(nrow(dt),0,.01)
colnames(dt) <- c("ticker","par1","par2","row_names","perf")
My goal is to iterate through all combinations of par1 and par2 by row_names and pick the one that maximizes cumprod(mean(perf)+1)-1.
Let's look at the data so this makes more sense visually.
dt[order(row_names,ticker,par1,par2)]
ticker par1 par2 row_names perf
1: a 1 1 M 0.011462284
2: a 1 2 M -0.004252677
3: a 2 1 M 0.005727396
4: a 2 2 M -0.003892372
5: b 1 1 M -0.024030962
6: b 1 2 M 0.009510128
7: b 2 1 M 0.003747244
8: b 2 2 M -0.002843307
For each ticker and row_names we have 2 x 2 = 4 combinations of par1 and par2, namely (1,1), (1,2), (2,1), (2,2).
I would like to average the perf associated with ticker = a, par1 = 1, par2 = 1 with the perf of each combination for ticker = b, one pair at a time. Using the numbers from the output above,
res
a_perf b_perf
1: 0.01146228 -0.024030962
2: 0.01146228 0.009510128
3: 0.01146228 0.003747244
4: 0.01146228 -0.002843307
apply(res,1,mean)
[1] -0.006284339 0.010486206 0.007604764 0.004309488
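For reference, here is a minimal sketch of how the res table above could be built for a single row_names group. It assumes the dt defined at the top of the question; the intermediate names m, a_perf and pair_means are made up for illustration.

```r
library(data.table)

# Same setup as in the question
set.seed(1)
dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
dt$perf <- rnorm(nrow(dt),0,.01)
colnames(dt) <- c("ticker","par1","par2","row_names","perf")

# Restrict to one row_names group
m <- dt[row_names == "M"]

# perf of the single (ticker = a, par1 = 1, par2 = 1) row
a_perf <- m[ticker == "a" & par1 == 1 & par2 == 1, perf]

# Pair it with every par1/par2 combination of ticker b (a_perf is recycled)
res <- data.table(a_perf = a_perf, b_perf = m[ticker == "b", perf])

# Mean of each pair, equivalent to apply(res, 1, mean)
pair_means <- rowMeans(res)
```

rowMeans is used instead of apply(res, 1, mean) only because it is vectorised; both give one mean per pair.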
Then we repeat this process for ticker = a, par1 = 1, par2 = 2 with every combination for ticker = b.
We would repeat this process for all combinations of par1 and par2 within each row_names.
EDIT: Using @earch's suggestion, we get the following:
tmp <- lapply(split(dt, dt$row_names), calcCombMeans)
$M
a.row b.row mean
1 1 2 -0.0022140524
2 3 2 -0.0032599264
3 5 2 0.0025657555
4 7 2 0.0033553619
5 1 4 0.0048441350
6 3 4 0.0037982609
7 5 4 0.0096239429
8 7 4 0.0104135493
9 1 6 -0.0072346110
10 3 6 -0.0082804850
11 5 6 -0.0024548031
12 7 6 -0.0016651967
13 1 8 0.0005593545
14 3 8 -0.0004865195
15 5 8 0.0053391624
16 7 8 0.0061287688
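The calcCombMeans helper comes from @earch's answer and is not shown here. Below is a minimal sketch of what such a helper might look like, reconstructed only from the a.row, b.row and mean columns in the output above. It is an assumption for illustration, not @earch's actual code.

```r
library(data.table)

# Same setup as in the question
set.seed(1)
dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
dt$perf <- rnorm(nrow(dt),0,.01)
colnames(dt) <- c("ticker","par1","par2","row_names","perf")

# Hypothetical reconstruction: pair every ticker-a row with every ticker-b row
# inside one row_names group and average the two perf values
calcCombMeans <- function(d) {
  a.row <- which(d$ticker == "a")   # row positions of ticker a in this group
  b.row <- which(d$ticker == "b")   # row positions of ticker b in this group
  combos <- expand.grid(a.row = a.row, b.row = b.row)
  combos$mean <- (d$perf[combos$a.row] + d$perf[combos$b.row]) / 2
  combos
}

tmp <- lapply(split(dt, dt$row_names), calcCombMeans)
```

Each element of tmp then has 4 x 4 = 16 rows, one per pairing, which is why this approach grows quickly as the number of parameter combinations increases.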
From here, I would like to pick the max(mean) for each of the row_names M, N, O, P, Q. One way to do that, if I did not care about referencing indices later on, would be:
res <- sapply(seq_along(tmp), function(i) which.max(tmp[[i]]$mean))
[1] 8 6 3 12 16
This is how I would calculate my desired end result in full:
res <- rbindlist(tmp, idcol = "row_names")
res <- res[, list(best = max(mean), best_idx = which.max(mean)), by = row_names]
row_names best best_idx
1: M 0.010413549 8
2: N 0.009508122 6
3: O 0.009314068 3
4: P 0.008883106 12
5: Q 0.009316006 16
I haven't decided whether I need the best_idx information (I probably will, in order to replicate the exact calculation for a specific row_names), but using this res I can calculate my cumRet by doing:
res[,cumRet:= cumprod(best+1)-1]
> res
row_names best best_idx cumRet
1: M 0.010413549 8 0.01041355
2: N 0.009508122 6 0.02002068
3: O 0.009314068 3 0.02952123
4: P 0.008883106 12 0.03866657
5: Q 0.009316006 16 0.04834280
@earch's answer really helps me see the process of calculating all these combinations. I was wondering whether there is a more efficient solution using data.table's functionality. My real data set is much larger than this (millions of rows), and the combinations will start to take a toll.
EDIT #2: After being able to step through the process, I came up with a very fast solution!
tmp <- dt[, list(par1 = par1[which.max(perf)], par2 = par2[which.max(perf)], perf = max(perf)),
          by = list(ticker, row_names)]
res <- tmp[, list(perf = mean(perf), par1 = paste(par1, collapse = ","), par2 = paste(par2, collapse = ",")),
           by = row_names]
Using data.table allows me to calculate the max perf for each combination of row_names and ticker. After that, I can group by row_names and take the mean, and it gets the same results!
> res
row_names perf par1 par2
1: M 0.010413549 2,2 2,1
2: N 0.009508122 2,2 1,1
3: O 0.009314068 1,1 2,1
4: P 0.008883106 2,1 2,2
5: Q 0.009316006 2,2 2,2
Recommended answer
EDIT #2: After being able to step through the process, I came up with a very fast solution!
tmp <- dt[, list(par1 = par1[which.max(perf)],
                 par2 = par2[which.max(perf)],
                 perf = max(perf)),
          by = list(ticker, row_names)]
res <- tmp[, list(perf = mean(perf),
                  par1 = paste(par1, collapse = ","),
                  par2 = paste(par2, collapse = ",")),
           by = row_names]
Using data.table allows me to calculate the max perf for each combination of row_names and ticker. After that, I can group by row_names and take the mean, and it gets the same results!
> res
row_names perf par1 par2
1: M 0.010413549 2,2 2,1
2: N 0.009508122 2,2 1,1
3: O 0.009314068 1,1 2,1
4: P 0.008883106 2,1 2,2
5: Q 0.009316006 2,2 2,2
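The shortcut matches the brute-force search because the mean of a pair, (a + b) / 2, is largest exactly when a and b are each at their own maximum, so the two tickers can be maximised independently. A small sketch that checks this on the example data, assuming the dt from the question (the names brute and fast are made up for illustration):

```r
library(data.table)

# Same setup as in the question
set.seed(1)
dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
dt$perf <- rnorm(nrow(dt),0,.01)
colnames(dt) <- c("ticker","par1","par2","row_names","perf")

# Brute force: average every a/b pairing within each row_names, keep the best
brute <- dt[, list(best = max(outer(perf[ticker == "a"],
                                    perf[ticker == "b"], "+") / 2)),
            by = row_names]

# Shortcut: best perf per ticker first, then average across tickers
tmp  <- dt[, list(perf = max(perf)), by = list(ticker, row_names)]
fast <- tmp[, list(best = mean(perf)), by = row_names]

# TRUE when both approaches agree
all.equal(brute$best, fast$best)

# Cumulative return from the shortcut result
fast[, cumRet := cumprod(best + 1) - 1]
```

The brute-force version does work proportional to the product of the number of combinations per ticker, while the shortcut only scans each group once, which matters with millions of rows.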