How to iterate through all combinations of columns and apply a function by group in R?

Problem description

I have the following data.table named dt:

library(data.table)

set.seed(1)
dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q")))
dt$perf <- rnorm(nrow(dt),0,.01)
# rename the columns in place
setnames(dt, c("ticker","par1","par2","row_names","perf"))

My goal is to iterate through all combinations of par1 and par2 within each row_names and pick the combination that maximizes cumprod(mean(perf)+1)-1. Let's look at the data so this makes more sense visually.

dt[order(row_names,ticker,par1,par2)]
    ticker par1 par2 row_names         perf
 1:      a    1    1         M  0.011462284
 2:      a    1    2         M -0.004252677
 3:      a    2    1         M  0.005727396
 4:      a    2    2         M -0.003892372
 5:      b    1    1         M -0.024030962
 6:      b    1    2         M  0.009510128
 7:      b    2    1         M  0.003747244
 8:      b    2    2         M -0.002843307

For each ticker and row_names we have 2 x 2 = 4 combinations of par1 and par2, namely (1,1), (1,2), (2,1) and (2,2).

I would like to calculate the mean of the perf associated with ticker = a, par1 = 1, par2 = 1 and each perf associated with the combinations for ticker = b. Using the numbers from the data shown above:

res
       a_perf       b_perf
1: 0.01146228 -0.024030962
2: 0.01146228  0.009510128
3: 0.01146228  0.003747244
4: 0.01146228 -0.002843307

apply(res,1,mean)
[1] -0.006284339  0.010486206  0.007604764  0.004309488

Then we repeat this process for ticker = a, par1 = 1, par2 = 2 with all combinations for ticker = b.

We repeat this for all combinations of par1 and par2 within each row_names.

Edit: Using @earch's suggestion, we get the following (calcCombMeans is not defined in this post):

tmp <- lapply(split(dt, dt$row_names), calcCombMeans)
$M
   a.row b.row          mean
1      1     2 -0.0022140524
2      3     2 -0.0032599264
3      5     2  0.0025657555
4      7     2  0.0033553619
5      1     4  0.0048441350
6      3     4  0.0037982609
7      5     4  0.0096239429
8      7     4  0.0104135493
9      1     6 -0.0072346110
10     3     6 -0.0082804850
11     5     6 -0.0024548031
12     7     6 -0.0016651967
13     1     8  0.0005593545
14     3     8 -0.0004865195
15     5     8  0.0053391624
16     7     8  0.0061287688
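
Since calcCombMeans is not reproduced here, the following is a minimal sketch of what such a helper might look like. The function name calc_comb_means_sketch and its body are assumptions inferred from the a.row, b.row and mean columns printed above; it is not @earch's actual code.

```r
library(data.table)

# Hypothetical stand-in for calcCombMeans (assumption, not the original helper).
# For one row_names group, pair every ticker "a" row with every ticker "b" row
# and average the two perf values.
calc_comb_means_sketch <- function(grp) {
  a_rows <- which(grp$ticker == "a")   # row positions within the group
  b_rows <- which(grp$ticker == "b")
  pairs <- expand.grid(a.row = a_rows, b.row = b_rows)
  pairs$mean <- (grp$perf[pairs$a.row] + grp$perf[pairs$b.row]) / 2
  pairs
}

# Same construction as the dt in the question.
set.seed(1)
dt <- data.table(expand.grid(c("a", "b"), 1:2, 1:2, c("M", "N", "O", "P", "Q")))
dt$perf <- rnorm(nrow(dt), 0, .01)
setnames(dt, c("ticker", "par1", "par2", "row_names", "perf"))

tmp_sketch <- lapply(split(dt, dt$row_names), calc_comb_means_sketch)
```

Each element of tmp_sketch has 4 x 4 = 16 rows, one per pairing. The number of pairings grows with the product of the combinations per ticker, which is why this approach becomes expensive on large data.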

From here, I would like to pick max(mean) for each of the row_names M, N, O, P, Q. If I did not care about referencing the indices later on, one way to do that would be:

res <- sapply(seq_along(tmp), function(i) which.max(tmp[[i]]$mean))
[1]  8  6  3 12 16

This is how I would compute my complete desired end result:

# `mean` below refers to the column produced by calcCombMeans, not the base function
res <- rbindlist(tmp, idcol = "row_names")
res <- res[, list(best = max(mean), best_idx = which.max(mean)), by = row_names]
   row_names        best best_idx
1:         M 0.010413549        8
2:         N 0.009508122        6
3:         O 0.009314068        3
4:         P 0.008883106       12
5:         Q 0.009316006       16

I haven't decided whether I need the best_idx information (I probably will, in order to replicate the exact calculation for a specific row_names), but using this res, I can calculate my cumRet by doing:

res[,cumRet:= cumprod(best+1)-1]
> res
   row_names        best best_idx      cumRet
1:         M 0.010413549        8 0.01041355
2:         N 0.009508122        6 0.02002068
3:         O 0.009314068        3 0.02952123
4:         P 0.008883106       12 0.03866657
5:         Q 0.009316006       16 0.04834280
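
As a small illustration of what cumprod(best+1)-1 computes, the sketch below compounds the best values copied from the table above and compares them with a plain running sum. The variable names cum_ret and simple_sum are only for illustration.

```r
# Per-period best returns, copied from the `best` column printed above.
best <- c(0.010413549, 0.009508122, 0.009314068, 0.008883106, 0.009316006)

# Compound the returns: multiply the growth factors (1 + r), then subtract 1.
cum_ret <- cumprod(1 + best) - 1

# A running sum ignores compounding, so it is never larger than cum_ret
# when all returns are positive, as they are here.
simple_sum <- cumsum(best)

print(round(cum_ret, 8))
print(round(simple_sum, 8))
```

The compounded values reproduce the cumRet column, which shows that the quantity being maximized is a compounded return rather than a simple sum of the per-period means.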

@earch's answer really helps me see the process of calculating all these combinations. I was wondering whether there is a more efficient solution using data.table's functionality. My real data set is much larger than this one (millions of rows), and the number of combinations will start to take a toll.

Recommended answer

After being able to step through the process, I came up with a much faster solution!

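# Step 1: for each ticker within each row_names, keep the best perf and its parameters.
tmp <- dt[, list(par1 = par1[which.max(perf)], par2 = par2[which.max(perf)],
                 perf = max(perf)),
          by = list(ticker, row_names)]
# Step 2: average the per-ticker maxima within each row_names.
res <- tmp[, list(perf = mean(perf), par1 = paste(par1, collapse = ","),
                  par2 = paste(par2, collapse = ",")),
           by = row_names]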

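Using data.table, I first compute the maximum perf for each ticker within each row_names, and then take the mean of those maxima grouped by row_names. This works because the mean of one perf value per ticker is largest when each ticker's perf is at its own maximum, so there is no need to enumerate every combination. And it gets the same results! In the output, par1 and par2 each list the winning value for ticker a followed by ticker b.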

> res
   row_names        perf par1 par2
1:         M 0.010413549  2,2  2,1
2:         N 0.009508122  2,2  1,1
3:         O 0.009314068  1,1  2,1
4:         P 0.008883106  2,1  2,2
5:         Q 0.009316006  2,2  2,2
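
To check that the shortcut agrees with enumerating every pairing, here is a minimal sketch on the example data. The brute-force part uses a cartesian merge, which is my own choice for illustration and not the method from @earch's answer; the names a, b, pairs, brute, per_ticker and fast are likewise illustrative.

```r
library(data.table)

# Same construction as the dt in the question.
set.seed(1)
dt <- data.table(expand.grid(c("a", "b"), 1:2, 1:2, c("M", "N", "O", "P", "Q")))
dt$perf <- rnorm(nrow(dt), 0, .01)
setnames(dt, c("ticker", "par1", "par2", "row_names", "perf"))

# Brute force: pair every "a" row with every "b" row within each row_names,
# then take the largest pairwise mean.
a <- dt[ticker == "a", .(row_names, a_perf = perf)]
b <- dt[ticker == "b", .(row_names, b_perf = perf)]
pairs <- merge(a, b, by = "row_names", allow.cartesian = TRUE)
brute <- pairs[, .(best = max((a_perf + b_perf) / 2)), by = row_names]

# Shortcut: take each ticker's own maximum, then average the maxima.
per_ticker <- dt[, .(perf = max(perf)), by = .(ticker, row_names)]
fast <- per_ticker[, .(best = mean(perf)), by = row_names]

# Compare the two results row_names by row_names.
print(all.equal(brute[order(row_names)]$best, fast[order(row_names)]$best))
```

The brute-force table has 16 rows per row_names here, while the shortcut only ever keeps one row per ticker and row_names, so it avoids the combinatorial growth on larger data.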
