给定条件下随机关联两个向量的元素 [英] Randomly associate elements of two vectors given conditions

查看:128
本文介绍了给定条件下随机关联两个向量的元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个data.table 资本

I have a data.table of capitals

capitals<-data.table(capital=c(100,50,25,5))
capitals
   capital
1:     100
2:      50
3:      25
4:       5

和一个data.table的损失

and a data.table of losses

losses<-data.table(loss=c(45,10,5,1))
losses
   loss
1:   45
2:   10
3:    5
4:    1

我想随机地将每个资本与损失(无替代)相关联,使损失小于或等于资本。在伪代码中,一个可能的实现是

I would like to randomly associate each capital with a loss (without replacement) such that the loss is less than or equal to the capital. In pseudo code one possible implementation would be

Set all capitalLoss to NA (i.e. capitals[, capitalLoss:=NA])
Order losses from largest to smallest
For each loss in losses
    randomly pick from capitals where capital>=loss and is.na(capitalLoss)
    set capitalLoss to loss
Next

如何实现这一点,使其非常高效?您可以假设资本损失具有相同的行数,并且至少一个映射如我所述

How can I implement this so that it's very efficient? You may assume that capitals and losses have the same number of rows and that at least one mapping as I described it is possible.

此示例可能的随机关联是

Possible random associations for this example are

   capital capitalLoss
1:     100          10
2:      50          45
3:      25           1
4:       5           5

   capital capitalLoss
1:     100          45
2:      50           1
3:      25          10
4:       5           5


推荐答案

对这个问题的天真的解决方案涉及到n个资本值的循环,并且对于每个资本值,搜索n个损失值,使得求解时间变化n ^ 2。对资本循环可能没有多少可以做,但损失搜索时间可以通过两种方式减少。首先,找到需要搜索的损失的上限,可以通过排序和使用findInterval(),然后在大写循环中的第二个可能的损失被传递给sample()的列表可以找到Alex和Shambho do更新为我在下面,而不是从整个列表重新创建。由于可能损失的列表的大小总是远小于n,所以该方法的执行时间与n几乎线性地增加,这导致对于该范围n的显着减少的执行时间。在循环中的每次迭代中创建具有完全空间的丢失跟踪向量而不是分配空间也是有帮助的。我的函数也返回与输入看起来合适的资本值相同的顺序的结果。 Microbenchmark报告ffben和ffwalt的时间,如下所示为Ben的数据集。请注意,时间以毫秒为单位。

The naive solution to this problem involves a loop over n capital values and, for each capital value, a search over n loss values so that the solution time varies by n^2. Probably not much can be done about the capital loop, but the loss search time can be reduced in two ways. First, find the upper bounds for the losses which need to be searched can be found as Alex and Shambho do by sorting and using findInterval() and then second within the capital loop the list of possible losses to be passed to sample() can be updated as I do below rather than re-created from the entire list. Since the size of the list of possible losses is always much smaller than n, the execution times with this approach increase more nearly linearly with n which results in significantly reduced execution times for this range of n. It’s also helpful to create the loss tracking vector with full space rather than alloc space on each iteration in loop. My function also returns the results in the same order as the capital values were input which seems proper. Microbenchmark reports the times for ffben and ffwalt as shown below for both of Ben’s data sets. Note that times are in milliseconds.

Unit: milliseconds

              expr         min         lq      median          uq         max neval
    ffben(cap2, los2)   1549.8289   1556.113   1565.7139   1592.3230   1593.9527     5
   ffwalt(cap2, los2)    205.4834    206.267    206.5975    207.0464    212.9808     5
 ffben(capital, loss) 154235.8823 154855.444 154969.9196 155052.6070 156250.5489     5
ffwalt(capital, loss)   2071.3610   2074.692   2099.4889   2100.1091   2117.4721     5

由于资本数据集是10倍cap2数据集的大小,看起来ffben的时间增加为n ^ 2,而ffwalt的时间仅线性增加,两者都如预期。

Since the capital data set is 10x the size of the cap2 data set, it appears that the times for ffben increase as n^2 while the times for ffwalt increase only linearly, both as expected.

ffwalt <- function( caps, loss) {
len_cap <- length(caps)
loss_srt <- sort(loss)
caps_ord <- order(caps)
caps_srt <- caps[caps_ord]
cap_mx_ls_idx <- findInterval(caps_srt, loss_srt)  # find upper loss bounds for each value of capital
loss_picked <- vector("numeric",len_cap)  #  alocate space for full loss vector to avoid mem alloc time in capital loop
samp <- seq_len(cap_mx_ls_idx[1])
for( i in seq_len(len_cap-1) )  {
  loss_picked[i] <- sample(x=samp,1, replace=FALSE)
  if(cap_mx_ls_idx[i+1] > cap_mx_ls_idx[i]) 
       add_samp <- seq(cap_mx_ls_idx[i]+1,cap_mx_ls_idx[i+1],1)
  else add_samp  <- NULL
  samp <- c(samp[samp != loss_picked[i]], add_samp)
}
loss_picked[len_cap] <- samp             # avoid problem with sample() when x has length 1
results <- data.frame(capital=caps_srt, loss=loss_srt[loss_picked])
results[caps_ord,] <- results            # restore original caps order
return(results)
}

这篇关于给定条件下随机关联两个向量的元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆