Sample with a max


Question

If I want to sample numbers to create a vector, I do:

set.seed(123)
x <- sample(1:100,200, replace = TRUE)
sum(x)
# [1] 10228

What if I want to sample 20 random numbers that sum to 100, and then 30 numbers that still sum to 100? I imagine this is more of a challenge than it seems. Neither ?sample nor a Google search has given me a clue. A loop that samples and then rejects any draw whose sum is not close enough to the desired sum (e.g. within 5) would, I guess, take some time.

Is there a better way to achieve this?

An example would be:

foo(10,100) # ten random numbers that sum to 100. (not including zeros)
# 10,10,20,7,8,9,4,10,2,20

Recommended answer

Here's another attempt. It doesn't use sample; it uses runif instead. I've added an optional "message" to the output showing the sum, which is controlled by the showSum argument. There is also a Tolerance argument that specifies how close to the target the sum must be.

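SampleToSum <- function(Target = 100, VecLen = 10,
                        InRange = 1:100, Tolerance = 2,
                        showSum = TRUE) {
  repeat {
    # Cut [0, 1] at VecLen - 1 sorted uniform points; the gaps sum to 1,
    # so scaling by Target and rounding gives VecLen values near Target.
    Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
    # Accept only if every value is positive and inside InRange, and the
    # rounded total is within Tolerance of Target; otherwise redraw.
    if (all(Res > 0) &&
        all(Res >= min(InRange)) &&
        all(Res <= max(InRange)) &&
        abs(sum(Res) - Target) <= Tolerance) break
  }
  if (isTRUE(showSum)) cat("Total = ", sum(Res), "\n")
  Res
}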
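
To see why the function needs a Tolerance at all, it may help to run its core step in isolation. The snippet below is only an illustrative sketch: the seed and the variable names (n, target, cuts, gaps, pieces) are my own choices, not part of the answer. The gaps between sorted uniform cut points always sum to 1, but rounding each scaled gap can move the total away from the target by a few units.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible
n <- 10
target <- 100

# n - 1 sorted cut points in (0, 1) split the unit interval into n gaps
cuts <- sort(runif(n - 1))
gaps <- diff(c(0, cuts, 1))
sum(gaps)             # equals 1 up to floating-point error

# Scaling preserves the total; rounding each piece may shift it slightly
pieces <- round(gaps * target)
length(pieces)        # n values
sum(pieces) - target  # rounding drift, at most n / 2 in absolute value
```

Each rounding changes a piece by at most 0.5, which is why the drift is bounded by n / 2 and why a small Tolerance accepts many more draws than Tolerance = 0.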

Here are some examples.

Compare the default settings with setting Tolerance = 0:

set.seed(1)
SampleToSum()
# Total =  101 
#  [1] 20  6 11 20  6  3 24  1  4  6
SampleToSum(Tolerance=0)
# Total =  100 
#  [1] 19 15  4 10  1 11  7 16  4 13

You can verify this behavior using replicate. Here's the result of setting Tolerance = 0 and running the function 5 times.

system.time(output <- replicate(5, SampleToSum(
  Target = 1376,
  VecLen = 13,
  InRange = 10:200,
  Tolerance = 0)))
# Total =  1376 
# Total =  1376 
# Total =  1376 
# Total =  1376 
# Total =  1376 
#    user  system elapsed 
#   0.144   0.000   0.145
output
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]   29   46   11   43  171
#  [2,]  103  161  113  195  197
#  [3,]  145  134   91  131  147
#  [4,]  154  173  138   19   17
#  [5,]  197   62  173   11   87
#  [6,]  101  142   87  173   99
#  [7,]  168   61   97   40  121
#  [8,]  140  121   99  135  117
#  [9,]   46   78   31  200   79
# [10,]  140  168  146   17   56
# [11,]   21  146  117  182   85
# [12,]   63   30  180  179   78
# [13,]   69   54   93   51  122

And here is the same with Tolerance = 5, again running the function 5 times.

system.time(output <- replicate(5, SampleToSum(
  Target = 1376,
  VecLen = 13,
  InRange = 10:200,
  Tolerance = 5)))
# Total =  1375 
# Total =  1376 
# Total =  1374 
# Total =  1374 
# Total =  1376 
#    user  system elapsed 
#   0.060   0.000   0.058 
output
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]   65  190  103   15   47
#  [2,]  160   95   98  196  183
#  [3,]  178  169  134   15   26
#  [4,]   49   53  186   48   41
#  [5,]  104   81  161  171  180
#  [6,]   54  126   67  130  182
#  [7,]   34  131   49  113   76
#  [8,]   17   21  107   62   95
#  [9,]  151  136  132  195  169
# [10,]  194  187   91  163   22
# [11,]   23   69   54   97   30
# [12,]  190   14  134   43  150
# [13,]  156  104   58  126  175

Not surprisingly, setting the tolerance to 0 makes the function slower: 0.145 seconds elapsed versus 0.058 seconds in the two runs above.

Note that since this is a "random" process, it's hard to guess how long it will take to find the right combination of numbers. For example, after calling set.seed(123), I ran the following test three times in a row:

system.time(SampleToSum(Target = 1163,
                        VecLen = 15,
                        InRange = 50:150))

The first run took just over 9 seconds. The second took just over 7.5 seconds. The third took... just under 381 seconds! That's a lot of variation!

Out of curiosity, I added a counter to the function, and the first run took 55026 attempts to arrive at a vector that satisfied all of our conditions! (I didn't bother repeating this for the second and third runs.)
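
The counter version itself isn't shown, so here is one way it might look. This is a sketch under my own assumptions: the name SampleToSumCount and the list return value are hypothetical, and the acceptance test is copied from SampleToSum.

```r
# Hypothetical variant of SampleToSum that also reports the attempt count
SampleToSumCount <- function(Target = 100, VecLen = 10,
                             InRange = 1:100, Tolerance = 2) {
  attempts <- 0
  repeat {
    attempts <- attempts + 1  # count every draw, accepted or rejected
    Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
    if (all(Res > 0) &&
        all(Res >= min(InRange)) &&
        all(Res <= max(InRange)) &&
        abs(sum(Res) - Target) <= Tolerance) break
  }
  list(values = Res, attempts = attempts)
}

set.seed(1)
out <- SampleToSumCount()
out$attempts     # number of draws this call needed
sum(out$values)  # within Tolerance of Target
```

Returning the count next to the values makes it easy to collect attempt counts over many calls, for example with replicate, and see how widely they vary.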

It might be good to add some error or sanity checking to the function to make sure the inputs are reasonable. For example, one should not be able to call SampleToSum(Target = 100, VecLen = 10, InRange = 15:50): with a range of 15 to 50, ten values sum to at least 150, so there is no way to reach 100 and the loop would never terminate.
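
A minimal sketch of such a check follows. The helper name CheckSampleToSumInputs is hypothetical and not part of the original answer. It only compares the smallest and largest totals the range allows against the target, so it catches clearly impossible inputs but says nothing about how long a feasible search will take.

```r
# Hypothetical helper: fail fast when no vector can satisfy the constraints
CheckSampleToSumInputs <- function(Target, VecLen, InRange, Tolerance = 0) {
  stopifnot(is.numeric(Target), length(Target) == 1,
            is.numeric(VecLen), length(VecLen) == 1, VecLen >= 1,
            is.numeric(InRange), length(InRange) >= 1,
            is.numeric(Tolerance), length(Tolerance) == 1, Tolerance >= 0)
  # SampleToSum also requires every value to be > 0, hence the max(1, ...)
  lo <- VecLen * max(1, min(InRange))  # smallest achievable total
  hi <- VecLen * max(InRange)          # largest achievable total
  if (lo > Target + Tolerance || hi < Target - Tolerance) {
    stop(sprintf("Totals from %s to %s are reachable, but Target is %s (Tolerance %s)",
                 lo, hi, Target, Tolerance))
  }
  invisible(TRUE)
}

# The infeasible call from the text is rejected instead of looping forever
res <- try(CheckSampleToSumInputs(Target = 100, VecLen = 10, InRange = 15:50),
           silent = TRUE)
inherits(res, "try-error")

# A feasible combination passes silently
ok <- CheckSampleToSumInputs(Target = 1376, VecLen = 13, InRange = 10:200)
```

Calling a check like this at the top of SampleToSum would turn an endless loop into an immediate error.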
