What happens when the prob argument in sample sums to less/greater than 1?

Question

We know that the prob argument of sample assigns probability weights to the elements being sampled.

For example,

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6

#    1     2     3     4 
#0.200 0.400 0.299 0.100 

In this example, the probabilities sum to exactly 1 (0.2 + 0.4 + 0.3 + 0.1), so sample gives the expected ratios. But what if the probabilities do not sum to 1? What output would it give? I thought it would result in an error, but it returns values anyway.

When the probabilities sum to more than 1:

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6

#     1      2      3      4 
#0.1544 0.3842 0.3848 0.0767 

When the probabilities sum to less than 1:

table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6

#    1     2     3     4 
#0.125 0.125 0.625 0.125 

As we can see, repeated runs give proportions that are not equal to prob, yet the results are not random either. How are the numbers distributed in this case? Where is it documented?

I tried searching on the internet but didn't find any relevant information. I looked through the documentation at ?sample, which has

The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true, Walker's alias method (Ripley, 1987) is used when there are more than 200 reasonably probable values: this gives results incompatible with those from R < 2.2.0.

So it says that the prob argument need not sum to 1, but it doesn't say what to expect when it doesn't. I am not sure if I am missing part of the documentation. Does anybody have any idea?

Recommended answer

Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.

If you look at the R code, sample always calls another R function, sample.int. If you pass a single number x to sample, it uses sample.int to draw integers less than or equal to that number, whereas if x is a vector, it uses sample.int to generate a sample of integers less than or equal to length(x), then uses that to subset x.
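The delegation described above can be checked from the R prompt. This is a minimal sketch, assuming (as the description of sample's R code suggests) that for a vector x the call reduces to subsetting x by the result of sample.int. The names x, w, via_sample and via_sample_int are illustrative and do not come from the original post.

```r
# Illustrative inputs (not from the original post)
x <- c("a", "b", "c", "d")
w <- c(0.2, 0.5, 0.5, 0.1)   # weights that do not sum to 1

# Draw with sample() ...
set.seed(42)
via_sample <- sample(x, 10, replace = TRUE, prob = w)

# ... and again by sampling indices with sample.int(), from the same seed
set.seed(42)
via_sample_int <- x[sample.int(length(x), 10, replace = TRUE, prob = w)]

# If sample() simply delegates, both routes give the same draws
identical(via_sample, via_sample_int)
```

Resetting the seed before each call is what makes the two draws comparable; without it the second call would continue from a different point in the random number stream.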

Now, if you examine the function sample.int, it looks like this:

function (n, size = n, replace = FALSE, prob = NULL, useHash = (!replace && 
    is.null(prob) && size <= n/2 && n > 1e+07)) 
{
    if (useHash) 
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

The .Internal means any sampling is done by calling compiled code written in C: in this case, it's the function do_sample, defined in src/main/random.c.

If you look at this C code, do_sample checks whether it has been passed a prob vector. If not, it samples on the assumption of equal weights. If prob exists, the function ensures that it is numeric and not NA. If prob passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called FixupProb.

This function examines each element of prob and throws an error if any element is negative, non-finite or NA. It then normalises the numbers by dividing each by the sum of all. The code therefore has no inherent preference for prob summing to 1: even if prob sums to 1 in your input, the function will still calculate the sum and divide each number by it.
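The normalisation step can be reproduced in plain R to see where the proportions in the question come from. This is a sketch that mirrors the divide-by-the-sum logic described above, not the C code itself. The variable names are illustrative, and the sample size of 1e5 is an arbitrary choice.

```r
# Weights from the question; neither vector sums to 1
w_high <- c(0.2, 0.5, 0.5, 0.1)   # sums to 1.3
w_low  <- c(0.1, 0.1, 0.5, 0.1)   # sums to 0.8

# The same normalisation done in R: divide each weight by the total
p_high <- w_high / sum(w_high)
p_low  <- w_low / sum(w_low)

round(p_high, 4)  # 0.1538 0.3846 0.3846 0.0769
round(p_low, 3)   # 0.125 0.125 0.625 0.125

# Empirical frequencies should land close to the normalised weights
set.seed(1)
n <- 1e5
draws <- sample(1:4, n, replace = TRUE, prob = w_high)
freq_high <- as.numeric(table(factor(draws, levels = 1:4))) / n
max(abs(freq_high - p_high))  # small, and shrinks as n grows
```

The normalised values match the proportions observed in the question for both the "more than 1" and the "less than 1" case, which is consistent with the weights simply being rescaled.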

Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that prob should be a vector of weights, not absolute probabilities.

So, from my reading of the code, the behaviour of the prob parameter should be:

  1. prob can be absent altogether, in which case sampling defaults to equal weights.
  2. If any element of prob is less than zero, infinite, or NA, the function throws an error.
  3. An error should be thrown if any of the prob values are non-numeric, as they will be interpreted as NA in the SEXP passed to the C code.
  4. prob must have the same length as x, or the C code throws an error.
  5. You can pass zero as one or more elements of prob if you have specified replace=TRUE, as long as at least one element is non-zero.
  6. If you specify replace=FALSE, the number of samples you request must be less than or equal to the number of non-zero elements in prob. Essentially, FixupProb throws an error if you ask it to sample an element with zero probability.
  7. A valid prob vector is normalised to sum to 1 and used as sampling weights.
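The rules above can be probed without reading the C source. This is a small sketch: the helper fails() is hypothetical, written only for this illustration, and reports whether a call signals an error. It deliberately does not depend on the wording of the error messages.

```r
# Hypothetical helper: TRUE if evaluating the expression signals an error
fails <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

x <- 1:4

# Each of these calls is expected to be rejected
checks <- c(
  negative   = fails(sample(x, 2, prob = c(-0.1, 0.5, 0.5, 0.1))),
  na         = fails(sample(x, 2, prob = c(NA, 0.5, 0.5, 0.1))),
  infinite   = fails(sample(x, 2, prob = c(Inf, 0.5, 0.5, 0.1))),
  bad_length = fails(sample(x, 2, prob = c(0.5, 0.5))),
  too_few    = fails(sample(x, 3, replace = FALSE, prob = c(1, 1, 0, 0)))
)
checks

# Zero weights are accepted with replacement if at least one weight is positive;
# elements with zero weight are never drawn
zero_ok <- sample(x, 10, replace = TRUE, prob = c(1, 1, 0, 0))
all(zero_ok %in% c(1, 2))
```

Using try() with silent = TRUE keeps the session running after each rejected call, so all cases can be checked in one pass.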

As an interesting side effect of this behaviour, you can use odds instead of probabilities when choosing between 2 alternatives, by setting prob = c(1, odds).
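To illustrate the odds trick, here is a short sketch with made-up values: odds of 3 to 1 in favour of the second alternative, and outcome labels "no" and "yes". After normalisation the second alternative should be chosen with probability odds / (1 + odds).

```r
odds <- 3                      # illustrative: 3-to-1 in favour of "yes"
outcomes <- c("no", "yes")

set.seed(1)
n <- 1e5
draws <- sample(outcomes, n, replace = TRUE, prob = c(1, odds))

p_yes_expected <- odds / (1 + odds)   # 0.75 after normalisation
p_yes_observed <- mean(draws == "yes")

c(expected = p_yes_expected, observed = p_yes_observed)
```

The weight 1 is attached to the first alternative and the odds to the second, so swapping the order of outcomes would swap which alternative the odds favour.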
