Explain the quantile() function in R
Question
I've been mystified by the R quantile function all day.
I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.
From the documentation:
Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],
I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
So, how do I calculate j? And m?
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
Now I'm really lost. p, which was a constant before, is now apparently a function.
So for Type 7 quantiles, the default...
Type 7
p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.
Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, what the heck m is, and how to calculate j for some particular p.
I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.
For reference, see the quantile.R source code, or type quantile.default at the R prompt.
Recommended answer
You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.
where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.
The first part comes straight from the paper, but what the documentation writers omitted is that j = int(pn+m). This means Q[i](p) depends only on the two order statistics closest to being a fraction p of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations are simply the sorted series.)
Also, that last sentence is just wrong. It should read
Here gamma depends on the fractional part of np+m, g = np+m-j
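To make the roles of j and g concrete, here is a minimal Python sketch (not R's actual source). The function names `locate` and `q_type7` are made up for illustration. It assumes type 7, where alpha = beta = 1 and therefore m = 1 - p by the formula quoted in the question, and it assumes that an index falling outside 1..n is clamped to the nearest end of the sorted sample.

```python
import math

def locate(p, n, m):
    """Return (j, g): j = floor(n*p + m), g = fractional part of n*p + m."""
    h = n * p + m
    j = math.floor(h)
    g = h - j
    return j, g

def q_type7(x, p):
    """Type 7 quantile: interpolate between order statistics x[j] and x[j+1]."""
    xs = sorted(x)                      # order statistics; xs[k-1] is x[k]
    n = len(xs)
    m = 1 - p                           # type 7: m = alpha + p*(1 - alpha - beta), alpha = beta = 1
    j, g = locate(p, n, m)
    # Assumption: clamp the 1-based indices j and j+1 into 1..n at the edges.
    lo = xs[min(max(j, 1), n) - 1]
    hi = xs[min(max(j + 1, 1), n) - 1]
    return (1 - g) * lo + g * hi        # gamma = g for the continuous types

j, g = locate(0.3, 5, 1 - 0.3)
print(j, round(g, 6))                                   # 2 0.2
print(round(q_type7([10, 20, 30, 40, 50], 0.3), 6))     # 22.0
```

With n = 5 and p = 0.3, np+m = 1.5 + 0.7 = 2.2, so j = 2 and g = 0.2, and the result is 0.8 * 20 + 0.2 * 30 = 22.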
As for m, that's straightforward. m depends on which of the 9 algorithms was chosen. So just as Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0; for algorithm 3, m is -1/2; and for the others, it is given in the next part.
For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):
p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.
This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as p_k, which helps, especially since in the expression for m the p is the original p: m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (p_k, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in how they define p_k.
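The two views can be checked against each other with a short Python sketch (again, not R's source). The (alpha, beta) pairs are read off the p(k) formulas listed for types 4-9 in R's ?quantile help page. The function names, the clamping outside the range of the plotting positions, and the requirement n >= 2 are assumptions made for this illustration.

```python
import math

# (alpha, beta) per continuous type, matching the p(k) formulas in ?quantile
ALPHA_BETA = {
    4: (0.0, 1.0),      # p(k) = k / n
    5: (0.5, 0.5),      # p(k) = (k - 0.5) / n
    6: (0.0, 0.0),      # p(k) = k / (n + 1)
    7: (1.0, 1.0),      # p(k) = (k - 1) / (n - 1)
    8: (1 / 3, 1 / 3),  # p(k) = (k - 1/3) / (n + 1/3)
    9: (3 / 8, 3 / 8),  # p(k) = (k - 3/8) / (n + 1/4)
}

def plotting_positions(n, alpha, beta):
    """p_k = (k - alpha) / (n - alpha - beta + 1) for k = 1..n."""
    return [(k - alpha) / (n - alpha - beta + 1) for k in range(1, n + 1)]

def quantile_by_interpolation(x, p, qtype=7):
    """Interpolate linearly through the points (p_k, x[k]); needs n >= 2."""
    xs = sorted(x)
    pk = plotting_positions(len(xs), *ALPHA_BETA[qtype])
    if p <= pk[0]:                      # assumption: flat below the first point
        return xs[0]
    if p >= pk[-1]:                     # assumption: flat above the last point
        return xs[-1]
    for k in range(len(xs) - 1):
        if pk[k] <= p <= pk[k + 1]:
            t = (p - pk[k]) / (pk[k + 1] - pk[k])
            return (1 - t) * xs[k] + t * xs[k + 1]
    return xs[-1]                       # not reached for increasing pk

def quantile_by_formula(x, p, qtype=7):
    """Same quantity via m = alpha + p*(1 - alpha - beta), j = floor(np + m), gamma = g."""
    xs = sorted(x)
    n = len(xs)
    alpha, beta = ALPHA_BETA[qtype]
    h = n * p + alpha + p * (1 - alpha - beta)
    j = math.floor(h)
    g = h - j
    lo = xs[min(max(j, 1), n) - 1]
    hi = xs[min(max(j + 1, 1), n) - 1]
    return (1 - g) * lo + g * hi

print(plotting_positions(5, 1.0, 1.0))  # [0.0, 0.25, 0.5, 0.75, 1.0]
data = [10, 20, 30, 40, 50]
for qtype in ALPHA_BETA:
    for p in (0.1, 0.3, 0.5, 0.9):
        a = quantile_by_interpolation(data, p, qtype)
        b = quantile_by_formula(data, p, qtype)
        assert abs(a - b) < 1e-9        # both routes agree
```

The agreement is no accident: setting p = p_k and solving for k gives k = np + alpha + p(1 - alpha - beta) = np + m, so np + m is simply the fractional position of p among the order statistics.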
As for the last bit, R is just stating what S uses.
The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles, and it is what's described in Wikipedia.
BTW, in response to dreeves' answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's approach is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (MathWorld's Quantile page states that Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)