Explain the quantile() function in R

Question

I've been mystified by the R quantile function all day.

I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.

From the docs:

Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],

I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.

where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

So, how do I calculate j? And m?

For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.

Now I'm really lost. p, which was a constant before, is now apparently a function.

So for Type 7 quantiles, the default...

Type 7

p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.

Anyone want to help me out? In particular, I'm confused by the notation in which p is both a function and a constant, by what the heck m is, and by how to calculate j for some particular p.

I hope that, based on the answers here, we can submit some revised documentation that better explains what is going on.

For the implementation, see the quantile.R source code, or type quantile.default at the R prompt.

Recommended answer

You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.

where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

The first part comes straight from the paper, but what the documentation writers omitted is that j = int(pn+m), the integer part of pn+m. This means Q[i](p) depends only on the two order statistics closest to being a fraction p of the way through the (sorted) observations. (For those who, like me, are unfamiliar with the term, the "order statistics" of a series of observations are the observations in sorted order.)

Also, that last sentence is just wrong. It should read:

Here gamma depends on the fractional part of np+m, g = np+m-j
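To make the roles of j and g concrete, here is a minimal sketch in R. The helper name hf_index is hypothetical (it is not part of R), m is simply passed in as a number, and the value m = 0 used below is the one R's ?quantile help page gives for type 4. The sketch ignores the edge cases j = 0 and j = n and the small floating-point fuzz that quantile.default applies.

```r
# Locate the two order statistics that Q[i](p) interpolates between.
# hf_index is a hypothetical helper name, not part of R.
hf_index <- function(n, p, m) {
  npm <- n * p + m   # the quantity np + m
  j <- floor(npm)    # j = int(np + m)
  g <- npm - j       # g = fractional part of np + m
  list(j = j, g = g)
}

x <- sort(c(40, 10, 30, 20, 50, 90, 70, 60, 100, 80))  # order statistics x[1..n]
n <- length(x)
p <- 0.25

idx <- hf_index(n, p, m = 0)  # assumption: m = 0, the type 4 value in ?quantile

# For the continuous types gamma = g, so Q(p) is a plain linear interpolation.
q_manual <- (1 - idx$g) * x[idx$j] + idx$g * x[idx$j + 1]
q_builtin <- unname(quantile(x, p, type = 4))

all.equal(q_manual, q_builtin)
```

Here np + m = 2.5, so j = 2 and g = 0.5, and the result lies halfway between x[2] and x[3].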

As for m, that's straightforward. m depends on which of the 9 algorithms was chosen. So just as Q[i] is the quantile function for type i, m should be read as m[i]. For algorithms 1 and 2, m is 0; for algorithm 3, m is -1/2; the others are covered in the next part.
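The discontinuous types can be sketched as follows. The function name hf_discrete is hypothetical, and the gamma rules are an assumption taken from R's ?quantile help page rather than from the text above: type 1 uses gamma = 0 when g = 0 and 1 otherwise, type 2 uses 0.5 when g = 0 and 1 otherwise, and type 3 uses 0 only when g = 0 and j is even. Unlike quantile.default, this sketch handles a single p at a time and applies no floating-point fuzz, so it is only reliable when np + m is computed exactly.

```r
# Sketch of types 1-3; hf_discrete is a hypothetical name, not part of R.
hf_discrete <- function(x, p, type) {
  x <- sort(x)
  n <- length(x)
  m <- if (type == 3) -0.5 else 0   # m[1] = m[2] = 0, m[3] = -1/2
  npm <- n * p + m
  j <- floor(npm)
  g <- npm - j
  gamma <- switch(type,
    if (g > 0) 1 else 0,                  # type 1
    if (g > 0) 1 else 0.5,                # type 2
    if (g == 0 && j %% 2 == 0) 0 else 1   # type 3
  )
  lo <- x[min(max(j, 1), n)]       # clamp so that j = 0 falls back to x[1]
  hi <- x[min(max(j + 1, 1), n)]   # clamp so that j = n falls back to x[n]
  (1 - gamma) * lo + gamma * hi
}

x <- c(10, 20, 30, 40)
sapply(1:3, function(ty) hf_discrete(x, 0.5, ty))
sapply(1:3, function(ty) unname(quantile(x, 0.5, type = ty)))
```

The two sapply calls should print the same three numbers, which is a quick way to check the rules against R itself.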

For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.

This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it with a subscript, as p[k], which helps, especially since the p in the expression for m is the original p: m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (p[k], x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in how they define p[k].
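The continuous types can be sketched the same way. The names hf_ab and hf_continuous are hypothetical, and the alpha and beta values in the table are an assumption taken from R's ?quantile help page (they are not listed in the text above). Note that p inside the expression for m is the requested probability, not a plotting position.

```r
# alpha and beta per type, as listed in ?quantile (assumed, not from the text).
hf_ab <- list(`4` = c(0, 1),     `5` = c(1/2, 1/2), `6` = c(0, 0),
              `7` = c(1, 1),     `8` = c(1/3, 1/3), `9` = c(3/8, 3/8))

# Sketch of types 4-9; hf_continuous is a hypothetical name, not part of R.
hf_continuous <- function(x, p, type) {
  x <- sort(x)
  n <- length(x)
  ab <- hf_ab[[as.character(type)]]
  alpha <- ab[1]
  beta <- ab[2]
  m <- alpha + p * (1 - alpha - beta)  # p is the requested probability
  npm <- n * p + m
  j <- floor(npm)
  g <- npm - j                         # gamma = g for types 4-9
  lo <- x[min(max(j, 1), n)]           # clamp at the smallest observation
  hi <- x[min(max(j + 1, 1), n)]       # clamp at the largest observation
  (1 - g) * lo + g * hi
}

x <- c(2.3, 7.1, 0.4, 5.5, 9.8, 3.2, 6.0)
for (ty in 4:9) {
  cat("type", ty, ":",
      hf_continuous(x, 0.25, ty),
      unname(quantile(x, 0.25, type = ty)), "\n")
}
```

Each printed line should show the hand-computed value next to R's own, and the two should agree up to rounding.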

As for the last bit, R is just stating what S uses.
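For the type 7 plotting positions p(k) = (k - 1) / (n - 1) quoted in the question, the "interpolate the points" view can be checked directly with base R's approx(). This is only an illustrative sketch on made-up data, not how quantile.default is implemented.

```r
x <- sort(c(12, 5, 9, 21, 17))   # order statistics x[1..n]
n <- length(x)
k <- seq_len(n)
pk <- (k - 1) / (n - 1)          # type 7 plotting positions

p <- c(0.1, 0.4, 0.8)
q_interp <- approx(pk, x, xout = p)$y          # interpolate the points (pk, x[k])
q_builtin <- unname(quantile(x, p, type = 7))  # R's default type

all.equal(q_interp, q_builtin)
```

Because the type 7 positions run from exactly 0 to exactly 1, every p in [0, 1] falls between two plotting positions and no extrapolation rule is needed.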

The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8, which satisfies all but one. #5 satisfies all of them, but the authors don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles, and it is what's described in Wikipedia.
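To see how much the choice of type matters on a small sample, one can ask for the same quantile under all nine types. This is a small illustration on made-up data; the variable name medians is arbitrary.

```r
x <- c(1, 2, 3, 4)

# quantile() takes a single type per call, so loop over the nine types.
medians <- sapply(1:9, function(ty) unname(quantile(x, 0.5, type = ty)))
names(medians) <- paste0("type", 1:9)
medians
```

On a sample this small the types do not all agree, which is why it pays to know which one a given package uses by default.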

BTW, in response to dreeves's answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's approach is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Mathworld's Quantile page states that Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)
