Getting SciPy quantiles to match Stata xtile function

Question

I've inherited some old Stata code (Stata 11) that uses the xtile function to categorize the observations in a vector by quantile (in this case, just the five standard quintiles, with cut points at 20%, 40%, 60%, 80%, and 100%).

I'm trying to replicate a piece of the code in Python, and I am using the scipy.stats.mstats function mquantiles() for the computation.

As near as I can tell from the Stata documentation and from searching online, the Stata xtile method tries to invert the empirical CDF of the data, and where the CDF is flat it uses the equal-weighted average of all observations in that flat range as the cut point. This seems like a very poor way to categorize quantiles, but it is what it is, and I am sure there are cases where it is the right thing to do.

My question is how to make mquantiles() follow the same convention for placing the breaks. I noticed that this function has two parameters, alphap and betap. (The documentation calls them alpha and beta, but you need the extra 'p' to get it to work. At least I do: I get an error if I use just 'alpha' and 'beta' with Python 2.7.1 and SciPy 0.10.0.) But even in the SciPy docs, I can't see whether any combination of these parameters produces the mean over flat CDF ranges.

I see what looks like an option to compute the cut point as the median or mode of this range, but not as the mean. (It's also not clear whether these SciPy median/mode options with alpha and beta are computed as the median/mode of the observations or of the range that would produce the flat CDF value.)

Any help disambiguating these different options, and finding documentation that helps me recreate the Stata convention in Python, would be great. Please refrain from answers that just say "write your own quantile function." Firstly, that doesn't help me understand the conventions of either Stata or SciPy. Secondly, given these numerical libraries, writing my own quantile function should be a last resort. I can certainly do it, but it would be bad all around if I had to.

Recommended answer

The scipy.stats.mstats.mquantiles documentation was poor, and wrong in places. It has been fixed now, so this might be helpful: http://docs.scipy.org/scipy/docs/scipy.stats.mstats_basic.mquantiles/. That process started when you pointed out the alpha/beta versus alphap/betap discrepancy. Thank you.

The implementation of mquantiles follows R.

The biggest difference is that R has 9 discrete types, whereas scipy.stats.mstats.mquantiles calculates 'm' from 'alphap' and 'betap', so SciPy has a continuous range of "types" (for lack of a better word).
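To make the role of 'm' concrete, here is a minimal NumPy sketch of the formula as I read it in the mquantiles docstring (m = alphap + p*(1 - alphap - betap), then linear interpolation between neighbouring order statistics). The function name plotting_position_quantiles is hypothetical and this is a restatement for illustration, not SciPy's actual source; it assumes 1-D data with at least two values and no masked or missing entries.

```python
import numpy as np

def plotting_position_quantiles(data, prob, alphap=0.4, betap=0.4):
    """Illustrative restatement of the mquantiles formula (hypothetical helper)."""
    x = np.sort(np.asarray(data, dtype=float))
    p = np.asarray(prob, dtype=float)
    n = len(x)
    # 'm' shifts the plotting position; it varies continuously with alphap/betap.
    m = alphap + p * (1.0 - alphap - betap)
    aleph = n * p + m
    # k is the 1-based index of the lower neighbour, gamma the interpolation weight.
    k = np.floor(np.clip(aleph, 1, n - 1)).astype(int)
    gamma = np.clip(aleph - k, 0.0, 1.0)
    return (1.0 - gamma) * x[k - 1] + gamma * x[k]

data = [23, 56, 67, 123, 99, 17]
quartiles = [0.25, 0.5, 0.75]
# A few (alphap, betap) pairs; (0.4, 0.4) is the SciPy default.
for alphap, betap in [(0.0, 1.0), (0.5, 0.5), (1.0, 1.0), (0.4, 0.4)]:
    cuts = plotting_position_quantiles(data, quartiles, alphap, betap)
    print(alphap, betap, cuts)
```

Any pair of values between 0 and 1 can be passed, which is what makes the family continuous rather than a fixed list of types. With alphap=betap=0.5 the sketch gives 23, 61.5 and 99 for the quartiles of this small sample.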

I admit that I do not understand all of the ins and outs of the statistics involved, so I settled on a brute-force evaluation. I found an xtile example at http://www.biostat.sdu.dk/~biostat/StataReferenceManual/StataRef.pdf and was able to match its results with alphap=0.5 and betap=0.5 (piecewise linear). This is neither definitive nor exhaustive, but it is all I have right now.

In [1]: import scipy.stats as st

In [9]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.5],alphap=0.5,betap=.5)
Out[9]: array([ 61.5])

In [10]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.33,0.66],alphap=0.5,betap=.5)
Out[10]: array([ 38.84,  81.72])

In [11]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.25,0.5,0.75],alphap=0.5,betap=.5)
Out[11]: array([ 23. ,  61.5,  99. ])

The last one is a little problematic, since two of the division points fall exactly on values in the data set. Stata's xtile (at least in the examples that I found) does not report the split points between the quantiles; it reports the quantile category of each observation. Given the sorted data set [17,23,56,67,99,123], xtile gave the categorization [1,1,2,3,3,4]. So for scipy.stats.mstats.mquantiles to match, each cut point has to be treated as an inclusive upper bound: the upper bound of a quantile is greater than or equal to all values in that quantile.
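One way to apply that inclusive-upper-bound rule is numpy.searchsorted with side="left", which places a value equal to a cut point in the lower category. This is only a sketch for the six-value example above, with variable names of my own choosing; it relies on exact floating-point equality at the ties and has not been checked against Stata on other data.

```python
import numpy as np
from scipy.stats import mstats

data = np.array([23, 56, 67, 123, 99, 17])

# Quartile cut points using the alphap/betap pair that matched the xtile example.
cuts = np.asarray(
    mstats.mquantiles(data, prob=[0.25, 0.5, 0.75], alphap=0.5, betap=0.5)
)

# side="left" counts the cut points strictly below each value, so a value
# equal to a cut point stays in the lower category (inclusive upper bound).
categories = np.searchsorted(cuts, data, side="left") + 1

# Show the categories in sorted-data order for comparison with xtile.
print(categories[np.argsort(data)])
```

For this data set the categories in sorted order come out as 1, 1, 2, 3, 3, 4, the same as the xtile categorization quoted above.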
