Pandas describe vs scipy.stats percentileofscore with NaN?

Question

I'm running into a weird situation where pd.describe gives me percentile markers that disagree with scipy.stats percentileofscore, and I think the NaNs are the reason.

My df is:

      f_recommend
0     3.857143
1     4.500000
2     4.458333
3          NaN
4     3.600000
5          NaN
6     4.285714
7     3.587065
8     4.200000
9          NaN

When I run df.describe(percentiles=[.25, .5, .75]) I get:

       f_recommend
count     7.000000
mean      4.069751
std       0.386990
min       3.587065
25%       3.728571
50%       4.200000
75%       4.372024
max       4.500000

I get the same values when I run it with the NaNs removed.

However, when I want to look up a specific value and run scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind = 'mean'), I get the 28th percentile with the NaNs and the 20th without them.

Any thoughts on what explains this discrepancy?

ETA:

I don't believe the problem is that we're calculating percentiles differently, because that only matters when you're computing a percentile between the same two numbers in different ways. But here describe gives the 25th percentile as 3.72, so there is no way that 3.61 can be the 28th percentile; none of the formulas should give that.
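
For reference, here is a minimal sketch (my own illustration, assuming a recent NumPy where np.percentile accepts a method keyword) of that point: every interpolation rule puts the 25th percentile of the seven non-NaN values somewhere between the neighboring sorted values 3.6 and 3.857143, so changing the formula can only move the result within that gap.

import numpy as np

# the seven non-NaN values from the f_recommend column
vals = np.array([3.857143, 4.500000, 4.458333, 3.600000,
                 4.285714, 3.587065, 4.200000])

# each rule lands between the 2nd and 3rd sorted values (3.6 and 3.857143);
# the rules only disagree about where inside that gap
for method in ["linear", "lower", "higher", "midpoint", "nearest"]:
    print(method, np.percentile(vals, 25, method=method))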

In particular, when I use describe on just the non-NaN values, I get the same numbers, so describe is ignoring NaN, which is fine. But when I run percentileofscore without the NaNs I get a number that doesn't match.

ETA 2:

A simpler example:

In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])

In [49]: d.describe()
Out[49]: 
              0
count  7.000000
mean   4.000000
std    2.160247
min    1.000000
25%    2.500000
50%    4.000000
75%    5.500000
max    7.000000

In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573

种类"参数无关紧要,因为 2.1 是独一无二的.

the "kind" argument doesn't matter because 2.1 is unique.

Answer

scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:

In [44]: np.nan > 0
Out[44]: False

In [45]: np.nan < 0
Out[45]: False

In [46]: np.nan == 0
Out[46]: False

In [47]: np.nan == np.nan
Out[47]: False

Those results are all correct: that is how nan is supposed to behave. But it means that, in order to know how percentileofscore handles nan, you have to know how the code does its comparisons, and that is an implementation detail you shouldn't have to know and can't rely on staying the same in future versions of scipy.

If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, you get the same result if you replace the nans with a value larger than any other value in the input:

In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664

In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664

Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
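
To see how much that could change the answer, here is a short sketch (my own illustration, not part of the original answer) comparing the current +inf-like behavior with what a -inf-like treatment of nan would give for the same data:

import numpy as np
from scipy.stats import percentileofscore

# today the nans act like +inf: only 10 counts as below 18 (about 16.7)
print(percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18))

# if an implementation change made nan act like -inf instead, three of the
# six values would count as below 18 and the result would jump to 50.0
print(percentileofscore([10, 20, 25, 30, -np.inf, -np.inf], 18))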

The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:

result = percentileofscore(a[~np.isnan(a)], score)
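
If the data lives in a pandas column, as in the question, Series.dropna() does the same cleanup; a minimal sketch using the values from the question (not part of the original answer):

import numpy as np
import pandas as pd
from scipy.stats import percentileofscore

df = pd.DataFrame({"f_recommend": [3.857143, 4.500000, 4.458333, np.nan,
                                   3.600000, np.nan, 4.285714, 3.587065,
                                   4.200000, np.nan]})

# drop the NaNs first, then ask for the percentile rank of a score;
# the result is now computed over the 7 real values only
clean = df["f_recommend"].dropna()
print(percentileofscore(clean, 3.61, kind="mean"))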
