Neither percentile_cont nor percentile_disc calculates the desired 75th percentile in PostgreSQL 9.6.3

Problem description

Working with the percentile functions, but I am not getting the desired output. I would say "incorrect", but the functions are probably working as they are intended, and I am just not understanding them properly.

These are the numbers I am working with:

n = 32

160000
202800
240000
250000
265000
280000
285000
300000
300000
300000
300000
300000
309000
325000
350000
358625
364999.92
393750
400000
420000
425000
450000
450000
463500
475000
475000
505808
525000
550000
567300
665000
900000

My understanding of percentile_cont is that, if the count is even, it takes the two middle numbers, adds them, and divides by two. My understanding of percentile_disc is that, if the count is even, it just selects the lower of the two.

This is my understanding of calculating a percentile using the 50th (median) as an example:

If the number of numbers (n) is odd, pick the number in the middle; if the number is even, you average the two numbers in the middle. So in this case, there are 32 numbers, so the median = (358625 + 364999.92) / 2 = 361812.46. percentile_cont returns the correct value since it averages the two values; percentile_disc returns the incorrect value since it picks the lowest of the two.
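The median reasoning above can be checked with a minimal Python sketch. It uses only the standard library; `statistics.median` averages the two middle values and `statistics.median_low` takes the lower of the two, which stand in here for what the question describes percentile_cont(0.50) and percentile_disc(0.50) doing. The variable names are illustrative, not part of the original post.

```python
import statistics

# The 32 prices listed in the question, already sorted ascending.
prices = [
    160000, 202800, 240000, 250000, 265000, 280000, 285000, 300000,
    300000, 300000, 300000, 300000, 309000, 325000, 350000, 358625,
    364999.92, 393750, 400000, 420000, 425000, 450000, 450000, 463500,
    475000, 475000, 505808, 525000, 550000, 567300, 665000, 900000,
]

# Even count: average the two middle values (16th and 17th).
median_avg = statistics.median(prices)

# Even count: take the lower of the two middle values (the 16th).
median_low = statistics.median_low(prices)

print(round(median_avg, 2), median_low)
```

With these 32 values, the first result should match the 361812.46 worked out in the question, and the second is the lower middle value, 358625.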

Regarding other percentiles, the 10th for example, my understanding is that you multiply the percentile by the number of numbers (n) to get the index: .10 * 32 = 3.2 in this case. You are then supposed to round up to the nearest whole number, and the value at that index is your percentile value. If the index is a whole number, then you average the number at that index with the number right after it.
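The rule described in this paragraph can be written down as a small Python sketch. The function name `percentile_by_index_rule` is hypothetical, and taking the percentile as an integer (10, 75) rather than a fraction is an assumption made here so that the "is the index a whole number?" check avoids floating-point rounding.

```python
# The 32 prices listed in the question, already sorted ascending.
prices = [
    160000, 202800, 240000, 250000, 265000, 280000, 285000, 300000,
    300000, 300000, 300000, 300000, 309000, 325000, 350000, 358625,
    364999.92, 393750, 400000, 420000, 425000, 450000, 450000, 463500,
    475000, 475000, 505808, 525000, 550000, 567300, 665000, 900000,
]


def percentile_by_index_rule(sorted_values, percent):
    """Hypothetical helper for the rule in the question.

    index = percent/100 * n (1-based). If the index is fractional, round up
    and return that value; if it is whole, average it with the next value.
    """
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    n = len(sorted_values)
    # Integer arithmetic: whole part and remainder of percent * n / 100.
    whole, remainder = divmod(percent * n, 100)
    if remainder:
        # Fractional index: rounding up gives 1-based position whole + 1,
        # which is 0-based index `whole`.
        return sorted_values[whole]
    # Whole index: average the value at that 1-based position with the next.
    return (sorted_values[whole - 1] + sorted_values[whole]) / 2


print(percentile_by_index_rule(prices, 10))
print(percentile_by_index_rule(prices, 75))
```

For the 10th percentile the index is 3.2, so the sketch returns the 4th value; for the 75th the index is exactly 24, so it averages the 24th and 25th values.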

In that case, percentile_cont is wrong because it returns 251500, which isn't even a number I can arrive at. The closest I can get is averaging 240000, 250000, and 265000, which is 251666.67. percentile_disc returns the correct result of 250000.

But the real kicker is this one: the 75th. It should return 469250 according to my calculations: index = (32 * .75) = 24, and that index should result in (463500 + 475000) / 2 = 469250. percentile_disc returns 463500; percentile_cont returns 466375, and again I can't arrive at that number for the life of me.
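The 463500 from percentile_disc is consistent with its documented behavior of returning the first value whose position in the ordering equals or exceeds the requested fraction. The following is a minimal Python sketch of that definition, not PostgreSQL's actual implementation; the function name `percentile_disc_sketch` is hypothetical.

```python
import math

# The 32 prices listed in the question, already sorted ascending.
prices = [
    160000, 202800, 240000, 250000, 265000, 280000, 285000, 300000,
    300000, 300000, 300000, 300000, 309000, 325000, 350000, 358625,
    364999.92, 393750, 400000, 420000, 425000, 450000, 450000, 463500,
    475000, 475000, 505808, 525000, 550000, 567300, 665000, 900000,
]


def percentile_disc_sketch(sorted_values, fraction):
    """Return the first value whose 1-based position / n >= fraction."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n = len(sorted_values)
    # Smallest 1-based position p with p / n >= fraction, but at least 1.
    position = max(1, math.ceil(fraction * n))
    return sorted_values[position - 1]


print(percentile_disc_sketch(prices, 0.75))
print(percentile_disc_sketch(prices, 0.10))
```

With n = 32, a fraction of 0.75 lands exactly on position 24, so the 24th value is returned as-is and never averaged with the 25th.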

This is my query:

-- The function is PERCENTILE_DISC (not PERCENTILE_DIST), and aliases that
-- start with a digit must be double-quoted.
SELECT 
    itemcode, 
    COUNT(itemcode) AS n, 
    PERCENTILE_DISC(0.10) WITHIN GROUP (ORDER BY price) AS "10th",
    PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY price) AS "25th",
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY price) AS median,
    AVG(price) AS mean,
    PERCENTILE_DISC(0.65) WITHIN GROUP (ORDER BY price) AS "65th",
    PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY price) AS "75th",
    PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY price) AS "90th"
FROM items
WHERE itemcode = 26 AND removed IS NULL
GROUP BY itemcode;

Note: there are no cases where removed is not NULL.

What do I need to do to get this working correctly and consistently? Do I need to write a function that first checks whether n is even or odd and then decides whether to use percentile_disc or percentile_cont?

SQL Fiddle: http://sqlfiddle.com/#!17/aa09c/9

Recommended answer

Posted this question to Reddit and was able to get some help.

Apparently, the percentile_cont function, like the percentile and percentile.inc functions in Excel, calculates using the C = 1 variant of linear interpolation, as explained in this Wikipedia article:

https://en.wikipedia.org/wiki/Percentile (section "Second variant, C = 1")
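To see where 251500 and 466375 come from, here is a minimal Python sketch of the C = 1 linear interpolation: the fractional 1-based rank is 1 + fraction * (n - 1), and the result is interpolated between the two neighbouring values. The function name `percentile_cont_sketch` is hypothetical, and this is an illustration of the formula rather than PostgreSQL's source code.

```python
import math

# The 32 prices listed in the question, already sorted ascending.
prices = [
    160000, 202800, 240000, 250000, 265000, 280000, 285000, 300000,
    300000, 300000, 300000, 300000, 309000, 325000, 350000, 358625,
    364999.92, 393750, 400000, 420000, 425000, 450000, 450000, 463500,
    475000, 475000, 505808, 525000, 550000, 567300, 665000, 900000,
]


def percentile_cont_sketch(sorted_values, fraction):
    """Linear interpolation between closest ranks, C = 1 variant."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n = len(sorted_values)
    # 0-based fractional rank; the 1-based rank is this value plus one.
    rank = fraction * (n - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    weight = rank - lower
    low_value = sorted_values[lower]
    high_value = sorted_values[upper]
    return low_value + weight * (high_value - low_value)


print(percentile_cont_sketch(prices, 0.75))
print(round(percentile_cont_sketch(prices, 0.10), 2))
```

For the 75th percentile the 1-based rank is 1 + 0.75 * 31 = 24.25, so the result is 463500 + 0.25 * (475000 - 463500) = 466375. For the 10th the rank is 4.1, giving 250000 + 0.1 * (265000 - 250000) = 251500.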

So PostgreSQL's native functions won't give me the result I want, and I will need to write a custom function, which I will post when I am done. (I suspect it will use the old ntile method from before 9.4, but I am still looking into it.)

But anyway, that is why it is off.
