How to compute percentiles from a frequency table?
Question
I have a CSV file:
fr id
1 10000152
1 10000212
1 10000847
1 10001018
2 10001052
2 10001246
14 10001908
...........
This is a frequency table, where id is an integer variable and fr is the number of occurrences of the given value. The file is sorted in ascending order by value.
I would like to compute the percentiles (i.e. 90%, 80%, 70% ... 10%) of the variable.
I have done this in pure Python, similar to this pseudocode:
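bucket = sum(fr) / 10.0  # number of observations per 10% step
percentile = 1
total = 0  # running count (renamed from sum to avoid shadowing the built-in)
for current_fr, current_id in zip(fr, id):
    total = total + current_fr
    if total > percentile * bucket:
        print("%i percentile: %i" % (percentile * 10, current_id))
        percentile = percentile + 1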
But this code is very raw: it doesn't take into account that a percentile should fall between values from the set, it can't step back, etc.
Is there a more elegant, universal, ready-made solution?
Recommended answer
It seems like you want the cumulative sum of fr. You can do
cumfr = [sum(fr[:i+1]) for i in range(len(fr))]
Then the percentile reached at each value is
percentile = [100.0*i/cumfr[-1] for i in cumfr]
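The two list comprehensions above give the cumulative percentage reached at each value. To go the other way and look up the value for a requested percentile, one possible sketch is shown below. It assumes the nearest-rank definition (the smallest value whose cumulative count covers at least p percent of the observations), so it does not interpolate between values. The function name percentiles_from_freq is hypothetical, and the sample data is only the rows visible in the question, not the full file.

```python
from bisect import bisect_left
from itertools import accumulate


def percentiles_from_freq(values, freqs, percents):
    """Map each requested percent to a value, using the nearest-rank rule.

    values must be sorted ascending; freqs[i] is the count of values[i].
    This is an illustrative helper, not a library function.
    """
    cum = list(accumulate(freqs))  # running totals in linear time
    total = cum[-1]
    result = {}
    for p in percents:
        target = p * total / 100  # number of observations to cover
        # First index whose cumulative count reaches the target.
        idx = bisect_left(cum, target)
        result[p] = values[idx]
    return result


# Only the rows shown in the question; the real file is longer.
fr = [1, 1, 1, 1, 2, 2, 14]
ids = [10000152, 10000212, 10000847, 10001018, 10001052, 10001246, 10001908]

for p, v in percentiles_from_freq(ids, fr, range(10, 100, 10)).items():
    print("%i percentile: %i" % (p, v))
```

Because each percentile is looked up independently with a binary search, a single row with a large count (such as fr = 14 here) can satisfy several percentiles at once, which the loop in the question cannot do.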