How to compute percentiles from a frequency table?

Question

I have a CSV file:

fr id
 1 10000152
 1 10000212
 1 10000847
 1 10001018
 2 10001052
 2 10001246
14 10001908
...........

This is a frequency table, where id is an integer variable and fr is the number of occurrences of the given value. The file is sorted in ascending order by value. I would like to compute the percentiles (i.e. 90%, 80%, 70% ... 10%) of the variable.

I have done this in pure Python, similar to this pseudocode:

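# fr and id are the two columns of the table, as parallel lists
bucket = sum(fr) / 10.0   # number of observations per 10% step
percentile = 1
running_total = 0         # renamed from sum to avoid shadowing the built-in
for current_fr, current_id in zip(fr, id):
    running_total += current_fr
    if running_total > percentile * bucket:
        print("%i percentile: %i" % (percentile * 10, current_id))
        percentile += 1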

But this code is very raw: it doesn't take into account that a percentile may fall between two values from the set, it can't step back, etc.

Is there a more elegant, universal, ready-made solution?

Recommended answer

It seems like you want the cumulative sum of fr. You can do

cumfr = [sum(fr[:i+1]) for i in range(len(fr))]

Then the percentile reached at each value is

percentile = [100.0*i/cumfr[-1] for i in cumfr]
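
The cumulative sum above gives the percentage reached at each value, but the question asks for the value at each 10% step. A minimal sketch of that lookup follows, assuming the nearest-rank definition of a percentile (the smallest value whose cumulative count reaches p% of the total), so it does not interpolate between values. The function name `percentiles_from_freq` and the variable `ids` are hypothetical, and the sample data is only the seven rows visible in the question.

```python
from bisect import bisect_left
from itertools import accumulate


def percentiles_from_freq(values, freqs, percents):
    """Nearest-rank percentiles of a frequency table sorted by value."""
    cum = list(accumulate(freqs))       # running totals of the counts
    total = cum[-1]
    result = {}
    for p in percents:
        rank = p * total / 100          # observations at or below the percentile
        idx = bisect_left(cum, rank)    # first row whose running total reaches rank
        result[p] = values[min(idx, len(values) - 1)]
    return result


# Only the seven rows shown in the question; the real file is longer.
fr = [1, 1, 1, 1, 2, 2, 14]
ids = [10000152, 10000212, 10000847, 10001018, 10001052, 10001246, 10001908]

deciles = percentiles_from_freq(ids, fr, range(10, 100, 10))
for p, value in deciles.items():
    print("%i percentile: %i" % (p, value))
```

Using `accumulate` builds the running totals in one pass, and `bisect_left` finds each percentile by binary search, so a row with a large count that spans several 10% steps is reported for every one of them.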
