Percentiles from counts of values

Question

I want to calculate percentiles from an ensemble of multiple large vectors in Python. Instead of trying to concatenate the vectors and then putting the resulting huge vector through numpy.percentile, is there a more efficient way?

My idea would be, first, counting the frequencies of different values (e.g. using scipy.stats.itemfreq), second, combining those item frequencies for the different vectors, and finally, calculating the percentiles from the counts.
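
As an illustration of the first two steps of this idea, the sketch below counts the values of each vector with np.unique(..., return_counts=True), which yields the same information as scipy.stats.itemfreq (deprecated in later SciPy releases), and then merges the tables even when they cover different items. The helper name merge_counts and the sample vectors are assumptions made for this example, not part of the original question.

```python
import numpy as np


def merge_counts(tables):
    """Merge (values, counts) frequency tables that may cover different items."""
    all_values = np.concatenate([v for v, _ in tables])
    all_counts = np.concatenate([c for _, c in tables])
    # inverse maps every entry of all_values to its slot in merged_values
    merged_values, inverse = np.unique(all_values, return_inverse=True)
    # Sum the counts that land in the same slot
    merged_counts = np.bincount(inverse, weights=all_counts).astype(np.int64)
    return merged_values, merged_counts


# Hypothetical sample data: two small vectors with partly different items
vectors = [np.array([1, 2, 2, 5]), np.array([2, 3, 3, 9])]
tables = [np.unique(v, return_counts=True) for v in vectors]
values, counts = merge_counts(tables)
print(values)  # [1 2 3 5 9]
print(counts)  # [1 3 2 1 1]
```

Only the distinct values and their counts are kept per vector, so memory use depends on the number of distinct values rather than on the total number of elements.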

Unfortunately I haven't been able to find functions either for combining the frequency tables (it is not very simple, as different tables may cover different items), or for calculating percentiles from an item frequency table. Do I need to implement these, or can I use existing Python functions? What would those functions be?

Recommended answer

Following Julien Palard's suggestion, collections.Counter solves the first problem (calculating and combining frequency tables). Below is my implementation for the second problem (calculating percentiles from frequency tables):

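from collections import Counter


def calc_percentiles(cnts_dict, percentiles_to_calc=range(101)):
    """Return [(percentile, value)] computed from a table of value counts.

    A percentile p < 100 maps to the value at zero-based index
    floor(p / 100 * N) of the sorted data (nearest-rank style, no
    interpolation). Percentile 0: <min_value>, 100: <max_value>.
    cnts_dict: { <value>: <count> }, must not be empty
    percentiles_to_calc: iterable of percentiles to calculate; 0 <= p <= 100
    """
    # Materialize first so a one-shot iterator is not exhausted by validation.
    percentiles_to_calc = sorted(percentiles_to_calc)
    if not all(0 <= p <= 100 for p in percentiles_to_calc):
        raise ValueError("percentiles must lie in [0, 100]")
    if not cnts_dict:
        raise ValueError("cnts_dict must not be empty")
    percentiles = []
    num = sum(cnts_dict.values())
    cnts = sorted(cnts_dict.items())
    curr_cnts_pos = 0  # current position in cnts
    curr_pos = cnts[0][1]  # sum of counts up to and including curr_cnts_pos
    for p in percentiles_to_calc:
        if p < 100:
            percentile_pos = p / 100.0 * num
            # Stop at the last entry so the index can never run past the end.
            while curr_pos <= percentile_pos and curr_cnts_pos < len(cnts) - 1:
                curr_cnts_pos += 1
                curr_pos += cnts[curr_cnts_pos][1]
            percentiles.append((p, cnts[curr_cnts_pos][0]))
        else:
            percentiles.append((p, cnts[-1][0]))  # the maximum value
    return percentiles


# segment_iterator stands for any iterable that yields the individual vectors.
cnts_dict = Counter()
for segment in segment_iterator:
    cnts_dict.update(segment)  # adds the counts of this segment in place

percentiles = calc_percentiles(cnts_dict)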
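
The same lookup can also be vectorized. The following is a minimal sketch, assuming NumPy is available and all counts are positive integers; the function name percentiles_from_counts and the sample data are hypothetical. It is meant to follow the same rule as calc_percentiles: take the first value whose cumulative count exceeds p / 100 * N, and return the maximum for p = 100.

```python
import numpy as np


def percentiles_from_counts(values, counts, q):
    """Return the value for each percentile in q, given a frequency table."""
    values = np.asarray(values)
    counts = np.asarray(counts)
    # Sort the table by value so the cumulative counts follow the data order
    order = np.argsort(values)
    values, counts = values[order], counts[order]
    cum = np.cumsum(counts)
    total = cum[-1]
    targets = np.asarray(q, dtype=float) / 100.0 * total
    # side="right" finds the first index whose cumulative count exceeds the target
    idx = np.searchsorted(cum, targets, side="right")
    # p = 100 would point one past the end, so clamp it to the maximum value
    idx = np.minimum(idx, len(values) - 1)
    return values[idx]


# Hypothetical frequency table; it stands for the data [1, 2, 2, 2, 3, 3, 5, 9]
values = [1, 2, 3, 5, 9]
counts = [1, 3, 2, 1, 1]
result = percentiles_from_counts(values, counts, [0, 50, 100])
print(result)  # [1 3 9]
```

With a Counter as input, the two arrays can be obtained from list(cnts_dict.keys()) and list(cnts_dict.values()), since the function sorts the table itself.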
