How to get the mean and standard deviation from a frequency distribution table in Python

Question

I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need measures of central tendency (mean, median) and measures of dispersion (variance, standard deviation) for this data. I would also like to draw a boxplot of the values.

I see that NumPy arrays have direct methods for getting the mean, median, and standard deviation (or variance) from a list of values.

Does NumPy (or any other well-known library) have a direct way to operate on such a frequency distribution table?

Also, what is the best way to programmatically expand the above list of tuples into one flat list? For example, if the frequency distribution is [(1, 3), (50, 2)], what is the best way to get the list [1, 1, 1, 50, 50] so that I can call np.mean([1, 1, 1, 50, 50])?

I see a custom function here, but I would like to use a standard implementation if possible.

Recommended answer

First, I'd change that messy list into two NumPy arrays, as @user8153 did:

val, freq = np.array(list_tuples).T

Then you can reconstruct the array (using np.repeat to avoid looping):

data = np.repeat(val, freq)
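
As a concrete illustration, here is a minimal sketch of the two steps above applied to the frequency table from the question, [(1, 3), (50, 2)]. The explicit integer cast is an added precaution and not part of the answer: np.repeat needs integer repeat counts, and a table containing float values would promote the whole transposed array to float.

```python
import numpy as np

# Frequency table from the question: (value, frequency) pairs.
list_tuples = [(1, 3), (50, 2)]

# Transposing splits the pairs into a row of values and a row of frequencies.
val, freq = np.array(list_tuples).T

# np.repeat requires integer repeat counts. The cast changes nothing here,
# but it matters if float values promoted the whole array to float.
freq = freq.astype(int)

# Repeat each value as many times as its frequency says.
data = np.repeat(val, freq)
print(data.tolist())  # [1, 1, 1, 50, 50]
```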

And use NumPy's statistical functions on your data array.
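
A short sketch of what that might look like on the expanded sample for the question's table [(1, 3), (50, 2)]. Note that np.var and np.std compute the population statistic by default (ddof=0); pass ddof=1 for the sample version. The quartiles are included only because they are the box edges a boxplot of the same array would show.

```python
import numpy as np

# Expanded sample for the frequency table [(1, 3), (50, 2)].
data = np.repeat([1, 50], [3, 2])

mean = np.mean(data)                # arithmetic mean
median = np.median(data)            # middle value of the sorted data
var_pop = np.var(data)              # population variance (ddof=0, the default)
var_sample = np.var(data, ddof=1)   # sample variance (divides by n - 1)
std_sample = np.std(data, ddof=1)   # sample standard deviation

# First and third quartiles: the box edges of a boxplot of `data`.
q1, q3 = np.percentile(data, [25, 75])

print(mean, median, var_pop, var_sample, std_sample, q1, q3)
```

The same expanded array can be handed to a plotting library's boxplot function, which addresses the boxplot part of the question without any extra preparation.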

If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:

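import numpy as np

def mean_(val, freq):
    # Weighted mean: each value counts freq times.
    return np.average(val, weights=freq)

def median_(val, freq):
    # Sort the values so the frequencies accumulate in value order.
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    n = cdf[-1]
    # 1-based ranks of the middle observation(s); they coincide when n is odd.
    lo, hi = np.searchsorted(cdf, [(n + 1) // 2, n // 2 + 1])
    return (val[order][lo] + val[order][hi]) / 2

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):
    # Sample variance (divides by n - 1), matching np.var(data, ddof=1).
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))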
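
One way to gain confidence in weighted helpers like these is to compare them with NumPy's own functions on the expanded array for a small table. The sketch below is self-contained. The names weighted_mean and weighted_var, the ddof parameter, and the sample table are illustrative assumptions, not part of the answer.

```python
import numpy as np

def weighted_mean(val, freq):
    # Mean of the expanded data, computed without expanding it.
    return np.average(val, weights=freq)

def weighted_var(val, freq, ddof=1):
    # Frequency-weighted sum of squared deviations over (n - ddof).
    avg = weighted_mean(val, freq)
    return (freq * (val - avg) ** 2).sum() / (freq.sum() - ddof)

# Hypothetical frequency table, already split into values and counts.
val = np.array([1, 50, 7])
freq = np.array([3, 2, 4])

# Reference: the fully expanded sample.
data = np.repeat(val, freq)

print(np.isclose(weighted_mean(val, freq), data.mean()))
print(np.isclose(weighted_var(val, freq), data.var(ddof=1)))
print(np.isclose(weighted_var(val, freq, ddof=0), data.var()))
```

If all three comparisons print True, the weighted versions agree with the expanded-array results for this table. They never materialize the expanded array, which is the point when the frequencies are large.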
