How to get mean and standard deviation from a frequency distribution table in Python
Question
I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need to get measures of central tendency (mean, median) and measures of dispersion (variance, standard deviation) for this data. I would also like to plot a boxplot of the values.
I see that numpy arrays have direct methods for getting the mean, median, and standard deviation (or variance) from a list of values.
Does numpy (or any other well-known library) have a direct way to operate on such a frequency distribution table?
Also, what is the best way to programmatically expand the above list of tuples into one list? For example, if the frequency distribution is [(1, 3), (50, 2)], what is the best way to get the list [1, 1, 1, 50, 50] so that I can call np.mean([1, 1, 1, 50, 50])?
I see a custom function here, but I would like to use a standard implementation if possible.
Recommended answer
First, I'd change that messy list into two numpy arrays, like @user8153 did:
import numpy as np
val, freq = np.array(list_tuples).T
Then you can reconstruct the full array (using np.repeat to avoid an explicit loop):
data = np.repeat(val, freq)
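For the example from the question, the expansion might look like the sketch below. The name `list_tuples` and the sample data come from the text above; the explicit `dtype=int` is an added assumption that both values and frequencies are integers, since np.repeat needs integer repeat counts.

```python
import numpy as np

# Sample frequency table from the question: (value, frequency) pairs
list_tuples = [(1, 3), (50, 2)]

# Assumption: values and counts are integers, so dtype=int keeps
# freq usable as repeat counts for np.repeat
val, freq = np.array(list_tuples, dtype=int).T

# Repeat each value as many times as its frequency
data = np.repeat(val, freq)
print(data.tolist())  # [1, 1, 1, 50, 50]
```

If the values are floats, the transposed array would make the frequencies floats as well, so in that case it is probably safer to build the two arrays separately and cast the frequencies to integers before calling np.repeat.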
Then use numpy's statistical functions on the data array.
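As a rough illustration, the usual numpy calls on the expanded array could look like this. The sample data is the [(1, 3), (50, 2)] table from the question; note that np.var and np.std default to the population formulas (ddof=0), so ddof=1 is passed here where the sample versions are wanted.

```python
import numpy as np

# Expanded form of the frequency table [(1, 3), (50, 2)]
data = np.repeat(np.array([1, 50]), np.array([3, 2]))

mean = np.mean(data)               # 103 / 5 = 20.6
median = np.median(data)           # middle of [1, 1, 1, 50, 50] is 1.0
var_pop = np.var(data)             # population variance (ddof=0)
var_sample = np.var(data, ddof=1)  # sample variance (n - 1 denominator)
std_sample = np.std(data, ddof=1)  # sample standard deviation

print(mean, median, var_pop, var_sample, std_sample)
```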
If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions that work on val and freq directly:
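def mean_(val, freq):
    # Frequency-weighted mean
    return np.average(val, weights=freq)

def median_(val, freq):
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    n = cdf[-1]
    # 1-based ranks of the lower and upper middle observations (equal when n is odd)
    lo, hi = np.searchsorted(cdf, [(n + 1) // 2, n // 2 + 1])
    return (val[order][lo] + val[order][hi]) / 2

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):
    # Sample variance (n - 1 denominator), matching np.var(data, ddof=1)
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))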
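One way to gain confidence in table-based formulas like these is to compare them with the same statistics computed on the expanded data. The sketch below does that inline for the mean, the sample variance, and the sample standard deviation; the sample table and the names `w_mean` and `w_var` are illustrative only.

```python
import numpy as np

val = np.array([1, 50])
freq = np.array([3, 2])

# Reference: statistics of the fully expanded data
data = np.repeat(val, freq)

# The same statistics computed from the table alone
w_mean = np.average(val, weights=freq)
w_var = (freq * (val - w_mean) ** 2).sum() / (freq.sum() - 1)

# Both routes should agree up to floating-point error
assert np.isclose(w_mean, data.mean())
assert np.isclose(w_var, data.var(ddof=1))
assert np.isclose(np.sqrt(w_var), data.std(ddof=1))
```

The table-based route never materializes the expanded array, which is the point when the frequencies are large.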