Descriptive stats from frequency table in pandas
Question
I have a frequency table of test scores:
score count
----- -----
77 1105
78 940
79 1222
80 4339
etc
I want to show basic statistics and a boxplot for the sample that the frequency table summarizes. (For example, the mean of the above example is 79.16 and the median is 80.)
Is there a way to do this in pandas? All the examples I have seen assume a table of individual cases.
I suppose I could generate a list of individual scores, like this --
In [2]: s = pd.Series([77] * 1105 + [78] * 940 + [79] * 1222 + [80] * 4339)
In [3]: s.describe()
Out[3]:
count 7606.000000
mean 79.156324
std 1.118439
min 77.000000
25% 78.000000
50% 80.000000
75% 80.000000
max 80.000000
dtype: float64
-- but I am hoping to avoid that; total frequencies in the real, non-toy dataset run well into the billions.
Any help is appreciated.
(I think this is a different question from "Using describe() with weighted data", which is about applying weights to individual cases.)
Recommended answer
Here's a small function that calculates descriptive statistics for frequency distributions:
# from __future__ import division  (needed on Python 2 only)
import numpy as np
import pandas as pd

def descriptives_from_agg(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # Sort both arrays by value so the cumulative counts are meaningful.
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values**2).sum() / count) - mean**2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    # Quartiles: first value whose cumulative count reaches the target (no interpolation).
    cumcount = np.cumsum(freqs)
    Q1 = values[np.searchsorted(cumcount, 0.25*count)]
    Q2 = values[np.searchsorted(cumcount, 0.50*count)]
    Q3 = values[np.searchsorted(cumcount, 0.75*count)]
    idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
    return result
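As a quick sanity check of the mean and standard deviation, here is a minimal sketch on the toy table from the question. It is not the function above: it uses np.average with weights for the mean and the squared-deviation form of the sample variance, which is algebraically equivalent to the E[x^2] - mean^2 form but can be less prone to cancellation error. The variable names are illustrative.

```python
import numpy as np

# Toy frequency table from the question.
values = np.array([77, 78, 79, 80], dtype=float)
freqs = np.array([1105, 940, 1222, 4339])

n = freqs.sum()
# Weighted mean: each score counts as many times as its frequency.
mean = np.average(values, weights=freqs)
# Frequency-weighted sum of squared deviations, divided by n - 1 (sample variance).
var = (freqs * (values - mean) ** 2).sum() / (n - 1)
std = np.sqrt(var)

print(n, mean, std)
```

The printed count, mean, and std should agree with the s.describe() output shown in the question, without ever materializing the 7606 individual scores.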
Demo:
np.random.seed(0)
val = np.random.normal(100, 5, 1000).astype(int)
pd.Series(val).describe()
Out:
count 1000.000000
mean 99.274000
std 4.945845
min 84.000000
25% 96.000000
50% 99.000000
75% 103.000000
max 113.000000
dtype: float64
vc = pd.Series(val).value_counts()  # pd.value_counts(val) is deprecated in recent pandas
descriptives_from_agg(vc.index, vc.values)
Out:
count 1000.000000
mean 99.274000
std 4.945845
min 84.000000
25% 96.000000
50% 99.000000
75% 103.000000
max 113.000000
dtype: float64
Note that this doesn't handle NaNs and has not been properly tested.
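One further caveat: the quartiles in the function are picked with searchsorted and no interpolation, so they can differ from describe() when a quantile falls between two distinct values. The sketch below is one possible way to get linearly interpolated quantiles (the default method of np.quantile) straight from the counts. The helper name quantile_from_freq and the dictionary at the end are assumptions made for illustration; whiskers are simply set to the minimum and maximum rather than following the 1.5 * IQR rule.

```python
import numpy as np

def quantile_from_freq(values, freqs, q):
    """Linearly interpolated quantile of the sample described by (values, freqs)."""
    values = np.asarray(values, dtype=float)
    freqs = np.asarray(freqs)
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cumcount = np.cumsum(freqs)
    n = cumcount[-1]
    # Fractional 0-based position of the quantile in the (virtual) sorted sample.
    h = (n - 1) * q
    lo, hi = int(np.floor(h)), int(np.ceil(h))
    # The element at 0-based position i has the first value whose cumulative count exceeds i.
    v_lo = values[np.searchsorted(cumcount, lo, side="right")]
    v_hi = values[np.searchsorted(cumcount, hi, side="right")]
    return v_lo + (h - lo) * (v_hi - v_lo)

# Small example where interpolation matters: the sample is [1, 1, 2, 2].
values, freqs = [1, 2], [2, 2]
q1, med, q3 = (quantile_from_freq(values, freqs, q) for q in (0.25, 0.5, 0.75))

# Summary statistics for a boxplot; whiskers at min/max are a simplifying assumption.
box = {"med": med, "q1": q1, "q3": q3,
       "whislo": min(values), "whishi": max(values), "fliers": []}
print(box)
```

A dictionary of this shape can be passed, wrapped in a list, to matplotlib's Axes.bxp, which draws a boxplot from precomputed statistics. That covers the boxplot part of the question without expanding the frequency table.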