Getting descriptive statistics with (analytic) weighting using describe() in python
Question
I was trying to translate code from Stata to Python.
The original code in Stata:
by year, sort : summarize age [aweight = wt]
Normally a simple describe() call will do:
dataframe.groupby("year")["age"].describe()
But I could not find a way to translate the aweight option into Python, i.e., to get descriptive statistics of a dataset under analytic/variance weighting.
Code to generate the dataset in Python:
dataframe = pd.DataFrame({'year': [2016, 2016, 2020, 2020], 'age': [41, 65, 35, 28], 'wt': [1.2, 0.7, 0.8, 1.5]})
If I run by year, sort : summarize age [aweight = wt] in Stata, the outcome (for the 2016 group) is: mean = 49.842 and SD = 16.37.
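Stata's aweight rescales the weights so they sum to the number of observations before applying the usual n - 1 variance correction. A minimal numpy sketch (not part of the original question) reproducing the 2016 figures:

```python
import numpy as np

# 2016 group from the example dataset
age = np.array([41.0, 65.0])
wt = np.array([1.2, 0.7])

mean = np.average(age, weights=wt)

# Stata aweight convention: rescale weights to sum to n, then divide by n - 1
n = len(age)
w = wt * n / wt.sum()
var = np.sum(w * (age - mean) ** 2) / (n - 1)
sd = np.sqrt(var)

print(round(mean, 3), round(sd, 2))  # → 49.842 16.37
```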
What should I do to get the same outcome in Python?
Answer
So I wrote a function that performs the same thing as describe except that it takes a weight argument. I tested it on the small dataframe you provided, but haven't gone into too much detail. I tried not to use .apply in case you have a large dataframe, though I didn't run a benchmark to see whether my approach would be faster or less memory-intensive than writing a function that does a weighted describe for a single by group and then using apply to run it on each by group in the dataframe. That would probably be easiest.
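For reference, the apply-based alternative mentioned above could look roughly like this (a sketch, not the answer's code; the helper name wdesc is made up, and only count, mean, and std are shown, using Stata's aweight convention of rescaling weights to sum to the group size):

```python
import numpy as np
import pandas as pd

def wdesc(g, col='age', wt='wt'):
    # weighted mean and std for one group, Stata aweight convention
    x = g[col].to_numpy(float)
    w = g[wt].to_numpy(float)
    n = len(x)
    w = w * n / w.sum()                       # rescale weights to sum to n
    mean = np.average(x, weights=w)
    var = np.sum(w * (x - mean) ** 2) / (n - 1)
    return pd.Series({'count': n, 'mean': mean, 'std': np.sqrt(var)})

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
print(df.groupby('year')[['age', 'wt']].apply(wdesc))
```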
Counts, min, and max can be taken without regard to weighting. Then I compute a simple weighted mean and standard deviation, using the formula for unbiased variance. I included an option for frequency weighting, which should only affect the sample size used to adjust the variance to the unbiased estimator: with frequency weights the sum of the weights is used as the sample size; otherwise the row count in the data is used. I used this answer to help get weighted percentiles.
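The weighted-percentile step can be illustrated on the 2016 group alone: each sorted observation is placed at the midpoint of its normalized weight in the cumulative distribution, and np.interp reads the quantiles off that curve (a sketch of that approach, not code from the answer):

```python
import numpy as np

age = np.array([41.0, 65.0])        # 2016 ages, sorted ascending
wt = np.array([1.2, 0.7])

w = wt / wt.sum()                   # normalize weights to sum to 1
# midpoint of each observation's weight in the cumulative distribution
pos = np.cumsum(w) - 0.5 * w

# quantile probabilities below pos[0] clamp to the minimum observation
q = np.interp([0.25, 0.5, 0.75], pos, age)
print(q)
```

With these two observations, the 50% and 75% values land at 49.842 and 61.842, matching the function's output below.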
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df
year age wt
0 2016 41 1.2
1 2016 65 0.7
2 2020 35 0.8
3 2020 28 1.5
Then I define the function below.
def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use sample size as sum of weights (only affects
        degrees of freedom correction for unbiased variance)
    '''
    if isinstance(by, list):
        df = df.sort_values(by + [col])
    else:
        df = df.sort_values([by] + [col])

    newcols = ['gb_weights', 'col_weighted', 'col_mean', 'col_sqdiff',
               'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])

    # normalize the weights so they sum to one within each group
    df['gb_weights'] = df[wt] / df.groupby(by)[wt].transform('sum')
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()

    # weighted mean and weighted squared deviations from the group mean
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**0.5
    wstd.name = 'std'
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'

    # weighted percentiles via linear interpolation, offsetting each group
    # by its group number so one interp call covers all groups
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max() + 1
    x = np.hstack([quantiles + i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))

    # unweighted stats, then assemble everything in describe()'s column order
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])
    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]
    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]

    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    stats['weight'] = sizes

    # use the "sample size" (weight) to obtain the std. deviation from the
    # unbiased variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
    return stats
Then test it.
weighted_groupby_describe(df, 'age', 'year', 'wt')
count mean std min ... 50% 75% max weight
year ...
2016 2 49.842105 16.372398 41 ... 49.842105 61.842105 65 2
2020 2 30.434783 4.714936 28 ... 30.434783 33.934783 35 2
Compare this to the output without the weights.
df.groupby('year')['age'].describe()
count mean std min 25% 50% 75% max
year
2016 2.0 53.0 16.970563 41.0 47.00 53.0 59.00 65.0
2020 2.0 31.5 4.949747 28.0 29.75 31.5 33.25 35.0