Getting descriptive statistics with (analytic) weighting using describe() in Python


Question

I was trying to translate code from Stata to Python.

The original code in Stata:

by year, sort : summarize age [aweight = wt]

Normally a simple describe() function will do:

dataframe.groupby("year")["age"].describe()

But I could not find a way to translate the aweight option into Python, i.e. to get descriptive statistics of a dataset under analytic/variance weighting.

Code to generate the dataset in Python:

import pandas as pd

dataframe = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                          'age': [41, 65, 35, 28],
                          'wt': [1.2, 0.7, 0.8, 1.5]})

If I run by year, sort : summarize age [aweight = wt] in Stata, the outcome for the 2016 group is: mean = 49.842 and SD = 16.37.
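For reference, here is a minimal numpy sketch (not part of the original question) that reproduces those two numbers for the 2016 group, assuming analytic weights are normalized and the unbiased-variance correction uses the observation count n rather than the sum of the weights:

import numpy as np

age = np.array([41, 65])           # the 2016 group
wt = np.array([1.2, 0.7])

mean = np.average(age, weights=wt)  # weighted mean

# analytic (aweight) variance: weighted squared deviations, then an
# n/(n-1) correction using the number of observations, not sum(wt)
n = len(age)
var = np.average((age - mean) ** 2, weights=wt) * n / (n - 1)

print(mean, var ** 0.5)  # 49.8421..., 16.3724...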

What should I do to get the same outcome in Python?

Answer

So I wrote a function that performs the same thing as describe except that it takes a weight argument. I tested it on the small dataframe you provided, but haven't gone into too much detail. I tried not to use .apply in case you have a large dataframe, though I didn't run a benchmark to see whether my approach is faster or less memory intensive than writing a function that does a weighted describe for one by-group and applying it to each by-group in the dataframe with .apply. That would probably be the easiest approach (a sketch of it is included at the end of this answer).

Counts, min and max can be taken without regard to weighting. Then I compute the simple weighted mean and the standard deviation from the formula for unbiased variance. I included an option for frequency weighting, which only affects the sample size used to adjust the variance to the unbiased estimator: with frequency weights the sum of the weights is used as the sample size, otherwise the row count in the data is used. I used this answer to help get the weighted percentiles.
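To illustrate the weighted-percentile interpolation the function uses, here is a small standalone sketch (my own illustration, not from the original answer) for the 2016 group: each sorted observation is placed at its cumulative normalized weight minus half its own weight, and the quantiles are linearly interpolated between those positions.

import numpy as np

age = np.array([41, 65])          # sorted values, 2016 group
w = np.array([1.2, 0.7])
w = w / w.sum()                   # normalize the weights

# each point sits at its cumulative weight minus half its own weight
x = np.cumsum(w) - 0.5 * w

print(np.interp([0.25, 0.5, 0.75], x, age))
# [41.         49.84210526 61.84210526]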

import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df
   year  age   wt
0  2016   41  1.2
1  2016   65  0.7
2  2020   35  0.8
3  2020   28  1.5

Then I define the function below.

def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use the sum of weights as the sample size (this only
    affects the degrees-of-freedom correction for the unbiased variance)
    '''
    
    # sort so values ascend within each group (np.interp needs increasing x)
    if isinstance(by, list):
        df = df.sort_values(by + [col])
    else:
        df = df.sort_values([by] + [col])
    
    # temporary helper columns; make sure none clash with existing columns
    newcols = ['gb_weights', 'col_weighted', 'col_mean', 
        'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])
    
    # normalized weights: each group's weights sum to 1
    df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
    
    # running cumulative weight within each group (used for the percentiles)
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
    
    # weighted values and the weighted group mean
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    
    # weighted squared deviations from the group mean (for the variance)
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'
    
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'
    
    # weighted percentiles: place each observation at its cumulative weight
    # minus half its own weight, offset each group by its group number so a
    # single np.interp call can handle every group at once
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max()+1
    x = np.hstack([quantiles+i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))
    
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])
    
    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]
    
    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]
    
    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    
    stats['weight'] = sizes
    
    # use the "sample size" (weight) to obtain std. deviation from unbiased
    # variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
    
    return stats

Then test it.

weighted_groupby_describe(df, 'age', 'year', 'wt')
      count       mean        std  min  ...        50%        75%  max  weight
year                                    ...                                   
2016      2  49.842105  16.372398   41  ...  49.842105  61.842105   65       2
2020      2  30.434783   4.714936   28  ...  30.434783  33.934783   35       2

Compare this to the output without the weights.

df.groupby('year')['age'].describe()
      count  mean        std   min    25%   50%    75%   max
year                                                        
2016    2.0  53.0  16.970563  41.0  47.00  53.0  59.00  65.0
2020    2.0  31.5   4.949747  28.0  29.75  31.5  33.25  35.0
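
For completeness, the simpler .apply-based alternative mentioned at the top of this answer might look roughly like the sketch below (my own sketch, not part of the original answer; the helper name weighted_describe is hypothetical, and it omits the weighted percentiles for brevity):

import pandas as pd

def weighted_describe(g, col, wt):
    # weighted describe for a single group, analytic weighting
    w = g[wt] / g[wt].sum()
    mean = (g[col] * w).sum()
    n = len(g)
    # unbiased variance: n/(n-1) correction with the observation count
    std = (((g[col] - mean) ** 2 * w).sum() * n / (n - 1)) ** 0.5
    return pd.Series({'count': n, 'mean': mean, 'std': std,
                      'min': g[col].min(), 'max': g[col].max()})

df.groupby('year').apply(weighted_describe, col='age', wt='wt')

This is shorter and easier to verify, at the likely cost of slower per-group Python-level work on a large dataframe.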
