Using describe() with weighted data -- mean, standard deviation, median, quantiles


Question

I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched the documentation as well as this site and haven't been able to find anything yet.)

I've got a dataframe (called resp) containing respondent level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc [short for annual income]).

resp["anninc"].describe()

Which gives me the basic stats:

count     76310.000000
mean      43455.874862
std       33154.848314
min           0.000000
25%       20140.000000
50%       34980.000000
75%       56710.000000
max      152884.330000
dtype: float64

But there's a catch. Given how the sample was built, there was a need to weight adjust the respondent data so that not every one is deemed as "equal" when performing the analysis. I have another column in the dataframe (called tufnwgrp) that represents the weight that should be applied to each record during the analysis.

In my prior SAS life, most of the proc's have options to process data with weights like this. For example, a standard proc univariate to give the same results would look something like this:

proc univariate data=resp;
  var anninc;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

And the same analysis using weighted data would look something like this:

proc univariate data=resp;
  var anninc;
  weight tufnwgrp;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

Is there a similar sort of weighting option available in pandas for methods like describe() etc?

Recommended answer

There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.

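import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Toy data: x runs from 1 to 100 and each row's weight equals its value
df = pd.DataFrame({ 'x': range(1, 101), 'wt': range(1, 101) })

# ddof=1 divides the weighted sum of squares by (sum of weights - 1)
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

print( wdf.mean )
print( wdf.std )
print( wdf.quantile([0.25, 0.50, 0.75]) )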


67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87
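
As a rough cross-check of what these numbers mean, the sketch below treats the weights as frequency weights, i.e. a row with weight w stands for w identical rows, and recomputes the same statistics with plain NumPy. The variable names (`expanded`, `fw_mean`, `fw_std`, `fw_quantiles`) are made up for this illustration, and the quantile rule used here (smallest x whose cumulative weight share reaches p) is an assumption of the sketch; a library may interpolate differently on other data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(1, 101), "wt": range(1, 101)})
x = df["x"].to_numpy()
wt = df["wt"].to_numpy()

# Frequency-weight interpretation: repeat each x value wt times,
# then compute ordinary (unweighted) statistics on the expanded data.
expanded = np.repeat(x, wt)
fw_mean = expanded.mean()
fw_std = expanded.std(ddof=1)

# Weighted quantile (assumed rule for this sketch): the smallest x whose
# cumulative share of the total weight is at least p.
order = np.argsort(x)
x_sorted = x[order]
cum_share = np.cumsum(wt[order]) / wt.sum()
fw_quantiles = {
    p: int(x_sorted[np.searchsorted(cum_share, p)])
    for p in (0.25, 0.50, 0.75)
}

print(fw_mean)
print(fw_std)
print(fw_quantiles)
```

For this toy data the expansion approach agrees with the output shown above. It only works for non-negative integer weights, which is one reason to prefer a weighted formula for survey weights.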

I don't use SAS, but this gives the same answer as the Stata command:

sum x [fw=wt], detail

Stata actually has a few weight options, and in this case it gives a slightly different answer if you specify aw (analytical weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
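
To make the difference concrete, here is a minimal sketch of two conventions for the weighted standard deviation. The function `weighted_std` and its `kind` argument are hypothetical names for this illustration. The "analytic" branch rescales the weights to sum to the number of rows before dividing by n - 1; that this matches Stata's aw exactly is an assumption here, not something verified against Stata.

```python
import numpy as np

def weighted_std(x, w, kind="frequency"):
    """Weighted standard deviation under two conventions (illustrative only)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    mean = np.sum(w * x) / np.sum(w)
    ss = np.sum(w * (x - mean) ** 2)  # weighted sum of squared deviations
    if kind == "frequency":
        # The sum of the weights plays the role of the sample size.
        return np.sqrt(ss / (w.sum() - 1))
    if kind == "analytic":
        # Assumed convention: rescale weights to sum to n rows, divide by n - 1.
        n = len(x)
        return np.sqrt((ss * n / w.sum()) / (n - 1))
    raise ValueError("kind must be 'frequency' or 'analytic'")

x = np.arange(1, 101)
w = np.arange(1, 101)

std_fw = weighted_std(x, w, "frequency")
std_aw = weighted_std(x, w, "analytic")

# Halving every weight changes the frequency-weighted result,
# but leaves the analytic-style result unchanged.
std_fw_half = weighted_std(x, w / 2, "frequency")
std_aw_half = weighted_std(x, w / 2, "analytic")

print(std_fw, std_aw)
print(std_fw_half, std_aw_half)
```

The practical point is that under the frequency interpretation the scale of the weights matters, so survey weights that have been normalized or trimmed can give a different standard deviation than the raw weights.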

Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem as the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have weighted min and max, but it's also easy to calculate in pandas:

df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
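
A small sketch of the same idea wrapped in a helper, using a made-up four-row frame where the extreme values carry zero weight. The names `weighted_min_max` and `toy` are hypothetical and exist only for this example.

```python
import pandas as pd

def weighted_min_max(frame, value_col, weight_col):
    """Min and max over rows whose weight is strictly positive."""
    kept = frame.loc[frame[weight_col] > 0, value_col]
    return pd.Series({"min": kept.min(), "max": kept.max()})

# The rows with x=1 and x=100 have zero weight, so they should be ignored.
toy = pd.DataFrame({"x": [1, 5, 9, 100], "wt": [0, 2, 3, 0]})

result = weighted_min_max(toy, "x", "wt")
print(result)
print(toy["x"].min(), toy["x"].max())  # unweighted extremes, for comparison
```

Rows with missing weights are dropped by the `> 0` comparison as well, which may or may not be what you want for your data.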
