Calculate STD manually using Groupby Pandas DataFrame


Question

I was trying to write a solution for this question (http://stackoverflow.com/questions/26599347/groupby-pandas-dataframe-and-calculate-mean-and-stdev-of-one-column-and-add-the) by providing a different, manual way to calculate the mean and std.

I created the dataframe as described in the question:

a= ["Apple","Banana","Cherry","Apple"]
b= [3,4,7,3]
c= [5,4,1,4]
d= [7,8,3,7]

import pandas as pd
df =  pd.DataFrame(index=range(4), columns=list("ABCD"))

df["A"]=a
df["B"]=b
df["C"]=c
df["D"]=d

Then I created a list of the values in A without duplicates. I looped over that list, and for each value I selected the matching rows, calculated the mean and std of C, and wrote the result back to every row of that group.

import numpy as np

# Unique values of column A
l = list(set(df.A))

# One slot per row; filled with the per-group results below
listMean = [0] * len(df.C)
listSTD = [0] * len(df.C)

for x in l:
    # Mean of C over the rows where A == x
    s = np.mean(df[df['A'] == x].C.values)
    # Positions of the rows that belong to this group
    z = [index for index, item in enumerate(df['A'].values) if x == item]
    for i in z:
        listMean[i] = s

for x in l:
    # Std of C over the rows where A == x
    s = np.std(df[df['A'] == x].C.values)
    z = [index for index, item in enumerate(df['A'].values) if x == item]
    for i in z:
        listSTD[i] = s

df['C'] = listMean
df['E'] = listSTD

print(df)

I used describe() grouped by "A" to calculate the mean and std:

print(df.groupby('A').describe())

And tested the suggested solution:

# Column names match the uppercase labels of the dataframe built above
result = df.groupby(['A'], as_index=False).agg(
                      {'C':['mean','std'],'B':'first', 'D':'first'})

I noticed that I got different results when I calculated the std ("E"). I am just curious: what did I miss?

Recommended answer

There are two kinds of standard deviation (SD): the population SD and the sample SD.

The population SD is used when the values represent the entire universe of values that you are studying.

The sample SD is used when the values are a mere sample from that universe.
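
For reference, the two definitions are usually written as below. The notation is an assumption of this sketch rather than something taken from the original post: N is the number of values, x̄ is their mean, σ denotes the population SD and s the sample SD.

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}
\qquad
s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}
```

The only difference is the divisor: N for the population SD, N - 1 for the sample SD.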

np.std calculates the population SD by default, while Pandas' Series.std calculates the sample SD by default.

In [42]: np.std([4,5])
Out[42]: 0.5

In [43]: np.std([4,5], ddof=0)
Out[43]: 0.5

In [44]: np.std([4,5], ddof=1)
Out[44]: 0.70710678118654757

In [45]: x = pd.Series([4,5])

In [46]: x.std()
Out[46]: 0.70710678118654757

In [47]: x.std(ddof=0)
Out[47]: 0.5

ddof stands for "delta degrees of freedom", and controls the number subtracted from N in the divisor of the SD formulas: the divisor is N - ddof.
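
To make the role of ddof concrete, here is a minimal sketch that computes the SD by hand with a divisor of N - ddof and compares it against the library defaults. The helper name manual_std is hypothetical and exists only for this illustration.

```python
import math

import numpy as np
import pandas as pd


def manual_std(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


data = [4, 5]

population_sd = manual_std(data, ddof=0)  # divisor N
sample_sd = manual_std(data, ddof=1)      # divisor N - 1

# The manual results agree with the respective library defaults.
assert math.isclose(population_sd, np.std(data))
assert math.isclose(sample_sd, pd.Series(data).std())
```

With ddof=0 the helper matches np.std, and with ddof=1 it matches Series.std, which is the whole difference between the two results in the question.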

On Wikipedia, the "uncorrected sample standard deviation" is what I called the population SD, and the "corrected sample standard deviation" is the sample SD.
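
Applied to the dataframe from the question, the same distinction can be sketched with groupby alone. This is one possible way to get both variants per row, not necessarily what the linked question's answers used; the column names sample_sd and population_sd are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["Apple", "Banana", "Cherry", "Apple"],
    "B": [3, 4, 7, 3],
    "C": [5, 4, 1, 4],
    "D": [7, 8, 3, 7],
})

grouped_c = df.groupby("A")["C"]

# pandas default (ddof=1): sample SD. Groups with a single row give NaN.
sample_sd = grouped_c.transform("std")

# ddof=0: population SD, the same convention np.std uses by default.
population_sd = grouped_c.transform(lambda s: s.std(ddof=0))

print(pd.DataFrame({
    "A": df["A"],
    "sample_sd": sample_sd,
    "population_sd": population_sd,
}))
```

For the two Apple rows the sample SD is about 0.707 and the population SD is 0.5, which are the two values the manual loop and the agg call disagree on.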
