Why is pandas.Series.std() different from numpy.std()?

Problem description

This is what I am trying to explain:

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
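
To make the two denominators concrete, here is a minimal sketch that computes both values by hand for the four prices used in this question, using only the standard library. The variable names are mine, chosen for illustration.

```python
import math

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean (shared by both formulas).
sum_sq = sum((x - mean) ** 2 for x in prices)

# Divide by N: population standard deviation (what np.std returns by default).
population_std = math.sqrt(sum_sq / n)

# Divide by N-1: Bessel-corrected sample standard deviation
# (what pd.Series.std returns by default).
sample_std = math.sqrt(sum_sq / (n - 1))

print(population_std)  # about 6.26
print(sample_std)      # about 7.23
```

The only difference between the two results is the denominator, which is what the `ddof` ("delta degrees of freedom") argument controls in both libraries.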


There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe, so that I get a new Series with the standard deviation for each restaurant?

Solution

I see. Pandas is using Bessel's correction by default -- that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
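
The same `ddof` argument can be passed to the groupby version of `std()`, which answers the aggregation question directly. The following is a sketch, assuming a dataframe shaped like the one in the question (I rebuild it here so the example is self-contained). It also shows the reverse direction: asking numpy for the N-1 convention with `ddof=1`.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# One value per restaurant, using N in the denominator (numpy's default convention).
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(population_std)

# The other way around: make numpy use N-1, matching pandas' default.
sample_std_numpy = np.std(df["price"].to_numpy(), ddof=1)
print(sample_std_numpy)
```

Which convention is appropriate depends on whether the rows are the whole population or a sample from it. Passing `ddof` explicitly in both libraries avoids relying on either default.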
