Pandas: why pandas.Series.std() is different from numpy.std()


Problem description


Another update: resolved (see comments and my own answer).

Update: this is what I am trying to explain.

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
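
To spell out where the two numbers come from, here is a minimal sketch (my own illustration, not from the original thread; the variable names are just for this example) computing both conventions by hand:

import numpy as np
import pandas as pd

prices = [7, 20, 22, 22]
mean = np.mean(prices)                         # 17.75
ss = sum((x - mean) ** 2 for x in prices)      # 156.75, sum of squared deviations

print(np.sqrt(ss / len(prices)))        # 6.2599920... divide by N   (numpy default, ddof=0)
print(np.sqrt(ss / (len(prices) - 1)))  # 7.2284161... divide by N-1 (pandas default, ddof=1)

# The same values through the library calls:
print(np.std(prices))                   # 6.2599920...
print(pd.Series(prices).std())          # 7.2284161...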


There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for id, group in df.groupby('restaurant_id'):
    print(id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?

Solution

I see. Pandas is using Bessel's correction by default -- that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
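
So, to answer the aggregation question, passing ddof=0 through the groupby should give per-restaurant values that match numpy (a short sketch, assuming the df from the question and a pandas version whose groupby std accepts ddof):

# Population standard deviation (ddof=0) per restaurant, matching np.std
df.groupby('restaurant_id')['price'].std(ddof=0)        # 10407 -> 6.259992

# Equivalent, via agg with a lambda:
df.groupby('restaurant_id')['price'].agg(lambda x: x.std(ddof=0))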
