Why is pandas.Series.std() different from numpy.std()?
Problem description
This is what I am trying to explain:
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
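To make the two conventions concrete, here is a minimal sketch that computes both standard deviations by hand for the four prices in this question, using only the Python standard library. The variable names are illustrative and not from the original post.

```python
import math
import statistics

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in prices)

# Population standard deviation: divide by N (the np.std default, ddof=0).
population_std = math.sqrt(squared_deviations / n)

# Sample standard deviation: divide by N-1 (Bessel's correction,
# the pandas default, ddof=1).
sample_std = math.sqrt(squared_deviations / (n - 1))

print(population_std)  # about 6.26, the numpy result above
print(sample_std)      # about 7.23, the pandas result above

# The standard library exposes the same two conventions.
print(statistics.pstdev(prices), statistics.stdev(prices))
```

The only difference between the two results is the divisor, which is what the `ddof` ("delta degrees of freedom") argument controls in both libraries.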
There is a related discussion here, but their suggestions do not work either.
I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.
As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
We can get the same (correct) values with
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
What?! 7.228416 is not 6.259992.
Let's try again.
>>> df.groupby('restaurant_id').std()
Same thing.
>>> df.groupby('restaurant_id')['price'].std()
Same thing.
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
Same thing.
However, this works:
for id, group in df.groupby('restaurant_id'):
    print(id, np.std(group['price']))
Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?
I see. Pandas uses Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,
pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
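The same `ddof` argument should also work on the grouped object, which addresses the per-restaurant aggregation asked about in the question. The following is a sketch under the assumption that the dataframe looks like the one shown in the question. It is rebuilt by hand here, so the construction code is illustrative rather than the asker's own.

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question (single restaurant, four prices).
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, matching the np.std default.
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)

# The pandas default (ddof=1) divides by N-1.
sample_std = df.groupby("restaurant_id")["price"].std()

print(population_std)
print(sample_std)
```

The result is a Series indexed by `restaurant_id`, with one standard deviation per restaurant. Going the other way, `np.std([7, 20, 22, 22], ddof=1)` reproduces the pandas default.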