Pandas: why pandas.Series.std() is different from numpy.std()
Problem description
Another update: resolved (see comments and my own answer).
Update: this is what I am trying to explain.
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
Answer: this is explained by Bessel's correction: pandas uses N-1 instead of N in the denominator of the standard deviation formula. I wish pandas used the same convention as numpy.
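To make the two denominators concrete, here is a minimal sketch that works through the arithmetic for the four prices using only the standard library. The variable names are illustrative and not from the original post.

```python
import math

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n  # 17.75

# Sum of squared deviations from the mean; both formulas share this numerator.
ss = sum((x - mean) ** 2 for x in prices)  # 156.75

# Denominator N: the population formula, the default for np.std (ddof=0).
population_std = math.sqrt(ss / n)

# Denominator N-1: Bessel's correction, the default for pandas .std() (ddof=1).
sample_std = math.sqrt(ss / (n - 1))

print(population_std)  # about 6.26
print(sample_std)      # about 7.23
```

Both numbers are valid standard deviations; they differ only in which denominator is used.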
There is a related discussion here, but their suggestions do not work either.
I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
Question: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.
As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
We can get the same (correct) values with
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
What?! 7.228416 is not 6.259992.
Let's try again.
>>> df.groupby('restaurant_id').std()
Same thing.
>>> df.groupby('restaurant_id')['price'].std()
Same thing.
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
Same thing.
However, this works:
for id, group in df.groupby('restaurant_id'):
    print(id, np.std(group['price']))
Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?
I see. Pandas uses Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,
pd.Series([7,20,22,22]).std(ddof=0) == np.std([7,20,22,22])
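The same ddof argument can be passed to the groupby version of std(), which gives one population standard deviation per restaurant without an explicit loop. The sketch below rebuilds the example frame and adds a second restaurant (id 20000, prices 10 and 14). That second restaurant is made up for illustration and is not part of the original data.

```python
import numpy as np
import pandas as pd

# Restaurant 10407 is from the question; restaurant 20000 is a hypothetical
# addition so that the groupby has more than one group.
df = pd.DataFrame({
    "restaurant_id": [10407, 10407, 10407, 10407, 20000, 20000],
    "price": [7, 20, 22, 22, 10, 14],
})

# ddof=0 switches pandas from the N-1 denominator to the N denominator.
pop_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(pop_std)

# Going the other way: ask numpy for the N-1 denominator to match pandas' default.
sample_std_np = np.std([7, 20, 22, 22], ddof=1)
print(sample_std_np)  # about 7.23
```

The result is a Series indexed by restaurant_id. Which convention is appropriate depends on whether the prices are treated as the whole population or as a sample from it.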