Pandas: why pandas.Series.std() is different from numpy.std()

Views: 268 | Published: 2018/5/30 13:49:03 | Tags: python numpy pandas group-by statistics


Question


Another update: resolved (see comments and my own answer).

Update: this is what I am trying to explain.
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
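To make the difference concrete, here is a small sketch that computes both variants by hand for the four prices above. The variable names are mine, chosen for illustration; the `ddof` parameter ("delta degrees of freedom") is what both libraries use to select the denominator `N - ddof`.

```python
import math

import numpy as np
import pandas as pd

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean.
squared_dev_sum = sum((x - mean) ** 2 for x in prices)

# Population standard deviation: N in the denominator (numpy's default, ddof=0).
population_std = math.sqrt(squared_dev_sum / n)

# Sample standard deviation with Bessel's correction: N-1 in the
# denominator (pandas' default, ddof=1).
sample_std = math.sqrt(squared_dev_sum / (n - 1))

print(population_std)  # approximately 6.26, matches np.std(prices)
print(sample_std)      # approximately 7.23, matches pd.Series(prices).std()

# Either library can be switched to the other convention through ddof.
print(np.std(prices, ddof=1))         # numpy using N-1
print(pd.Series(prices).std(ddof=0))  # pandas using N
```

Neither value is wrong: they answer slightly different questions (describing exactly these four prices versus estimating the spread of a larger population from a sample).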

There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
We can get the same (correct) values with
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
What?! 7.228416 is not 6.259992.

Let's try again.
>>> df.groupby('restaurant_id').std()
Same thing.
>>> df.groupby('restaurant_id')['price'].std()
Same thing.
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
Same thing.

However, this works:
for id, group in df.groupby('restaurant_id'):
    print(id, np.std(group['price']))
Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?
Solution
I see. Pandas is using Bessel's correction by default -- that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,
pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
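Applied to the original groupby question, the same `ddof=0` argument can be passed to the grouped `std()`. The sketch below rebuilds the sample data from the question and adds a second restaurant (id 20000, with made-up prices) purely for illustration, so that the grouping actually has more than one group.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus a hypothetical second restaurant
# (id 20000) that is NOT in the original post.
df = pd.DataFrame(
    {
        "restaurant_id": [10407, 10407, 10407, 10407, 20000, 20000],
        "price": [7, 20, 22, 22, 10, 14],
    }
)

# ddof=0 makes pandas divide by N, matching numpy's default.
per_restaurant_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(per_restaurant_std)

# Cross-check: apply numpy directly to each group's underlying array.
numpy_check = df.groupby("restaurant_id")["price"].agg(
    lambda s: np.std(s.to_numpy())
)
print(numpy_check)
```

The result is a Series indexed by `restaurant_id`, which is the aggregated form the question asks for.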
