Computing np.diff in Pandas after using groupby leads to unexpected result

Problem description

I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?

Here is some data that has the right structure (code modeled on an answer here):

import pandas as pd
import numpy as np
import random
from itertools import product

random.seed(1)       # so you can play along at home
np.random.seed(2)    # ditto

# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).strftime('%Y-%m-%d')
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))

mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})

Ok, great. This gives me a frame like so:

     date        ticker      value
0    2013-10-03  ticker_2    0.435995
1    2013-10-04  ticker_2    0.025926
2    2013-10-02  ticker_1    0.549662
3    2013-10-01  ticker_0    0.435322
4    2013-10-02  ticker_2    0.420368
5    2013-10-03  ticker_0    0.330335
6    2013-10-04  ticker_1    0.204649
7    2013-10-02  ticker_0    0.619271
8    2013-10-01  ticker_2    0.299655

My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be in order to do this, but the ordering and the differencing needs to be done "ticker-wise" so that gaps in another ticker don't cause NA's for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:

data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort_values(['ticker', 'date'], inplace=True)  # DataFrame.sort was removed; sort_values replaces it
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1

and returns:

     date        ticker      value       diffs
0    2013-10-03  ticker_2    0.435995    0.015627
1    2013-10-04  ticker_2    0.025926   -0.410069
2    2013-10-02  ticker_1    0.549662    NaN
3    2013-10-01  ticker_0    0.435322    NaN
4    2013-10-02  ticker_2    0.420368    0.120713
5    2013-10-03  ticker_0    0.330335   -0.288936
6    2013-10-04  ticker_1    0.204649   -0.345014
7    2013-10-02  ticker_0    0.619271    0.183949
8    2013-10-01  ticker_2    0.299655    NaN

So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:

data2 = data.copy()
data2.sort_values(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2

A quick check shows that, in fact, data1 is equal to data2. However, if I do this:

data3 = data.copy()
data3.sort_values(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3

I get a strange result:

     date        ticker     value       diffs
0    2013-10-03  ticker_2    0.435995    0
1    2013-10-04  ticker_2    0.025926   NaN
2    2013-10-02  ticker_1    0.549662   NaN
3    2013-10-01  ticker_0    0.435322   NaN
4    2013-10-02  ticker_2    0.420368   NaN
5    2013-10-03  ticker_0    0.330335    0
6    2013-10-04  ticker_1    0.204649   NaN
7    2013-10-02  ticker_0    0.619271   NaN
8    2013-10-01  ticker_2    0.299655    0

What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda function syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?

Recommended answer

Nice, easy-to-reproduce example! More questions should be like this!

Just pass a lambda to transform (this is tantamount to passing a function object, e.g. np.diff or Series.diff, directly). So this is equivalent to data1/data2:

In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(pd.Series.diff)

In [34]: data3.sort_index(inplace=True)

In [25]: data3
Out[25]: 
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN

[9 rows x 4 columns]
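
The equivalence claimed above can be checked on a small frame. This is a minimal sketch, assuming a recent pandas version; the frame `toy` and its values are made up for illustration and are not the question's data.

```python
import pandas as pd

# Hypothetical toy data, already sorted within each ticker.
toy = pd.DataFrame({
    'ticker': ['a', 'a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 6.0, 10.0, 14.0],
})
grouped = toy.groupby('ticker')['value']

# Three spellings of the same per-group difference.
via_lambda = grouped.transform(lambda x: x.diff())
via_method = grouped.transform(pd.Series.diff)
via_builtin = grouped.diff()

# Each result keeps the original index, with NaN at the start of every group.
print(via_lambda.equals(via_method) and via_method.equals(via_builtin))
print(via_builtin)
```

All three results have one row per input row, which is what lets them be assigned straight back as a new column.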

I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce the input and wrap the output, e.g. __array__ on input and __array_wrap__ on output). I am not really sure why; see a bit more info here. So the bottom line is that np.diff does not deal with the index properly and does its own calculation (which in this case is wrong).
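
Whatever the exact mechanism, one difference between the two functions is easy to see on a plain array: np.diff returns one element fewer than its input and carries no index, while Series.diff keeps the length and index and puts NaN in the first slot. A small sketch, with made-up values standing in for one ticker's group:

```python
import numpy as np
import pandas as pd

# Hypothetical values for a single group; the index labels are arbitrary.
s = pd.Series([1.0, 4.0, 9.0], index=[10, 11, 12])

raw = np.diff(s.to_numpy())  # plain ndarray, length len(s) - 1, no index
aligned = s.diff()           # Series with the same length and index as s

print(len(s), len(raw), len(aligned))
print(raw)
print(aligned)
```

Since transform needs a result that lines up with each group's rows, a result that is one element short cannot be mapped back onto the group cleanly.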

Pandas has a lot of methods that don't just call the numpy function, mainly because they handle different dtypes, handle NaNs, and, in this case, handle 'special' diffs. E.g. you can pass a time frequency to a datelike index, where it calculates how many n to actually diff.
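
As a rough illustration of that extra handling, here are two things Series.diff does in recent pandas versions: differencing over more than one row via the `periods` argument, and differencing a datetime column into timedeltas. The values are invented for the example, and this sketch does not cover the frequency-based case mentioned above.

```python
import pandas as pd

# Hypothetical numeric series: difference against the value two rows back.
s = pd.Series([1.0, 4.0, 9.0, 16.0])
lag2 = s.diff(periods=2)

# Hypothetical timestamps: diff yields timedeltas, with NaT in the first slot.
stamps = pd.Series(pd.to_datetime(['2013-10-01', '2013-10-02', '2013-10-04']))
gaps = stamps.diff()

print(lag2)
print(gaps)
```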
