Computing np.diff in Pandas after using groupby leads to unexpected result
Problem description
I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?
Here is some data that has the right structure (code modeled on an answer here):
import pandas as pd
import numpy as np
import random
from itertools import product
random.seed(1) # so you can play along at home
np.random.seed(2) # ditto
# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).strftime('%Y-%m-%d')
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))
mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})
Ok, great. This gives me a frame like so:
         date    ticker     value
0  2013-10-03  ticker_2  0.435995
1  2013-10-04  ticker_2  0.025926
2  2013-10-02  ticker_1  0.549662
3  2013-10-01  ticker_0  0.435322
4  2013-10-02  ticker_2  0.420368
5  2013-10-03  ticker_0  0.330335
6  2013-10-04  ticker_1  0.204649
7  2013-10-02  ticker_0  0.619271
8  2013-10-01  ticker_2  0.299655
My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be sorted to do this, but the ordering and the differencing need to be done "ticker-wise" so that gaps in another ticker don't cause NAs for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:
data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort_values(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1
and returns:
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN
So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:
data2 = data.copy()
data2.sort_values(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
A quick check shows that, in fact, data1 is equal to data2. However, if I do this:
data3 = data.copy()
data3.sort_values(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3
I get a strange result:
         date    ticker     value  diffs
0  2013-10-03  ticker_2  0.435995      0
1  2013-10-04  ticker_2  0.025926    NaN
2  2013-10-02  ticker_1  0.549662    NaN
3  2013-10-01  ticker_0  0.435322    NaN
4  2013-10-02  ticker_2  0.420368    NaN
5  2013-10-03  ticker_0  0.330335      0
6  2013-10-04  ticker_1  0.204649    NaN
7  2013-10-02  ticker_0  0.619271    NaN
8  2013-10-01  ticker_2  0.299655      0
What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda function syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?
Recommended answer
Nice, easy-to-reproduce example! More questions should be like this!
Passing a lambda to transform is tantamount to passing a function object, e.g. np.diff (or Series.diff), directly, so you can just pass Series.diff. This is equivalent to data1/data2:
In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(pd.Series.diff)
In [34]: data3.sort_index(inplace=True)
In [25]: data3
Out[25]:
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN

[9 rows x 4 columns]
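As a rough check of that equivalence, here is a minimal sketch on a small made-up frame (the name toy and its values are hypothetical, not the data from the question). It compares the built-in groupby diff, the lambda form, and the unbound pd.Series.diff passed straight to transform:

```python
import pandas as pd

# Hypothetical toy data, already in per-ticker order; not the question's frame.
toy = pd.DataFrame({
    'ticker': ['a', 'b', 'a', 'b', 'a'],
    'value': [1.0, 10.0, 4.0, 30.0, 9.0],
})

grouped = toy.groupby('ticker')['value']

via_method = grouped.diff()                         # built-in groupby diff
via_lambda = grouped.transform(lambda x: x.diff())  # lambda, as in data1
via_unbound = grouped.transform(pd.Series.diff)     # function object, no lambda

# All three keep the original row index and put NaN at each group's first row.
print(via_unbound)
```

Each result stays aligned to the original index, which is why assigning it back as a column does not reorder the frame.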
I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__ on input and __array_wrap__ on output). I am not really sure why; see a bit more info here. So the bottom line is that np.diff is not dealing with the index properly and is doing its own calculation (which in this case is wrong).
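To illustrate the index point, here is a small sketch on a made-up Series (the names s, raw and padded are hypothetical). With current NumPy, np.diff hands back a plain array that is one element shorter than its input and carries no index, whereas Series.diff keeps the length and index and marks the first row as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical example Series with a non-default index.
s = pd.Series([1.0, 4.0, 9.0, 16.0], index=list('abcd'))

raw = np.diff(s)     # plain ndarray, one element shorter, index is lost
aligned = s.diff()   # Series of the same length, NaN in the first position

# To match Series.diff by hand, pad the front with NaN and reattach the index.
padded = pd.Series(np.concatenate(([np.nan], raw)), index=s.index)

print(type(raw).__name__, len(raw), len(aligned))
```

A result that is shorter than its group has no obvious way to line up with the group's rows, which is consistent with the odd diffs column in data3.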
Pandas has a lot of methods that don't just call the corresponding numpy function, mainly because they handle different dtypes, handle NaNs, and in this case handle 'special' diffs; e.g. you can pass a time frequency with a datelike index, and it calculates how many periods n to actually diff.
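A couple of these extras can be sketched on made-up data (the names nums and stamps are hypothetical). This only shows the periods argument and datetime handling of Series.diff; it does not cover the frequency-based case mentioned above:

```python
import pandas as pd

# Hypothetical numeric Series: difference against the value two rows back.
nums = pd.Series([1.0, 4.0, 9.0, 16.0])
lag2 = nums.diff(periods=2)   # first two rows have nothing to compare against

# Hypothetical datetime Series: diff yields timedeltas, with NaT in the first row.
stamps = pd.Series(pd.to_datetime(['2013-10-01', '2013-10-02', '2013-10-04']))
gaps = stamps.diff()

print(lag2)
print(gaps)
```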