pandas :与之前的价值有所不同 [英] Pandas: Difference to previous value
问题描述
给出一个看起来像这样的熊猫数据框
Given a Pandas Data Frame that looks like this
GROUP VALUE MASK
1 5 false
2 10 false
2 20 false
1 7 true
3 17 false
3 18 false
1 100 false
1 200 true
对于MASK为真的每一行,我希望将其与该组中先前值的差,即结果应为
For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be
VALUE DIFF
7 2
200 100
如何在Pandas中计算?
How can I compute this in Pandas?
对于具有约200万行和100万组的大型数据帧,我该如何快速计算呢?
How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?
推荐答案
实际上,瓶颈是groupby
.您实际上不需要使用groupby
来解决此特定问题.要通过GROUP
将数据帧sort
进行排序,请对已排序的数据帧执行diff
,通过MASK
的filter
应当可以.在排序前后,我们必须使用kind='mergesort'
来保持组内顺序不变,
Actually the bottleneck is groupby
. You don't actually need to use groupby
for this specific problem. To sort
the dataframe by GROUP
, perform diff
on the sorted dataframe and filter
by MASK
should be okay. We must use kind='mergesort'
to keep the order unchanged within the group before and after sorting,
假设每个组的第一个元素的MASK始终为False(因为第一个元素对差异运算无意义),您可以使用此功能
Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this
pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
性能测试:
MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})
%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop
#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop
这篇关于 pandas :与之前的价值有所不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!