pandas :与之前的价值有所不同 [英] Pandas: Difference to previous value

查看:79
本文介绍了 pandas :与之前的价值有所不同的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出一个看起来像这样的熊猫数据框

Given a Pandas Data Frame that looks like this

GROUP   VALUE    MASK
  1        5     false
  2       10     false
  2       20     false
  1        7     true
  3        17    false
  3        18    false
  1        100   false
  1        200   true

对于MASK为真的每一行,我希望将其与该组中先前值的差,即结果应为

For every row where the MASK is true I would like to get the difference to the previous value within this group, i.e., the result should be

VALUE DIFF
  7      2
  200  100

如何在Pandas中计算?

How can I compute this in Pandas?

对于具有约200万行和100万组的大型数据帧,我该如何快速计算呢?

How can I compute this fast for a large data frame with about 2 million rows and 1 million groups?

推荐答案

实际上,瓶颈是groupby.您实际上不需要使用groupby来解决此特定问题.要通过GROUP将数据帧sort进行排序,请对已排序的数据帧执行diff,通过MASKfilter应当可以.在排序前后,我们必须使用kind='mergesort'来保持组内顺序不变,

Actually the bottleneck is groupby. You don't actually need to use groupby for this specific problem. To sort the dataframe by GROUP, perform diff on the sorted dataframe and filter by MASK should be okay. We must use kind='mergesort' to keep the order unchanged within the group before and after sorting,

假设每个组的第一个元素的MASK始终为False(因为第一个元素对差异运算无意义),您可以使用此功能

Assume MASK is always False for the first element of each group ( since the first element is meaningless for difference operation ), you can use this

pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]

性能测试:

MAXN = 200000
GROUPS = 10000
df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS), "VALUE": np.ceil(np.random.rand(MAXN)*10000), "MASK":np.floor(np.random.rand(MAXN)*2).astype("bool")})

%timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
# 1 loop, best of 3: 1.28 s per loop

%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#10 loops, best of 3: 63.1 ms per loop

#MAXN = 2000000
#GROUPS = 1000000
%timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
#1 loop, best of 3: 1.24 s per loop

这篇关于 pandas :与之前的价值有所不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆