Pandas diff SeriesGroupBy is relatively slow


Problem description

Total time: 1.01876 s
Function: prepare at line 91

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    91                                           @profile
    92                                           def prepare():
    93                                           
    94         1       5681.0   5681.0      0.6     
    95         1       2416.0   2416.0      0.2      
    96                                           
    97                                               
    98         1        536.0    536.0      0.1      tss = df.groupby('user_id').timestamp
    99         1     949643.0 949643.0     93.2      delta = tss.diff()
   100         1       1822.0   1822.0      0.2      
   101         1      13030.0  13030.0      1.3      
   102         1       5193.0   5193.0      0.5      
   103         1       1251.0   1251.0      0.1      
   104                                           
   105         1       2038.0   2038.0      0.2      
   106                                           
   107         1       1851.0   1851.0      0.2     
   108                                           
   109         1        282.0    282.0      0.0      
   110                                           
   111         1       3088.0   3088.0      0.3      
   112         1       2943.0   2943.0      0.3      
   113         1        438.0    438.0      0.0      
   114         1       4658.0   4658.0      0.5      
   115         1      17083.0  17083.0      1.7      
   116         1       3115.0   3115.0      0.3      
   117         1       3691.0   3691.0      0.4      
   118                                           
   119         1          2.0      2.0      0.0      

I have a dataframe which I group by some key; from each group I select a column and run diff on it (per group). As the profiling results show, the diff operation is very slow compared to everything else and is the bottleneck. Is this expected? Are there faster alternatives that achieve the same result?

Some more explanation: in my use case the timestamps represent the times of a user's actions, and I want to compute the deltas between those actions (they are sorted), but each user's actions are completely independent of other users'.

Sample code

import pandas as pd
import numpy as np


# Toy frame: 'ts' holds per-user action timestamps, 'user' is the grouping key.
df = pd.DataFrame(
    {'ts': [1,2,3,4,60,61,62,63,64,150,155,156,
            1,2,3,4,60,61,62,63,64,150,155,163,
            1,2,3,4,60,61,62,63,64,150,155,183],
     'id': [1,2,3,4,60,61,62,63,64,150,155,156,
            71,72,73,74,80,81,82,83,64,160,165,166,
            21,22,23,24,90,91,92,93,94,180,185,186],
     'other': ['x','x','x','','x','x','','x','x','','x','',
               'y','y','y','','y','y','','y','y','','y','',
               'z','z','z','','z','z','','z','z','','z',''],
     'user': ['x','x','x','x','x','x','x','x','z','x','x','y',
              'y','y','y','y','y','y','y','y','x','y','y','x',
              'z','z','z','z','z','z','z','z','y','z','z','z']
     })

df.set_index('id', inplace=True)

# Per-user deltas between consecutive timestamps (the slow line in the profile).
deltas = df.groupby('user').ts.transform(pd.Series.diff)

Recommended answer

If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is effectively stored as integer pointers.
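
To see what "integer pointers" means in practice, here is a minimal illustration (my own sketch, not part of the original answer) using the example frame from the question: a categorical column stores one small integer code per row plus a single shared index of category labels, so the subsequent groupby works on integers rather than Python strings.

cat_user = df['user'].astype('category')

print(cat_user.cat.categories)    # Index(['x', 'y', 'z'], dtype='object')
print(cat_user.cat.codes.head())  # int8 codes: 0 for 'x', 1 for 'y', 2 for 'z'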

In the example below I see an improvement from 86 ms to 59 ms (the 59 ms counts both the one-off conversion and the subsequent groupby). This may improve further for larger datasets and where more users are repeated.

df = pd.concat([df] * 10000)                             # ~360k rows so the timings are meaningful

%timeit df.groupby('user').ts.transform(pd.Series.diff)  # 86.1 ms per loop (object dtype)

%timeit df['user'].astype('category')                    # 23.4 ms per loop (one-off conversion)
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff)  # 35.7 ms per loop (categorical dtype)
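
As a quick sanity check (my addition, not from the answer), the conversion only changes how the grouping key is stored, not the computed deltas; comparing the two results with pandas.testing should confirm they are identical.

import pandas.testing as pdt

df_obj = df.copy()
df_obj['user'] = df_obj['user'].astype(object)   # back to plain string labels

deltas_obj = df_obj.groupby('user').ts.transform(pd.Series.diff)
deltas_cat = df.groupby('user').ts.transform(pd.Series.diff)
pdt.assert_series_equal(deltas_obj, deltas_cat)  # raises if anything differs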

If you are performing multiple groupby operations, the one-off cost of converting to categorical is amortised across them.
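
The opening sentence also mentions sorting the data or dropping down to numpy as another route. For completeness, here is a rough sketch of that idea (my own, not from the answer), assuming a single stable sort by user is acceptable and that each user's timestamps already appear in the order the deltas should be taken (as stated in the question): after a stable sort, each user's rows are contiguous and keep their original relative order, so one vectorised diff plus masking the first row of every user block reproduces the per-group diff.

import numpy as np
import pandas as pd

u = df['user'].to_numpy()
order = u.argsort(kind='stable')        # stable: preserves per-user row order
ts_sorted = df['ts'].to_numpy()[order]
u_sorted = u[order]

d = np.empty(len(df))
d[1:] = ts_sorted[1:] - ts_sorted[:-1]
d[np.r_[True, u_sorted[1:] != u_sorted[:-1]]] = np.nan   # a user's first row has no delta

# Scatter the deltas back into the original row order, aligned with df.
deltas_fast = pd.Series(np.empty(len(df)), index=df.index)
deltas_fast.iloc[order] = d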

