提高 Pandas 合并性能 [英] Improve Pandas Merge performance

查看:24
本文介绍了提高 Pandas 合并性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我特别没有 Pands Merge 的性能问题,正如其他帖子所暗示的那样,但我有一个类,其中有很多方法,可以对数据集进行大量合并.

I specifically dont have performace issue with Pands Merge, as other posts suggest, but I've a class in which there are lot of methods, which does a lot of merge on datasets.

班级有大约 10 个小组,大约有 15 个合并.虽然 groupby 非常快,但在 1.5 秒的总执行时间中,这 15 次合并调用大约需要 0.7 秒.

The class has around 10 group by and around 15 merge. While groupby is pretty fast, out of total execution time of 1.5 seconds for class, around 0.7 seconds goes in those 15 merge calls.

我想在这些合并调用中提高性能.因为我将有大约 4000 次迭代,因此在单次迭代中总共节省 0.5 秒将导致整体性能降低大约 30 分钟,这会很棒.

I want to speed up performace in those merge calls. As I will have around 4000 iterations, hence saving .5 seconds overall in single iteration will lead to overall performance reduction by around 30min, which will be great.

有什么我应该尝试的建议吗?我试过:赛通Numba,而 Numba 更慢.

Any suggestions I should try? I tried: Cython Numba, and Numba was slower.

谢谢

添加示例代码片段:我的合并语句:

Edit 1: Adding sample code snippets: My merge statements:

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf

tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')

并且,通过实施连接,我合并了以下声明:

And, by implementing Joins, I incorporate the following satatements:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')

tmpDf.reset_index(inplace=True)

注意,所有都是名为:def merge_earlier_created_values(self):

而且,当我通过以下方式从 profilehook 进行定时呼叫时:

And, when I did timedcall from profilehooks by following:

@timedcall(immediate=True)
def merge_earlier_created_values(self):

我得到以下结果:

该方法的分析结果给出:

The result of profiling of that method gives:

@profile(immediate=True)
def merge_earlier_created_values(self):

使用Merge对函数进行分析如下:

The profiling of function, by using Merge is as follows:

*** PROFILER RESULTS ***
merge_earlier_created_values (E:ProjectsPredictive Inbound Cartoon     Estimation-MLOPythonCodeToSubmithelpersget_prev_data_by_date.py:122)
function called 1 times

     71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
    7    0.283    0.040    0.283    0.040 {built-in method gc.collect}
   15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
   10    0.000    0.000    0.153    0.015 merge.py:26(merge)
   10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
  8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
    4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
    1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
  120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
  266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
  120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
  120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
   10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
   25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
   74    0.023    0.000    0.023    0.000 {pandas.algos.take_2d_axis1_object_object}
   50    0.022    0.000    0.022    0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
  120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
   88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
    1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
  537    0.017    0.000    0.017    0.000 {built-in method numpy.core.multiarray.empty}
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
   24    0.000    0.000    0.014    0.001 internals.py:3625(take)
   10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
   10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
   15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
  234    0.000    0.000    0.013    0.000 common.py:158(isnull)
  234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
   15    0.000    0.000    0.012    0.001 generic.py:1609(take)
   20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)

使用 Joins 的分析如下:

The profiling by using Joins is as follows:

65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
    7    0.293    0.042    0.293    0.042 {built-in method gc.collect}
   10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
   10    0.000    0.000    0.139    0.014 merge.py:26(merge)
  8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
    4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
   10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
    5    0.000    0.000    0.122    0.024 frame.py:4324(join)
    5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
    1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
  100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
  205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
  100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
  100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
   52    0.031    0.001    0.031    0.001 {pandas.algos.take_2d_axis1_object_object}
    5    0.000    0.000    0.030    0.006 base.py:2329(join)
   37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
    6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
    7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
    2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
    7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
    7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
   84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
   19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
  100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
    1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
   83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
   68    0.015    0.000    0.015    0.000 {method 'copy' of 'numpy.ndarray' objects}
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)

如您所见,merge 比 joins 更快,虽然它的值很小,但是超过 4000 次迭代,这个小值在几分钟内变成了一个巨大的数字.

As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.

谢谢

推荐答案

set_index 在合并列上确实加快了这个速度.下面是 julien-marrec's Answer 的更现实版本.

set_index on merging column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's Answer.

import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

%%timeit
    x = df1.merge(df2, how='left', left_on='A', right_on='A2')   
#1 loop, best of 3: 664 ms per loop

%%timeit  
    x = df1.set_index('A').join(df2.set_index('A2'), how='left') 
#1 loop, best of 3: 354 ms per loop

%%time 
    df1.set_index('A', inplace=True)
    df2.set_index('A2', inplace=True)
#Wall time: 16 ms

%%timeit
    x = df1.join(df2, how='left')  
#10 loops, best of 3: 80.4 ms per loop

当要连接的列在两个表上的整数顺序不同时,您仍然可以期待 8 倍的显着加速.

When the column to be joined has integers not in the same order on both tables you can still expect a great speed up of 8 times.

这篇关于提高 Pandas 合并性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆