pandas groupby申请表现慢 [英] Pandas groupby apply performing slow

查看:102
本文介绍了 pandas groupby申请表现慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开发一个包含大量数据的程序.我正在使用python pandas模块来查找数据中的错误.这通常工作非常快.但是,我编写的当前代码似乎比应该的慢,我正在寻找一种加快速度的方法.

I am working on a program that involves large amounts of data. I am using the python pandas module to look for errors in my data. This usually works very fast. However this current piece of code I wrote seems to be way slower than it should be and I am looking for a way to speed it up.

为了让你们正确地测试它,我上传了相当大的一段代码.您应该能够按原样运行它.代码中的注释应解释我在这里试图做的事情.任何帮助将不胜感激.

In order for you guys to properly test it I uploaded a rather large piece of code. You should be able to run it as is. The comments in the code should explain what I am trying to do here. Any help would be greatly appreciated.

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np

# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files, this is an example of how it looks
TimeOfDay_options = ['Day','Evening','Night']
TypeOfCargo_options = ['Goods','Passengers']
np.random.seed(1234)
n = 10000

df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan

grouped_df = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index,['SectionStart']] = group['TrackStart']
        df.loc[group.index,['SectionStop']] = group['TrackStart'] + 899

    if len(group) > 1:
        track_start = group.loc[group.index[0],'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)

        for i,start,stop in zip(group.index,section_starts,section_stops):
            df.loc[i,['SectionStart']] = start
            df.loc[i,['SectionStop']] = stop

#%% This is what a random group looks like without errors
#Note that each section neatly starts where the previous section ended
#There are no gaps (The whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Introducing errors to the data
df.loc[2640,'SectionStart'] += 100
df.loc[5390,'SectionStart'] += 7

#%% This is what the same group looks like after introducing errors 
#Note that the 'SectionStop' of row 1525 is no longer similar to the 'SectionStart' of row 2640
#This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Try to locate the errors
#This is the part of the code I need to speed up

def Full_coverage(group):
    if len(group) > 1:
        #Sort the grouped data by column 'SectionStart' from low to high

        #Updated for newer pandas version
        #group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)

        #Some initial values, overwritten at the end of each loop  
        #These variables correspond to the first row of the group
        start_km = group.iloc[0,4]
        end_km = group.iloc[0,5]
        end_km_index = group.index[0]

        #Loop through all the rows in the group
        #index is the index of the row
        #i is the 'SectionStart' of the row
        #j is the 'SectionStop' of the row
        #The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:,[4,5]].iterrows():

            #The start of the next row must be equal to the end of the previous row in the group
            if i != end_km: 

                #Add the faulty data to the error list
                incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                    'Found startpoint: '+str(i)+' (row '+str(index)+')'))                

            #Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index

    return group

#Check if the complete track is completely defined (from start to end) for each combination of:
    #'ID_number','TimeOfDay','TypeOfCargo','TrackStart'
incomplete_coverage = [] #Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))

#Print the error list
print('\nFound incomplete coverage in the following rows:')
for i,j in incomplete_coverage:
    print(i)
    print(j)
    print() 

#%%Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
%timeit df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))

推荐答案

我相信问题是您的数据有5300个不同的组.因此,功能上任何缓慢的现象都会被放大.您可以在函数中使用向量化操作而不是使用for循环来节省时间,但是要省去几秒钟的简便得多的方法是使用return 0而不是return group.当您return group时,大熊猫实际上将创建一个新的数据对象,该对象将您的已排序组组合在一起,您似乎并没有使用它.当您return 0时,大熊猫将组合5300个零,这要快得多.

The problem, I believe, is that your data has 5300 distinct groups. Due to this, anything slow within your function will be magnified. You could probably use a vectorized operation rather than a for loop in your function to save time, but a much easier way to shave off a few seconds is to return 0 rather than return group. When you return group, pandas will actually create a new data object combining your sorted groups, which you don't appear to use. When you return 0, pandas will combine 5300 zeros instead, which is much faster.

例如:

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
groups = df.groupby(cols)
print(len(groups))
# 5353

%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop

%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop

仅合并不使用的结果大约需要2.4秒;其余时间是循环中的实际计算,应尝试将其向量化.

Just combining the results you don't use is taking about 2.4 seconds; the rest of the time is actual computation in your loop which you should attempt to vectorize.

通过在for循环之前进行快速的附加矢量化检查并返回0而不是group,我将时间缩短到约2秒,这基本上是对每个组进行排序的成本.试试这个功能:

With a quick additional vectorized check before the for loop and returning 0 instead of group, I got the time down to about ~2sec, which is basically the cost of sorting each group. Try this function:

def Full_coverage(group):
    if len(group) > 1:
        group = group.sort('SectionStart', ascending=True)

        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0,4]
            end_km = group.iloc[0,5]
            end_km_index = group.index[0]

            for index, (i, j) in group.iloc[1:,[4,5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                        'Found startpoint: '+str(i)+' (row '+str(index)+')'))                
                start_km = i
                end_km = j
                end_km_index = index

    return 0

cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop


这是一个示例,其中包含了我的建议,以便将排序移动到groupby之外并删除不必要的循环.对于给定的示例,删除循环的速度并不快,但是如果存在很多不完整的地方,则循环会更快:


Edit 2: here's an example which incorporates my suggestion to move the sort outside the groupby and to remove the unnecessary loops. Removing the loops is not much faster for the given example, but will be faster if there are a lot of incompletes:

def Full_coverage_new(group):
    if len(group) > 1:
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number','TimeOfDay','TypeOfCargo','TrackStart']
df_s = df.sort_values(['SectionStart','SectionStop'])
df_s.groupby(cols).apply(Full_coverage_nosort)

这篇关于 pandas groupby申请表现慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆