Speed up Python apply row-wise functions


Problem description


I am working on a data cleansing project, and as part of it I have to clean multiple fields of a pandas DataFrame. Mostly I am writing regular expressions and simple functions. Examples below:

from string import whitespace  # needed by func2's strip set

def func1(s):
    s = str(s)
    s = s.replace(' ', '')
    if len(s) > 0 and s != '0':
        if s.isalpha() and len(s) < 2:
            return s

def func2(s):
    s = str(s)
    s = s.replace(' ', '')
    s = s.strip(whitespace + ',' + '-' + '/' + '\\')
    if s != '0':
        if s.isalnum() or s.isdigit():
            return s

def func3(s):
    s = str(s)
    if s.isdigit() and s != '0':
        return s
    else:
        return None

def func4(s):
    if str(s['j']).isalpha() and str(s['k']).isdigit() and s['l'] is None:
        return s['k']

Then calling them like this:

x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['d'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
x['g'] = x.apply(lambda x: func4(x), axis = 1)


Everything is fine here; however, I have written nearly 50 such functions, and my dataset has more than 10 million records. The script runs for hours. If my understanding is correct, the functions are called row-wise, so each function is called as many times as there are rows, and it takes a long time to process. Is there a way to optimise this? How can I approach this in a better way? Maybe not through the apply function? Thanks.

Sample dataset:

        Name                               f    j    b
339043  Moir Point RD                      3    0   
21880   Fisher-Point Drive Freemans Ba     6    0   
457170  Whakamoenga Point                 29    0   
318399  Motukaraka Point RD                0    0   
274047  Apirana Avenue Point England     360    0   366
207588  Hobsonville Point RD             127    0   
747136  Dog Point RD                     130    0   
325704  Aroha Road Te Arai Point          36    0   
291888  One Tree Point RD                960    0   
207954  Hobsonville Point RD             160    0   205D
248410  Huia Road Point Chevalier        106    0   

Answer


In general, you should avoid calling .apply on a DataFrame. This is really what is hurting you. Under the hood, it creates a new Series for every row in the DataFrame and sends it to the function passed to .apply. Needless to say, this is quite a lot of overhead per row, and thus .apply on a full DataFrame is slow.


In the below example, I have renamed some of the columns in the function calls since the example data was limited.

import sys
import time
import contextlib
import pandas as pd

@contextlib.contextmanager
def timethis(label):
    '''A context manager to time a bit of code.'''
    print('Timing', label, end=': ')
    sys.stdout.flush()
    start = time.time()
    yield
    print('{:.4g} seconds'.format(time.time() - start))

... func1, func2, and func3 definitions...

def func4(s):
    if str(s['j']).isalpha() and str(s['f']).isdigit() and s['b'] is None:
        return s['f']

x = pd.DataFrame({'f': [3, 6, 29, 0, 360, 127, 130, 36, 960, 160, 106],
                  'j': 0,
                  'b': [None, None, None, None, 366, None, None, None, None, '205D', None]})
x = pd.concat(x for _ in range(100000))
y = x.copy()

x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['j'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
with timethis('func4'):
    x['g'] = x.apply(func4, axis = 1)  # The lambda in your example was not needed

...

def vectorized_func4(df):
    '''Accept the whole DataFrame and not just a single row.'''
    j_isalpha = df['j'].astype(str).str.isalpha()
    f_isdigit = df['f'].astype(str).str.isdigit()
    b_None = df['b'].isnull()
    ret_col = df['f'].copy()
    keep_rows = j_isalpha & f_isdigit & b_None
    ret_col[~keep_rows] = None
    return ret_col
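
The calls below use vectorized_func1 and vectorized_func2, which the answer never shows. As an assumption, here is a sketch of what they might look like, mirroring the logic of func1 and func2 with the Series string methods (the keep-mask approach follows the same pattern as vectorized_func4 above):

```python
import pandas as pd
from string import whitespace

def vectorized_func1(col):
    '''Sketch of a vectorized func1: keep single alphabetic characters.'''
    s = col.astype(str).str.replace(' ', '', regex=False)
    keep = (s.str.len() > 0) & (s != '0') & s.str.isalpha() & (s.str.len() < 2)
    s[~keep] = None
    return s

def vectorized_func2(col):
    '''Sketch of a vectorized func2: keep alphanumeric values other than '0'.'''
    s = col.astype(str).str.replace(' ', '', regex=False)
    s = s.str.strip(whitespace + ',' + '-' + '/' + '\\')
    keep = (s != '0') & (s.str.isalnum() | s.str.isdigit())
    s[~keep] = None
    return s
```

These are illustrative reconstructions, not the answer's original code, but they accept a whole column at once so no per-row Python function call is made.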

y['a'] = vectorized_func1(y['b'])
y['c'] = vectorized_func2(y['j'])
y['e'] = vectorized_func3(y['f'])
with timethis('vectorized_func4'):
    y['g'] = vectorized_func4(y)

Output:

Timing func4: 115.9 seconds
Timing vectorized_func4: 25.09 seconds


It turns out that for func1, func2, and func3 it is a wash performance-wise compared to the vectorized methods. .apply (and .map, for that matter) on a Series isn't so slow, because there is no extra overhead per element. However, this does not mean you should just reach for .apply when you have a Series without investigating the vectorized built-in methods of the Series; more often than not, you are likely to be able to do better than apply.
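
To illustrate the point about built-in Series methods, a minimal sketch (the data is made up): a Series-level .apply and the vectorized string method produce the same result, but the built-in avoids a Python-level function call per element:

```python
import pandas as pd

s = pd.Series(['12', 'a1', '7', '0'])

# Per-element Python function call for every value.
via_apply = s.apply(lambda v: v.isdigit())

# Single vectorized call over the whole Series.
via_vectorized = s.str.isdigit()

print(via_apply.equals(via_vectorized))  # True
```

On a handful of rows the difference is invisible; on millions of rows, skipping the per-element Python call is what the "do better than apply" advice is about.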


Here's how you might rewrite func3 to be vectorized (I added timing statements so we could see what takes the most time).

def vectorized_func3(col):
    with timethis('fillna'):
        col = col.fillna('')
    with timethis('astype'):
        col = col.astype(str)
    with timethis('rest'):
        is_digit_string = col.str.isdigit()
        not_0_string = col != '0'
        keep_rows = is_digit_string & not_0_string
        col[~keep_rows] = None
    return col


Here is the timing compared to func3:

Timing func3: 8.302 seconds
Timing fillna: 0.006584 seconds
Timing astype: 9.445 seconds
Timing rest: 1.65 seconds


It takes a long time just to change the dtype of a Series, since a new Series must be created and then each element gets cast. Everything else is blazing fast. If you could change your algorithm to not require converting to str, or could simply store the data as str in the first place, then the vectorized method would be much faster (especially vectorized_func4).
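
As a sketch of that last point (with made-up data): if the column already holds strings, the expensive astype(str) step disappears and the vectorized check runs directly, following the same keep-mask pattern as vectorized_func3:

```python
import pandas as pd

# Column stored as str from the start: no per-element cast needed.
col = pd.Series(['3', '0', '29', '360', 'abc'])

# Same logic as func3, but no astype(str) pass over the data.
keep = col.str.isdigit() & (col != '0')
result = col.copy()
result[~keep] = None
print(list(result))  # ['3', None, '29', '360', None]
```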

Takeaways

  • Don't use .apply on a full DataFrame unless you absolutely must. If you think you must, go get a cup of coffee, think about it for ten minutes, and try to find a way to do it without .apply.
  • Try not to use .apply on a Series; you can probably do better, but it won't be as bad as on a full DataFrame.
  • Try to come up with an algorithm that does not require constantly converting dtype.

