Speed up Python apply row-wise functions
Question
I am working on a data cleansing project in which I have to clean multiple fields of a pandas DataFrame. Mostly I am writing regular expressions and simple functions. Examples below:
from string import whitespace

def func1(s):
    s = str(s)
    s = s.replace(' ', '')
    if len(s) > 0 and s != '0':
        if s.isalpha() and len(s) < 2:
            return s

def func2(s):
    s = str(s)
    s = s.replace(' ', '')
    s = s.strip(whitespace+','+'-'+'/'+'\\')
    if s != '0':
        if s.isalnum() or s.isdigit():
            return s

def func3(s):
    s = str(s)
    if s.isdigit() and s != '0':
        return s
    else:
        return None

def func4(s):
    if str(s['j']).isalpha() and str(s['k']).isdigit() and s['l'] is None:
        return s['k']
I call them like this:
x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['d'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
x['g'] = x.apply(lambda x: func4(x), axis = 1)
Everything works, but I have written nearly 50 functions like these and my dataset has more than 10 million records. The script runs for hours. If my understanding is correct, the functions are called row-wise, so each function is called once per row, and that takes a long time. Is there a way to optimise this? How can I approach it in a better way, perhaps without the apply function? Thanks.
Sample dataset:
        Name                              f    j  b
339043  Moir Point RD                     3    0
21880   Fisher-Point Drive Freemans Ba    6    0
457170  Whakamoenga Point                 29   0
318399  Motukaraka Point RD               0    0
274047  Apirana Avenue Point England      360  0  366
207588  Hobsonville Point RD              127  0
747136  Dog Point RD                      130  0
325704  Aroha Road Te Arai Point          36   0
291888  One Tree Point RD                 960  0
207954  Hobsonville Point RD              160  0  205D
248410  Huia Road Point Chevalier         106  0
Recommended answer
In general, you should avoid calling .apply on a DataFrame. This is really what is hurting you. Under the hood, it creates a new Series for each row in the DataFrame and sends that to the function passed to .apply. Needless to say, this is quite a lot of overhead per row, and thus .apply on a full DataFrame is slow.
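To make the per-row overhead visible, here is a minimal sketch (not part of the original answer) that records what .apply hands to the function when axis=1 is used. The names record_type and seen_types, and the tiny DataFrame, are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; column names borrowed from the question's sample data.
df = pd.DataFrame({'f': [3, 6, 29], 'j': [0, 0, 0]})

seen_types = []  # hypothetical helper list, only for this demonstration

def record_type(row):
    # With axis=1, pandas passes each row to the function as its own Series object.
    seen_types.append(type(row).__name__)
    return row['f']

result = df.apply(record_type, axis=1)

print(set(seen_types))            # the kind of object the function received
print(result.equals(df['f']))     # the same values are available without .apply
```

Every call receives a freshly built Series, and here the result is just the existing column df['f'], which could have been used directly without constructing any row objects.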
In the example below, I have renamed some of the columns in the function calls, since the example data was limited.
import sys
import time
import contextlib
import pandas as pd

@contextlib.contextmanager
def timethis(label):
    '''A context manager to time a bit of code.'''
    print('Timing', label, end=': ')
    sys.stdout.flush()
    start = time.time()
    yield
    print('{:.4g} seconds'.format(time.time() - start))

... func1, func2, and func3 definitions...

def func4(s):
    if str(s['j']).isalpha() and str(s['f']).isdigit() and s['b'] is None:
        return s['f']

x = pd.DataFrame({'f': [3, 6, 29, 0, 360, 127, 130, 36, 960, 160, 106],
                  'j': 0,
                  'b': [None, None, None, None, 366, None, None, None, None, '205D', None]})
x = pd.concat(x for _ in range(100000))
y = x.copy()

x['a'] = x['b'].apply(lambda x: func1(x) if pd.notnull(x) else x)
x['c'] = x['j'].apply(lambda x: func2(x) if pd.notnull(x) else x)
x['e'] = x['f'].apply(lambda x: func3(x) if pd.notnull(x) else x)
with timethis('func4'):
    x['g'] = x.apply(func4, axis = 1)  # The lambda in your example was not needed

...

def vectorized_func4(df):
    '''Accept the whole DataFrame and not just a single row.'''
    j_isalpha = df['j'].astype(str).str.isalpha()
    f_isdigit = df['f'].astype(str).str.isdigit()
    b_None = df['b'].isnull()
    ret_col = df['f'].copy()
    keep_rows = j_isalpha & f_isdigit & b_None
    ret_col[~keep_rows] = None  # blank out rows that fail any condition
    return ret_col

y['a'] = vectorized_func1(y['b'])
y['c'] = vectorized_func2(y['j'])
y['e'] = vectorized_func3(y['f'])
with timethis('vectorized_func4'):
    y['g'] = vectorized_func4(y)
Output:
Timing func4: 115.9 seconds
Timing vectorized_func4: 25.09 seconds
It turns out that for func1, func2, and func3 it is a wash as far as performance goes when compared to the vectorized methods. .apply (and .map, for that matter) on a Series isn't so slow because there is no extra overhead per element. However, this does not mean that you should just use .apply whenever you have a Series without investigating the vectorized built-in methods of the Series: more often than not, you are likely to be able to do better than apply.
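The answer does not show its vectorized_func1, so the following is only a hypothetical sketch of how func1 might be expressed with the built-in Series.str methods. The name vectorized_func1_sketch is invented here, and the function returns NaN (rather than None) for values that fail the checks, which is an assumption about what is acceptable downstream.

```python
import pandas as pd

def vectorized_func1_sketch(col):
    """Hypothetical vectorized counterpart of func1 (not from the original answer)."""
    # Treat missing values as empty strings, then strip all spaces, as func1 does.
    s = col.fillna('').astype(str).str.replace(' ', '', regex=False)
    # func1 keeps only single alphabetic characters; '0' and '' fail isalpha anyway.
    keep = s.str.isalpha() & (s.str.len() == 1)
    # Values that fail the checks become NaN.
    return s.where(keep)

sample = pd.Series(['A', 'AB', '0', None, ' b ', 366])
print(vectorized_func1_sketch(sample))
```

Note that this still pays for the astype(str) conversion, so it is mainly an illustration of the .str accessor and not a guaranteed speedup.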
Here's how you might rewrite func3 to be vectorized (I added timing statements so we can see what takes the most time).
def vectorized_func3(col):
    with timethis('fillna'):
        col = col.fillna('')
    with timethis('astype'):
        col = col.astype(str)
    with timethis('rest'):
        is_digit_string = col.str.isdigit()
        not_0_string = col != '0'
        keep_rows = is_digit_string & not_0_string
        col[~keep_rows] = None
    return col
Here are the timings compared with func3:
Timing func3: 8.302 seconds
Timing fillna: 0.006584 seconds
Timing astype: 9.445 seconds
Timing rest: 1.65 seconds
It takes a long time just to change the dtype of a Series, since a new Series must be created and then each element gets cast. Everything else is blazing fast. If you could change your algorithm so that it does not require changing the datatype to str, or could simply store the data as str in the first place, then the vectorized method would be much faster (especially vectorized_func4).
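As one illustration of avoiding the str conversion, here is a minimal sketch that assumes the column is already stored with an integer dtype (as f is in the example data). Under that assumption, func3's "digits only and not '0'" check reduces to "greater than zero". The name func3_numeric_sketch is hypothetical, and unlike func3 it returns numbers and NaN instead of strings and None.

```python
import pandas as pd

def func3_numeric_sketch(col):
    """Hypothetical dtype-preserving variant of func3; assumes an integer column."""
    # For integers, str(v).isdigit() and str(v) != '0' is equivalent to v > 0,
    # so no cast to str is needed. Failing values become NaN.
    return col.where(col > 0)

f = pd.Series([3, 6, 29, 0, 360])
print(func3_numeric_sketch(f))
```

Whether this is a valid substitute depends on the real data: it would not apply to a column that mixes numbers with strings such as '205D'.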
Takeaways
- Don't use .apply on a full DataFrame unless you absolutely must. If you think you must, go get a cup of coffee, think about it for ten minutes, and try to find a way to do it without .apply.
- Try not to use .apply on a Series. You can probably do better, but it won't be as bad as on a full DataFrame.
- Try to come up with an algorithm that does not require constantly converting dtype.