How to apply a function to multiple columns of a pandas DataFrame in parallel
Question
I have a pandas DataFrame with hundreds of thousands of rows, and I want to apply a time-consuming function on multiple columns of that DataFrame in parallel.
I know how to apply the function serially. For example:
import hashlib
import pandas as pd
df = pd.DataFrame(
    {'col1': range(100_000), 'col2': range(100_000, 200_000)},
    columns=['col1', 'col2'])

def foo(col1, col2):
    # This function is actually much more time consuming in real life
    return hashlib.md5(f'{col1}-{col2}'.encode('utf-8')).hexdigest()

df['md5'] = df.apply(lambda row: foo(row.col1, row.col2), axis=1)
df.head()
# Out[5]:
# col1 col2 md5
# 0 0 100000 92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1 1 100001 01d14f5020a8ba2715cbad51fd4c503d
# 2 2 100002 c0e01b86d0a219cd71d43c3cc074e323
# 3 3 100003 d94e31d899d51bc00512938fc190d4f6
# 4 4 100004 7710d81dc7ded13326530df02f8f8300
But how would I apply the function foo in parallel, utilizing all available cores on my machine?
Recommended answer
The easiest way to do this is with concurrent.futures.
import concurrent.futures
with concurrent.futures.ProcessPoolExecutor(16) as pool:
    df['md5'] = list(pool.map(foo, df['col1'], df['col2'], chunksize=1_000))
df.head()
# Out[10]:
# col1 col2 md5
# 0 0 100000 92e2a2c7a6b7e3ee70a1c5a5f2eafd13
# 1 1 100001 01d14f5020a8ba2715cbad51fd4c503d
# 2 2 100002 c0e01b86d0a219cd71d43c3cc074e323
# 3 3 100003 d94e31d899d51bc00512938fc190d4f6
# 4 4 100004 7710d81dc7ded13326530df02f8f8300
Specifying chunksize=1_000 makes this run faster because rows are sent to the worker processes in batches of 1,000 rather than one at a time (i.e. you pay the overhead of handing a task to a worker process once per 1,000 rows instead of once per row).
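The answer hard-codes 16 workers and a chunk size of 1,000. As a rough sketch of how both could be derived from the machine and the data instead, the helper below sizes the pool from os.cpu_count() and picks a chunk size from the row count. The names parallel_apply and pick_chunksize, and the "four chunks per worker" rule, are illustrative assumptions and not part of the original answer or of any library.

```python
import concurrent.futures
import hashlib
import os


def foo(col1, col2):
    # Same stand-in for an expensive function as in the question.
    return hashlib.md5(f'{col1}-{col2}'.encode('utf-8')).hexdigest()


def pick_chunksize(n_rows, n_workers, chunks_per_worker=4):
    # Hypothetical heuristic: split the rows into a few chunks per worker,
    # never going below one row per chunk.
    return max(1, n_rows // (n_workers * chunks_per_worker))


def parallel_apply(func, *columns, max_workers=None, chunksize=None):
    # Apply func row-wise across equally long columns using worker processes.
    # os.cpu_count() can return None, so fall back to a single worker.
    n_workers = max_workers or os.cpu_count() or 1
    if chunksize is None:
        chunksize = pick_chunksize(len(columns[0]), n_workers)
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in input order, so they line up with the rows.
        return list(pool.map(func, *columns, chunksize=chunksize))
```

With the DataFrame from the question, this would be called as df['md5'] = parallel_apply(foo, df['col1'], df['col2']). On platforms that start worker processes by spawning a fresh interpreter (Windows and macOS by default), that call should sit under an if __name__ == '__main__': guard in a script, and func must be defined at module level so it can be pickled.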
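Note that concurrent.futures is only available in Python 3.2 or newer, and the chunksize argument to map was added in Python 3.5. The f-string and the underscores in numeric literals used above require Python 3.6 or newer.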