Multiprocessing on a model with data frame as input
Question
I want to use multiprocessing on a model to get predictions using a data frame as input. I have the following code:
def perform_model_predictions(model, dataFrame, cores=4):
    try:
        with Pool(processes=cores) as pool:
            result = pool.map(model.predict, dataFrame)
            return result
        # return model.predict(dataFrame)
    except AttributeError:
        logging.error("AttributeError occurred", exc_info=True)
The error I get is:
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
I think the issue is that I'm passing a data frame as the second argument to the pool.map function. Any advice or help would be appreciated.
Recommended answer
The trick is to split your dataframe into chunks. map expects a list of objects, each of which is then processed by model.predict. Here's a full working example, with the model obviously mocked:
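import logging

import numpy as np
import pandas as pd
from multiprocessing import Pool

no_cores = 4

# Two columns of random numbers stand in for the real data.
large_df = pd.concat([pd.Series(np.random.rand(1111)), pd.Series(np.random.rand(1111))], axis=1)

# Group consecutive rows into chunks; for this frame the size below yields no_cores chunks.
chunk_size = len(large_df) // no_cores + no_cores
chunks = [df_chunk for g, df_chunk in large_df.groupby(np.arange(len(large_df)) // chunk_size)]


class model(object):
    # Mock model: returns a single random 0/1 value per chunk.
    @staticmethod
    def predict(df):
        return np.random.randint(0, 2)


def perform_model_predictions(model, dataFrame, cores):
    # dataFrame is a list of chunks; map calls model.predict once per chunk.
    try:
        with Pool(processes=cores) as pool:
            result = pool.map(model.predict, dataFrame)
            return result
        # return model.predict(dataFrame)
    except AttributeError:
        logging.error("AttributeError occurred", exc_info=True)


# The guard keeps worker processes from re-running this call when they import the module.
if __name__ == "__main__":
    perform_model_predictions(model, chunks, no_cores)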
Note that the number of chunks here is chosen to match the number of cores (or whatever number of processes you want to allocate). This way each core gets a fair share of the work and multiprocessing does not spend much time on object serialization.
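One way to get exactly as many chunks as worker processes, and to stitch the per-chunk results back together in the original row order, is sketched below. The helper names split_into_chunks, predict_chunk and predict_in_parallel are hypothetical, and predict_chunk is only a stand-in that assumes the real model.predict returns one value per row; none of this comes from the original answer.

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool


def split_into_chunks(df, n_chunks):
    """Split df row-wise into n_chunks consecutive pieces of near-equal size."""
    # Split positional indices rather than the frame itself, then slice with iloc.
    index_blocks = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[block] for block in index_blocks]


def predict_chunk(chunk):
    """Stand-in for model.predict (assumption): one 0/1 label per row."""
    return (chunk.sum(axis=1) > 1.0).astype(int).to_numpy()


def predict_in_parallel(df, n_workers):
    """Run predict_chunk on each chunk in a worker process and rejoin the results."""
    chunks = split_into_chunks(df, n_workers)
    with Pool(processes=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        per_chunk = pool.map(predict_chunk, chunks)
    return np.concatenate(per_chunk)
```

Calling predict_in_parallel(pd.DataFrame(np.random.rand(1111, 2)), n_workers=4) from under an `if __name__ == "__main__":` guard should give one prediction per input row. Splitting the index positions keeps the chunks contiguous, so concatenating the results preserves row order.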
If you'd like to process each row (pd.Series) separately, time spent on serialization could be a concern. In that case I'd recommend using joblib and reading the docs on its various backends. I did not cover it here because it seemed you want to call predict on a pd.DataFrame.
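For the row-by-row case, a minimal joblib sketch might look like the following. Parallel and delayed are joblib's standard entry points; predict_row and predict_rows are hypothetical names, predict_row is a stand-in for a model that scores a single pd.Series, and the n_jobs value is illustrative only.

```python
import pandas as pd
from joblib import Parallel, delayed


def predict_row(row):
    """Stand-in for a per-row model call (assumption): 1 if the row sum exceeds 1."""
    return int(row.sum() > 1.0)


def predict_rows(df, n_jobs=2):
    """Score every row of df separately, returning one label per row in order."""
    # iterrows yields (index, pd.Series) pairs; only the Series is passed to the worker.
    tasks = (delayed(predict_row)(row) for _, row in df.iterrows())
    # By default Parallel returns a list of results in task order.
    return Parallel(n_jobs=n_jobs)(tasks)
```

Which backend suits a given model (process-based or thread-based) depends on whether predict releases the GIL and on how expensive the rows are to serialize, so it is worth measuring both.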
Extra warning
It can happen that multiprocessing, instead of giving you better performance, makes it worse. This happens in rather rare situations when your model.predict calls external modules that themselves spawn threads. I wrote about the issue here. Long story short, joblib could again be the answer.
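A commonly used mitigation for this kind of thread oversubscription, separate from switching to joblib, is to cap the native thread pools in each process. The sketch below does so with environment variables set before NumPy is imported. Which variables actually matter depends on the BLAS/OpenMP build in use, so the list is an assumption, and limit_native_threads is a hypothetical helper name.

```python
import os

# Environment variables read by common native thread pools (OpenMP, OpenBLAS, MKL).
# Whether each one has any effect depends on how the numerical libraries were built.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n_threads=1):
    """Ask native libraries to use at most n_threads threads per process."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)


# Call this before numpy (or the model library) is first imported,
# because the thread pools usually read these variables at load time.
limit_native_threads(1)

import numpy as np  # noqa: E402  (imported after the limits on purpose)
```

Worker processes started afterwards inherit the environment, so each one would run its native code with a single thread while the pool provides the parallelism.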