multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes


Question

I am using Python multiprocessing, more precisely

from multiprocessing import Pool
p = Pool(15)

args = [(df, config1), (df, config2), ...] #list of args - df is the same object in each tuple
res = p.map_async(func, args) #func is some arbitrary function
p.close()
p.join()

This approach has a huge memory consumption; eating up pretty much all my RAM (at which point it gets extremely slow, hence making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying

shared_df = multiprocessing.Value(pandas.DataFrame, df)
args = [(shared_df, config1), (shared_df, config2), ...] 

(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (same as Sharing a complex object between Python processes?, to which I unfortunately don't understand the answer).

I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually even the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?

Answer

The first argument to Value is typecode_or_type. That is defined as:

typecode_or_type determines the type of the returned object: it is either a ctypes type or a one character typecode of the kind used by the array module. *args is passed on to the constructor for the type.

Emphasis mine. So, you simply cannot put a pandas dataframe in a Value, it has to be a ctypes type.
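As a quick sanity check, here is a minimal sketch of what Value does accept - a ctypes type (or the equivalent one-character typecode from the array module):

```python
from multiprocessing import Value
import ctypes

# Value wants a ctypes type or a one-character typecode, not an arbitrary class
counter = Value(ctypes.c_int, 0)   # equivalent to Value('i', 0)
counter.value += 5
print(counter.value)  # 5

# Value(pandas.DataFrame, df), by contrast, raises
# TypeError: this type has no size
```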

You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There's a few different ways to end up in the same place - probably the easiest is to just plop your dataframe into the manager's Namespace.

from multiprocessing import Manager

mgr = Manager()
ns = mgr.Namespace()
ns.df = my_dataframe

# now just give your processes access to ns, i.e. most simply
# p = Process(target=worker, args=(ns, work_unit))

Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace, it's cleaner.

One thing I didn't/won't cover is events and signaling - if your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples which also covers, in a bit more detail, how to use the manager's Namespace.

(note that none of this addresses whether multiprocessing is going to result in tangible performance benefits, this is just giving you the tools to explore that question)
