python multiprocessing - OverflowError('cannot serialize a bytes object larger than 4GiB')

Problem Description

We are running a script with the multiprocessing library (Python 3.6), in which a large pd.DataFrame is passed as an argument to a function:

from multiprocessing import Pool
import time

def my_function(big_df):
    # do something time consuming
    time.sleep(50)

if __name__ == '__main__':
    with Pool(10) as p:
        res = {}
        for id, big_df in some_dict_of_big_dfs.items():
            res[id] = p.apply_async(my_function, (big_df,))
        output = {id: r.get() for id, r in res.items()}
The problem is that we are getting an error from the pickle library:

Reason: 'OverflowError('cannot serialize a bytes object larger than 4GiB',)'

We are aware that pickle protocol 4 can serialize larger objects (see the related question), but we don't know how to modify the protocol that multiprocessing is using.
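For context, the 4 GiB cap comes from pickle protocols 3 and below, which store the payload size in a 4-byte field; protocol 4 (available since Python 3.4) adds opcodes with 8-byte sizes. A minimal sketch of the difference (note that actually running it needs roughly 12 GiB of free RAM):

import pickle

big = bytes(2**32)  # 4 GiB of zero bytes, just past what protocols <= 3 can frame

# pickle.dumps(big, protocol=3)  # raises OverflowError: cannot serialize a bytes object larger than 4 GiB
blob = pickle.dumps(big, protocol=4)  # protocol 4 frames sizes in 8 bytes, so this succeeds
assert pickle.loads(blob) == big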

Does anybody know what to do? Thanks!

Recommended Answer

Apparently there is an open issue about this topic, and there are a few related initiatives described in this particular answer. I found a way to change the default pickle protocol used by the multiprocessing library, based on that answer. As was pointed out in the comments, this solution only works on Linux and OS X, whose multiprocessing machinery forks worker processes.

Solution:

First, create a new, separate module:

pickle4reducer.py

from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    def __init__(self, *args):
        # Force pickle protocol 4; protocols <= 3 cannot serialize
        # objects larger than 4 GiB. (*args arrives as a tuple, so it
        # must be copied into a list before the protocol slot is set.)
        args = list(args)
        if len(args) > 1:
            args[1] = 4
        else:
            args.append(4)
        super().__init__(*args)

    @classmethod
    def dumps(cls, obj, protocol=4):
        return ForkingPickler.dumps(obj, protocol)


def dump(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)


class Pickle4Reducer(AbstractReducer):
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump

Then, in your main script, add the following:

import pickle4reducer
import multiprocessing as mp

ctx = mp.get_context()
ctx.reducer = pickle4reducer.Pickle4Reducer()

with mp.Pool(4) as p:
    ...  # do something
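Note that mp.get_context() returns multiprocessing's default context, so the reducer assigned to it is the one the plain mp.Pool(4) call above picks up; if you prefer to make that dependency explicit, creating the pool with ctx.Pool(4) should work just as well.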

That will probably solve the overflow problem.

But, a word of warning: you might consider reading this before doing anything, or you might run into the same error I did:

'i' format requires -2147483648 <= number <= 2147483647

(The reason for this error is well explained in the link above: the connection layer packs each message's length into a 32-bit signed int, which caps any single send at about 2 GiB regardless of the pickle protocol.) Long story short: multiprocessing sends data between its processes using the pickle protocol, so if you are already hitting the 4 GiB limit, you might consider redefining your functions as "void" methods rather than input/output methods. All this inbound/outbound data increases RAM usage, is probably inefficient by construction (it was in my case), and it may be better to point every process at the same object rather than create a new copy for each call.
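As a sketch of that last idea: under the fork start method (the default on Linux; explicitly requestable on macOS), workers inherit the parent's memory when the pool forks, so a module-level global assigned before the pool is created can be read in every worker without pickling it at all. The BIG_DF global and the toy frame below are hypothetical stand-ins:

import multiprocessing as mp
import pandas as pd

BIG_DF = None  # hypothetical global, filled in before the pool forks

def my_function(task_id):
    # Read the inherited frame instead of receiving it as a pickled argument.
    return task_id, len(BIG_DF)

if __name__ == '__main__':
    BIG_DF = pd.DataFrame({'x': range(1_000_000)})
    # With 'fork', children see BIG_DF via copy-on-write inherited memory.
    with mp.get_context('fork').Pool(4) as p:
        print(p.map(my_function, range(3)))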

Hope this helps.
