Shared memory in multiprocessing

Question

I have three large lists. The first contains bitarrays (module bitarray 0.8.0) and the other two contain arrays of integers.

l1=[bitarray 1, bitarray 2, ... ,bitarray n]
l2=[array 1, array 2, ... , array n]
l3=[array 1, array 2, ... , array n]

These data structures take quite a bit of RAM (~16 GB total).

If I start 12 sub-processes using:

multiprocessing.Process(target=someFunction, args=(l1,l2,l3))

does this mean that l1, l2 and l3 will be copied for each sub-process, or will the sub-processes share these lists? Or, to be more direct, will I use 16 GB or 192 GB of RAM?

someFunction will read some values from these lists and then perform some calculations based on the values read. The results will be returned to the parent process. The lists l1, l2 and l3 will not be modified by someFunction.

Therefore I would assume that the sub-processes do not need to, and would not, copy these huge lists, but would instead just share them with the parent, meaning that the program would take 16 GB of RAM (regardless of how many sub-processes I start) due to the copy-on-write approach under Linux. Am I correct, or am I missing something that would cause the lists to be copied?

EDIT: I am still confused after reading a bit more on the subject. On the one hand, Linux uses copy-on-write, which should mean that no data is copied. On the other hand, accessing an object changes its ref-count (I am still unsure why, and what that means). Even so, will the entire object be copied?
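The ref-count behaviour can be observed directly with sys.getrefcount. In CPython the reference count is stored inside the object itself, so even a read-only access that temporarily binds the object writes to the object's memory page, and that write is what defeats copy-on-write. A minimal illustration (the reported numbers are CPython-specific):

```python
import sys

x = [1, 2, 3]
# getrefcount reports one extra reference for its own argument,
# so a fresh object bound to a single name reports 2 in CPython.
print(sys.getrefcount(x))  # → 2

y = x                      # binding another name bumps the count
print(sys.getrefcount(x))  # → 3
```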

For example, if I define someFunction as follows:

import random

def someFunction(list1, list2, list3):
    i = random.randint(0, 99999)
    print(list1[i], list2[i], list3[i])

Would using this function mean that l1, l2 and l3 will be copied entirely for each sub-process?

Is there a way to check this?
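One way to check is to compare the child's resident set size before and after it reads the inherited data. A minimal sketch (Linux only, and it assumes the default fork start method so that args are inherited rather than pickled; the function names are illustrative):

```python
import resource
from multiprocessing import Process

def rss_kib():
    # Peak resident set size of the calling process (KiB on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def child(data):
    before = rss_kib()
    total = sum(data)  # reading every element touches every refcount
    after = rss_kib()
    print("RSS grew by ~%d KiB while reading %d ints" % (after - before, len(data)))

if __name__ == "__main__":
    big = list(range(2_000_000))  # tens of MB allocated in the parent
    p = Process(target=child, args=(big,))
    p.start()
    p.join()
```

If copy-on-write were preserved, the growth would be small; in practice the refcount writes dirty nearly every page holding the int objects.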

EDIT2: After reading a bit more and monitoring the total memory usage of the system while the sub-processes are running, it seems that entire objects are indeed copied for each sub-process, and it seems to be because of reference counting.

The reference counting for l1, l2 and l3 is actually unneeded in my program, because l1, l2 and l3 will be kept in memory (unchanged) until the parent process exits. There is no need to free the memory used by these lists until then. In fact, I know for sure that the reference counts will remain above 0 (for these lists and every object in them) until the program exits.

So now the question becomes: how can I make sure that the objects will not be copied to each sub-process? Can I perhaps disable reference counting for these lists and every object in them?

EDIT3: Just an additional note. The sub-processes do not need to modify l1, l2, l3 or any objects in these lists. They only need to be able to reference some of these objects without causing the memory to be copied for each sub-process.

Answer

Generally speaking, there are two ways to share the same data:

  • Multithreading
  • Shared memory

Python's multithreading is not suitable for CPU-bound tasks (because of the GIL), so the usual solution in that case is to go on with multiprocessing. However, with this solution you need to explicitly share the data, using multiprocessing.Value and multiprocessing.Array.

Note that, usually, sharing data between processes may not be the best choice, because of all the synchronization issues; an approach involving actors exchanging messages is usually seen as a better choice. See also the Python documentation:

As mentioned above, when doing concurrent programming it is usually best to avoid using shared state as far as possible. This is particularly true when using multiple processes.

However, if you really do need to use some shared data then multiprocessing provides a couple of ways of doing so.
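The message-passing alternative mentioned above can be sketched with multiprocessing.Queue: instead of sharing the lists, the parent sends each worker only the values (or indices) it needs and collects the results back. The worker function and queue names here are illustrative, not part of the original question:

```python
from multiprocessing import Process, Queue

def worker(tasks, results):
    # Actor-style loop: consume messages until the None sentinel arrives.
    for item in iter(tasks.get, None):
        results.put(item * item)

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    p = Process(target=worker, args=(tasks, results))
    p.start()
    for i in range(3):
        tasks.put(i)
    tasks.put(None)  # tell the worker to stop
    p.join()
    print(sorted(results.get() for _ in range(3)))  # → [0, 1, 4]
```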

In your case, you need to wrap l1, l2 and l3 in some way understandable by multiprocessing (e.g. by using a multiprocessing.Array), and then pass them as parameters.
Note also that, since you said you do not need write access, you should pass lock=False while creating the objects, or all access will still be serialized.
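A minimal sketch of that suggestion for one of the integer lists (the bitarray objects would first need converting to something ctypes-compatible, such as a byte buffer; the variable names are illustrative):

```python
from multiprocessing import Process, Array

def someFunction(shared_l2):
    # Reads index the shared ctypes buffer directly; no per-process copy.
    print(shared_l2[0], shared_l2[-1])

if __name__ == "__main__":
    l2 = [10, 20, 30, 40]
    # 'i' = C int; lock=False because access is read-only.
    shared_l2 = Array('i', l2, lock=False)
    p = Process(target=someFunction, args=(shared_l2,))
    p.start()
    p.join()
```

With lock=False, Array returns a raw ctypes array with no synchronization wrapper, which is safe here precisely because the sub-processes never write to it.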
