Multiprocessing: why is a numpy array shared with the child processes, while a list is copied?


Problem description

I used this script (see code at the end) to assess whether a global object is shared or copied when the parent process is forked.

Briefly, the script creates a global data object, and the child processes iterate over data. The script also monitors the memory usage to assess whether the object was copied in the child processes.

Here are the results:

  1. data = np.ones((N,N)). Operation in the child process: data.sum(). Result: data is shared (no copy).
  2. data = list(range(pow(10, 8))). Operation in the child process: sum(data). Result: data is copied.
  3. data = list(range(pow(10, 8))). Operation in the child process: for x in data: pass. Result: data is copied.

Result 1) is expected because of copy-on-write. I am a bit puzzled by results 2) and 3). Why is data copied?

Script

import multiprocessing as mp
import numpy as np
import logging

logger = mp.log_to_stderr(logging.WARNING)

def free_memory():
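    # Sum the MemFree, Buffers and Cached fields from /proc/meminfo (Linux-only), in kB.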
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total

def worker(i):
    x = data.sum()    # Exercise access to data
    logger.warning('Free memory: {m}'.format(m=free_memory()))

def main():
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

# Module-level code runs in the parent before fork(), so data exists as a global in every child.
logger.warning('Initial free: {m}'.format(m=free_memory()))
N = 15000
data = np.ones((N, N))
logger.warning('After allocating data: {m}'.format(m=free_memory()))

if __name__ == '__main__':
    main()
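
As a side note, free_memory() measures system-wide free memory, which can be noisy on a busy machine. A more direct check (a sketch, assuming Linux; resident_kb is a hypothetical helper, not part of the original script) would have each child read its own resident set size from /proc/self/status:

def resident_kb():
    # Return this process's resident set size in kB, from the VmRSS line of /proc/self/status.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS'):
                return int(line.split()[1])
    return 0

Logging resident_kb() from worker before and after the access to data would show the child's RSS growing only when pages are actually copied.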

Detailed results

Run 1 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 23.3 GB
[WARNING/Process-2] Free memory: 23.3 GB
[WARNING/Process-4] Free memory: 23.3 GB
[WARNING/Process-1] Free memory: 23.3 GB
[WARNING/Process-3] Free memory: 23.3 GB

Run 2 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 21.9 GB
[WARNING/Process-2] Free memory: 12.6 GB
[WARNING/Process-4] Free memory: 12.7 GB
[WARNING/Process-1] Free memory: 16.3 GB
[WARNING/Process-3] Free memory: 17.1 GB

Run 3 output

[WARNING/MainProcess] Initial free: 25.1 GB
[WARNING/MainProcess] After allocating data: 21.9 GB
[WARNING/Process-2] Free memory: 12.6 GB
[WARNING/Process-4] Free memory: 13.1 GB
[WARNING/Process-1] Free memory: 14.6 GB
[WARNING/Process-3] Free memory: 19.3 GB

Accepted answer

They're all copy-on-write. What you're missing is that when you do, e.g.,

for x in data:
    pass

the reference count on every object contained in data is temporarily incremented by 1, one at a time, as x is bound to each object in turn. For int objects, the refcount in CPython is part of the basic object layout, so the object gets copied (you did mutate it, because the refcount changes).
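
You can watch the refcount bump directly with sys.getrefcount (a minimal sketch, assuming CPython; large ints are used so the values aren't shared via the small-int cache):

import sys

data = [10**10 + i for i in range(3)]   # three unique, non-cached int objects

print(sys.getrefcount(data[0]))   # 2: the list's reference plus getrefcount's argument

for x in data:
    if x is data[0]:
        # 3: binding x added a reference, i.e. the ob_refcnt field stored
        # inside the int object itself was written, dirtying its memory page.
        print(sys.getrefcount(x))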

To make something more analogous to the numpy.ones case, try, e.g.,

data = [1] * 10**8

Then there's only a single unique object referenced many (10**8) times by the list, so there's very little to copy (the same object's refcount gets incremented and decremented many times).
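
A quick way to see the difference (scaled down to 10**4 elements for illustration):

data_range = list(range(10**4))   # ~10**4 distinct int objects
data_ones = [1] * 10**4           # 10**4 references to one shared int object

print(len({id(x) for x in data_range}))   # 10000 distinct objects
print(len({id(x) for x in data_ones}))    # 1 distinct object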

