Why are multiprocessing.sharedctypes assignments so slow?


Question

Here's some benchmarking code to illustrate my question:

import numpy as np
import multiprocessing as mp
# allocate memory
%time temp = mp.RawArray(np.ctypeslib.ctypes.c_uint16, int(1e8))
Wall time: 46.8 ms
# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 10.3 s
# equivalent numpy assignment, 100X faster
%time a = np.arange(1e8, dtype = np.uint16)
Wall time: 111 ms

Basically I want a numpy array to be shared between multiple processes because it's big and read-only. This method works great: no extra copies are made, and the actual computation time in the worker processes is good. But the overhead of creating the shared array is immense.

This post offered some great insight into why certain ways of initializing the array are slow (note that in the example above I'm using the faster method). But it doesn't really describe how to bring the speed up to numpy-like performance.

Does anyone have suggestions on how to improve the speed? Would some Cython code to allocate the array make sense?

I'm working on a Windows 7 x64 system.
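For reference, the no-copy sharing pattern in question can be sketched as follows (the `_init`/`_read` helper names are illustrative, not from the original post; the pool initializer is used because Windows spawns rather than forks worker processes):

```python
import multiprocessing as mp
import numpy as np

def _init(arr):
    # Runs once in each worker; stores the inherited shared buffer globally
    global _shared
    _shared = arr

def _read(i):
    # Wrap the shared buffer as a numpy array -- no copy is made
    view = np.frombuffer(_shared, dtype=np.uint16)
    return int(view[i])

if __name__ == "__main__":
    shared = mp.RawArray(np.ctypeslib.ctypes.c_uint16, 10)
    shared[:] = np.arange(10, dtype=np.uint16)  # the slow assignment step at issue
    with mp.Pool(2, initializer=_init, initargs=(shared,)) as pool:
        print(pool.map(_read, [3, 7]))  # workers read the shared data directly
```

Each worker sees the same underlying memory through `np.frombuffer`, so only the initial fill (the subject of this question) costs anything.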

Answer

This is slow for the reasons given in your second link, and the solution is actually pretty simple: bypass the (slow) RawArray slice-assignment code, which in this case inefficiently reads one raw C value at a time from the source array to create a Python object, converts it straight back to raw C for storage in the shared array, discards the temporary Python object, and repeats this 1e8 times.

But you don't need to do it that way; like most C-level things, RawArray implements the buffer protocol, which means you can convert it to a memoryview: a view of the underlying raw memory that implements most operations in C-like ways, using raw memory operations where possible. So instead of doing:

# assign memory, very slow
%time temp[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 9.75 s  # Updated to what my machine took, for valid comparison

use a memoryview to manipulate it as a raw bytes-like object and assign that way (np.arange already implements the buffer protocol, and memoryview's slice-assignment operator seamlessly uses it):

# C-like memcpy effectively, very fast
%time memoryview(temp)[:] = np.arange(1e8, dtype = np.uint16)
Wall time: 74.4 ms  # Takes 0.76% of original time!!!

Note that the time for the latter is milliseconds, not seconds; copying via memoryview wrapping to perform raw memory transfers takes less than 1% of the time of the plodding way RawArray does it by default!
