How to write a huge 2D NumPy array into a buffer
Problem description
I have a huge 2D NumPy array (dtype=bool) and a buffer, and I would like to write this 2D array into the buffer. Currently, I do the following:
# Python version 3.7.7, NumPy version 1.18.5
import numpy as np
# shape in the dummy_array is just an example, sometimes will be bigger
dummy_array = np.array(np.empty((599066148, 213), dtype='bool'), dtype='bool')
# Pyarrow plasma store buffer
buf = client.create(object_id, dummy_array.nbytes)
# Get a NumPy view of the buffer
array = np.frombuffer(buf, dtype="bool").reshape(dummy_array.shape)
# Write the data or the NumPy array to the buffer
array[:] = dummy_array
The problem is that this takes at least 3 minutes. The size of dummy_array is usually 100 to 200 GB, and sometimes even more. I could not figure out how to do this for a 2D array using memoryview and np.ctypeslib.as_array(buf, shape=dummy_array.shape), as mentioned in this question (I tried, but it did not work). Any pointers on doing this in a better or faster way would be great, because I will be doing this at least a few hundred times, so saving even 30 to 60 seconds per iteration would save hours.
You cannot assign to a multi-dimensional memoryview slice:
NotImplementedError: memoryview slice assignments are currently restricted to ndim = 1
So you might need to reshape your array to be one-dimensional before copying it into a memoryview.
>>> dummy_array = np.array(np.empty((2, 213), dtype='bool'), dtype='bool').reshape(2*213)
>>> mem = memoryview(dummy_array)
>>> mem[0]
True
>>> np.frombuffer(mem, dtype="bool").reshape(dummy_array.shape)
array([ True, True, True, True, True, True, False, False, True,
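Building on the one-dimensional idea, the copy into a buffer could be sketched as follows. This is only an illustration under assumptions: a bytearray stands in for the plasma buffer (it is not the pyarrow API), the source array is C-contiguous, and the names src, buf, dst, flat and out are made up for the example. memoryview.cast('B') is used because slice assignment between memoryviews requires both sides to have the same format, and a NumPy bool array exports format '?' while a raw byte buffer exports 'B'.

```python
import numpy as np

# Small 2D bool array standing in for the huge one (assumed C-contiguous).
src = (np.arange(12, dtype=np.uint8).reshape(3, 4) % 2) == 0

# Hypothetical stand-in for the plasma buffer: any writable byte buffer.
buf = bytearray(src.nbytes)

# Flat byte view of the destination (format 'B', ndim 1).
dst = memoryview(buf)

# Flat byte view of the source: cast the 2D '?' view to 1D 'B'.
# cast() does not copy; it requires a C-contiguous source.
flat = memoryview(src).cast('B')

# 1D slice assignment is supported, so this performs the copy.
dst[:] = flat

# Read the buffer back as a 2D bool array to check the result.
out = np.frombuffer(buf, dtype=bool).reshape(src.shape)
print(np.array_equal(out, src))
```

Whether this is any faster than assigning through a NumPy view would have to be measured on the real data; both approaches still copy every byte once.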
If you try to index a memoryview of the multidimensional array, you get this error:
>>> dummy_array = np.array(np.empty((2, 213), dtype='bool'), dtype='bool')
>>> mem = memoryview(dummy_array)
>>> mem[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NotImplementedError: multi-dimensional sub-views are not implemented
I can't tell you whether this will be faster than your other method, but it may give you some ideas for how to get the memoryview version working.
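One way to find out is to time both variants on a small array first. The following is a rough sketch, not a benchmark of the real workload: the array size, the bytearray buffers, and the function names copy_via_numpy_view and copy_via_memoryview are all assumptions made for illustration, and timings on a 100 GB array backed by a shared-memory store may behave quite differently.

```python
import timeit
import numpy as np

# Small random bool array with the same number of columns as in the question.
rng = np.random.default_rng(0)
src = rng.random((2000, 213)) < 0.5

# Two hypothetical destination buffers (stand-ins for the plasma buffer).
buf_a = bytearray(src.nbytes)
buf_b = bytearray(src.nbytes)

def copy_via_numpy_view():
    # The approach from the question: assign through a NumPy view of the buffer.
    view = np.frombuffer(buf_a, dtype=bool).reshape(src.shape)
    view[:] = src

def copy_via_memoryview():
    # The 1D memoryview approach: both sides are flat byte views.
    memoryview(buf_b)[:] = memoryview(src).cast('B')

t_numpy = timeit.timeit(copy_via_numpy_view, number=20)
t_mem = timeit.timeit(copy_via_memoryview, number=20)
print(f"NumPy view: {t_numpy:.4f} s, memoryview: {t_mem:.4f} s (20 runs each)")
```

Both functions should leave identical bytes in their buffers, so any difference in the printed times reflects only the copy path.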