How to write a huge 2D NumPy array into a buffer

Problem description

I have a huge 2D NumPy array (dtype=bool) and a buffer, and I would like to write the array into the buffer. Currently, I do the following:

# Python version 3.7.7, NumPy version 1.18.5
import numpy as np

# The shape is just an example; the real array is sometimes bigger
dummy_array = np.empty((599066148, 213), dtype=bool)

# PyArrow Plasma store buffer (client and object_id come from the Plasma setup, not shown here)
buf = client.create(object_id, dummy_array.nbytes)

# Get a NumPy view of the buffer
array = np.frombuffer(buf, dtype=bool).reshape(dummy_array.shape)

# Write the NumPy array into the buffer
array[:] = dummy_array

The problem is that this takes at least 3 minutes. dummy_array is usually 100 to 200 GB and sometimes even more. I could not figure out how to do this for a 2D array using memoryview and np.ctypeslib.as_array(buf, shape=dummy_array.shape), as mentioned in this question (I tried, but it did not work). Any pointers to a better or faster way would be great: I will be doing this at least a few hundred times, so saving even 30 to 60 seconds per iteration would save hours.

Solution

You cannot assign to a slice of a multidimensional memoryview:

NotImplementedError: memoryview slice assignments are currently restricted to ndim = 1

So you might need to reshape your array to be one-dimensional before copying it into a memoryview.

>>> import numpy as np
>>> dummy_array = np.empty((2, 213), dtype='bool').reshape(2*213)
>>> mem = memoryview(dummy_array)
>>> mem[0]
True
>>> np.frombuffer(mem, dtype="bool").reshape(dummy_array.shape)
array([ True,  True,  True,  True,  True,  True, False, False,  True,
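
A minimal sketch of that reshape-then-copy idea, end to end. It assumes a plain bytearray as a stand-in for the Plasma buffer (any writable object exposing the buffer protocol should behave similarly), and copy_via_memoryview is a hypothetical helper name, not part of NumPy or PyArrow. The source is viewed as flat uint8 so that both memoryviews share the byte format 'B', which 1D slice assignment requires.

```python
import numpy as np


def copy_via_memoryview(dst_buf, src):
    """Copy a C-contiguous array into a writable buffer through 1D memoryviews.

    `dst_buf` is a stand-in for the real target buffer; it must be writable
    and exactly `src.nbytes` bytes long.
    """
    # reshape(-1) on a C-contiguous array is a view, not a copy; viewing it as
    # uint8 gives the memoryview the byte format 'B', matching a raw buffer.
    flat = src.reshape(-1).view(np.uint8)
    dst = memoryview(dst_buf).cast("B")
    # Slice assignment is allowed because both sides are one-dimensional.
    dst[:] = memoryview(flat)


# Small example shape; the real array in the question is far larger.
src = np.random.rand(4, 213) > 0.5
buf = bytearray(src.nbytes)  # assumption: stands in for client.create(...)
copy_via_memoryview(buf, src)

# Read the buffer back as a 2D bool array without copying.
restored = np.frombuffer(buf, dtype=bool).reshape(src.shape)
```

The 2D shape is only reapplied on the NumPy side after the copy; the memoryview itself never has to handle more than one dimension.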

If you try the same thing with a multidimensional array, indexing the memoryview raises this error.

>>> dummy_array = np.empty((2, 213), dtype='bool')
>>> mem = memoryview(dummy_array)
>>> mem[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: multi-dimensional sub-views are not implemented
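
One possible way around that limitation, shown as a sketch: memoryview.cast('B') reinterprets a C-contiguous multidimensional view as one-dimensional bytes without copying, after which indexing and slice assignment work. The (2, 213) shape is just the example from above, and nothing here is specific to Plasma.

```python
import numpy as np

arr = np.zeros((2, 213), dtype=bool)
arr[1, 5] = True

mem = memoryview(arr)  # 2D view over the array's memory

try:
    mem[0]  # indexing a row of a multidimensional view is not supported
    sub_view_error = None
except NotImplementedError as exc:
    sub_view_error = str(exc)

# cast("B") exposes the same memory as a flat run of bytes (no copy is made).
flat = mem.cast("B")
first = flat[0]               # element access now works and returns an int
flat[0:3] = b"\x01\x01\x01"   # 1D slice assignment writes straight into arr
```

Because flat shares memory with arr, the slice assignment at the end changes the first three elements of arr itself.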

I can't tell you whether this will be faster than your other method, but it may give you some ideas for how to get the memoryview version working.
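
Whether the memoryview route beats plain NumPy assignment is easiest to settle by measuring. The sketch below times both on a small array, using a bytearray as a stand-in for the Plasma buffer. The shape and the helper names copy_numpy, copy_memoryview and time_copy are illustrative assumptions, and timings on a 100 GB array may look quite different.

```python
import time

import numpy as np


def copy_numpy(dst_buf, src):
    # The approach from the question: NumPy view of the buffer, then assign.
    dst = np.frombuffer(dst_buf, dtype=src.dtype).reshape(src.shape)
    dst[:] = src


def copy_memoryview(dst_buf, src):
    # Flatten both sides to 1D bytes, then use memoryview slice assignment.
    # Requires src to be C-contiguous.
    memoryview(dst_buf).cast("B")[:] = memoryview(src).cast("B")


def time_copy(copy_fn, src):
    # Allocate a fresh destination, time one copy, return (seconds, buffer).
    buf = bytearray(src.nbytes)
    start = time.perf_counter()
    copy_fn(buf, src)
    return time.perf_counter() - start, buf


src = np.random.rand(2000, 213) > 0.5
t_numpy, buf_numpy = time_copy(copy_numpy, src)
t_mem, buf_mem = time_copy(copy_memoryview, src)
print(f"numpy assignment: {t_numpy:.6f} s, memoryview assignment: {t_mem:.6f} s")
```

A single run on a small array is noisy, so repeating the measurement at a realistic size is the more meaningful test.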
