How to share memory from an HDF5 dataset with a NumPy ndarray

Question

I am writing an application for streaming data from a sensor, and then processing the data in various ways. These processing components include visualizing the data, some number crunching (linear algebra), and also writing the data to disk in an HDF5 format. Ideally each of these components will be its own module, all run in the same Python process so that IPC is not an issue. This leads me to the question of how to efficiently store the streaming data.

The datasets are quite large (~5Gb), and so I would like to minimize the number of copies of the data in memory by sharing it between the components that need access. If all components used straight ndarrays, then this should be straightforward: give one of the processes the data, then give everyone else a copy using ndarray.view().

However, the component writing data to disk stores the data in an HDF5 Dataset. These are interoperable with ndarrays in lots of ways, but creating a view() doesn't appear to work the way it does with ndarrays.

For reference, views behave as expected with ndarrays:

>>> import numpy as np
>>> source = np.zeros((10,))
>>> view = source.view()
>>> source[0] = 1
>>> view[0] == 1
True
>>> view.base is source
True

However, this doesn't work with HDF5 Datasets:

>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source_dset = file.create_dataset('source', (10,), dtype=np.float64)
>>> view_dset = source_dset.value.view()
>>> source_dset[0] = 1
>>> view_dset[0] == 1
False
>>> view_dset.base is source_dset.value
False

Assigning the Dataset.value itself, rather than a view of it, doesn't work either:

>>> view_dset = source_dset.value
>>> source_dset[0] = 2
>>> view_dset[0] == 2
False
>>> view_dset.base is source_dset.value
False

So my question is this: Is there a way to have an ndarray share memory with an HDF5 Dataset, just as two ndarrays can share memory?

My guess is that this is unlikely to work, probably because of some subtlety in how HDF5 stores arrays in memory. But it is a bit confusing to me, especially since type(source_dset.value) == numpy.ndarray and yet the OWNDATA flag of Dataset.value.view() is actually False. Who owns the memory that the view is interpreting?
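
One way to probe this (reusing source_dset from the session above; the output is what I would expect, so treat it as a sketch): .value appears to materialize a brand-new ndarray on every call, and it is that temporary array, not the Dataset, that a view's base points at.

>>> arr = source_dset.value        # reads the dataset into a fresh ndarray
>>> arr.flags['OWNDATA']
True
>>> v = arr.view()                 # the view borrows arr's buffer
>>> v.flags['OWNDATA']
False
>>> v.base is arr                  # arr, not the Dataset, owns the memory
True
>>> source_dset.value is arr       # every .value call builds a new array
False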

Version details: Python 3, NumPy version 1.9.1, h5py version 2.3.1, HDF5 version 1.8.13, Linux.

Other details: the HDF5 file is chunked.
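
For context, a chunked, resizable dataset of the kind used for streaming might be declared as below (a minimal sketch; the file name, dataset name, and sizes are hypothetical):

>>> stream_file = h5py.File('stream.h5', 'a')
>>> stream = stream_file.create_dataset('sensor', shape=(4096,),
...                                     maxshape=(None,), chunks=(1024,),
...                                     dtype=np.float64)
>>> stream.resize((8192,))         # a chunked layout allows the dataset to grow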

Edit:

After playing around with this a bit more, it seems like one possible solution is to give other components a reference to the HDF5 Dataset itself. This doesn't seem to copy any memory (at least not according to top), and changes in the source Dataset are reflected in the view.

>>> import h5py
>>> file = h5py.File('source.h5', 'a')
>>> source = file.create_dataset('source', (10,), dtype=np.float64)
>>> class Container():
...     def __init__(self, source_dset):
...         self.dset = source_dset
...
>>> container = Container(source)
>>> source[0] = 1
>>> container.dset[0] == 1
True

I'm reasonably happy with this solution (as long as the memory savings pan out), but I'm still curious why the view approach above doesn't work.

Answer

The short answer is that you can't share memory between a numpy array and an h5py dataset. While they have a similar API (at least when it comes to indexing), they don't have a compatible memory layout. In fact, apart from some caching, the dataset isn't even in memory; it lives in the file.
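
In practice this means an explicit read-modify-write cycle rather than a shared buffer. A minimal sketch, using the source_dset from the question:

>>> chunk = source_dset[0:5]       # reading copies file data into a new ndarray
>>> chunk *= 2                     # this touches only the in-memory copy
>>> source_dset[0:5] = chunk       # writing back is a separate, explicit step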

First, I don't see why you need to use source.view() with a numpy array. Yes, when selecting from an array or reshaping it, numpy tries to return a view rather than a copy. But most (all?) examples of .view involve some sort of transformation, such as reinterpreting the dtype. Can you point to a code or documentation example that uses a bare .view()?
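
For instance, the typical documented use of .view is to reinterpret an existing buffer under a different dtype, not to mirror a file-backed object:

>>> a = np.zeros(4, dtype=np.float64)
>>> a.view(np.int64)               # same 32 bytes, reinterpreted as integers
array([0, 0, 0, 0])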

I don't have much experience with h5py, but its documentation describes it as a thin, ndarray-like wrapper around HDF5 file objects. Your Dataset is not an ndarray; for example, it lacks many ndarray methods, including view.
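
This is easy to check against the source_dset from the question (output shown as I would expect it):

>>> hasattr(source_dset, 'view')
False
>>> isinstance(source_dset, np.ndarray)
False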

But indexing a Dataset returns an ndarray, e.g. view_dset[:]. So does .value. The first part of its documentation (via view_dset.value?? in IPython):

Type:            property
String form:     <property object at 0xb4ee37d4>
Docstring:       Alias for dataset[()] 
...
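
So .value is just a read: each access to it (or to dataset[()]) materializes a fresh ndarray holding a copy of the file's data. For example:

>>> arr = source_dset[()]          # equivalent to source_dset.value
>>> type(arr)
<class 'numpy.ndarray'>
>>> np.array_equal(arr, source_dset[:])
True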

Note that when you assign new values to the Dataset, you have to index source_dset directly. Indexing .value does not write through: it only modifies the temporary in-memory array; it doesn't change the file object.
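
A minimal illustration (assuming the session from the question, where source_dset[0] was last set to 2):

>>> source_dset.value[0] = 99      # writes into a throwaway temporary array
>>> source_dset[0]                 # the file object is untouched
2.0
>>> source_dset[0] = 99            # indexing the Dataset itself writes through
>>> source_dset[0]
99.0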

And creating a dataset from an array does not link them any tighter:

x = np.arange(10)
xdset = file.create_dataset('x', data=x)   # the data is copied into the file
x1 = xdset[:]                              # reading it back makes another copy

x, xdset, and x1 are all independent: changing one does not change the others.
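
A quick check makes the independence concrete (a hypothetical continuation of the snippet above):

>>> x[0] = 99
>>> xdset[0]                       # the dataset keeps its own copy in the file
0
>>> x1[0]                          # and x1 is a third, independent buffer
0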

As for timing, compare:

timeit np.sum(x)      # 11.7 µs
timeit np.sum(xdset)  # 203 µs
timeit xdset.value    # 173 µs
timeit np.sum(x1)     # same as for x

Summing the array is much faster than summing the dataset; most of the extra time goes into creating an in-memory array from the dataset.
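
The practical upshot, sketched here, is to pay the read cost once when repeated computation is needed:

>>> data = xdset[:]                # one file read, one in-memory ndarray
>>> np.sum(data)                   # subsequent operations run at ndarray speed
45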
