How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?
Question
I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
import h5py as h5
import numpy as np

files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? As far as this view is concerned, slicing and indexing it would behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
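The custom wrapper mentioned above can be sketched roughly as follows. This is a minimal sketch, not an existing library: the class name `LazyConcat` is made up for illustration, it assumes all parts share the same trailing shape, and it only supports contiguous (step-1) slices on the first axis.

```python
import numpy as np

class LazyConcat:
    """Read-only stand-in for np.concatenate(parts, axis=0) that copies
    only the rows a slice actually touches. Works with any objects that
    expose .shape and numpy-style slicing (h5py datasets, np.memmap, ...)."""

    def __init__(self, parts):
        self.parts = list(parts)
        lengths = [p.shape[0] for p in self.parts]
        self.ends = np.cumsum(lengths)          # cumulative end offsets on axis 0
        self.starts = self.ends - lengths       # start offset of each part
        self.shape = (int(self.ends[-1]),) + tuple(self.parts[0].shape[1:])

    def __getitem__(self, key):
        # Split the key into the first-axis part and the rest of the axes.
        first, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())
        if not isinstance(first, slice):
            raise TypeError("only slices are supported on the first axis")
        lo, hi, step = first.indices(self.shape[0])
        if step != 1:
            raise ValueError("only step-1 slices are supported")
        pieces = []
        for part, s, e in zip(self.parts, self.starts, self.ends):
            a, b = max(lo, s), min(hi, e)
            if a < b:  # this part overlaps the requested range
                pieces.append(part[(slice(a - s, b - s),) + rest])
        if not pieces:
            return np.empty((0,) + self.shape[1:])
        return np.concatenate(pieces, axis=0)

# Demo with plain ndarrays standing in for h5py datasets:
x = np.arange(12).reshape(4, 3)
y = np.arange(12, 24).reshape(4, 3)
lc = LazyConcat([x, y])
ref = np.concatenate([x, y], axis=0)
```

With h5py datasets in `parts`, each `part[...]` indexing step reads only the requested rows from disk, so memory usage is bounded by the chunk being requested rather than by the full concatenation.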
Answer
flist = [f['data'] for f in files]

is a list of dataset objects. The actual data stays in the h5 files and is accessible as long as those files remain open.
When you do

arr = np.concatenate(flist, axis=0)

I imagine that concatenate first does

temp = [np.asarray(a) for a in flist]

that is, constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with

arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the file and is accessed only when you do something more.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
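That equivalence is easy to check with plain ndarrays standing in for the datasets (and plain slicing in place of `.value`, which was removed in h5py 3.0): slicing the trailing axis of each part and concatenating gives the same result as slicing the full concatenation.

```python
import numpy as np

# Two stand-in "datasets" with the same trailing shape (4, 12).
parts = [np.arange(240).reshape(5, 4, 12),
         np.arange(240, 576).reshape(7, 4, 12)]

# Eager, full concatenation (what np.concatenate(flist) would build).
arr = np.concatenate(parts, axis=0)

# Slice each part first, then concatenate: only the sliced portion
# of each part ever needs to be loaded into memory.
sliced = np.concatenate([a[:, :, :10] for a in parts], axis=0)
```

Because slicing happens per part before concatenation, peak memory is the size of the slices, not of the full datasets.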
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory; it is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions, as I demonstrated in the previous answer, but at an access-time cost.