Chain datasets from multiple HDF5 files/datasets


Problem description



The benefits and simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns but different numbers of rows, i.e. (A, N), (B, N), (C, N), etc.

I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).

For this purpose, h5py.Link classes do not help, as they work at the level of HDF5 nodes.

Here is some pseudocode:

import numpy as np
import h5py
a = h5py.Dataset('a',data=np.random.random((100, 50)))
b = h5py.Dataset('b',data=np.random.random((300, 50)))
c = h5py.Dataset('c',data=np.random.random((253, 50)))

# I want to view these arrays as a single array
combined = magic_array_linker([a, b, c], axis=0)  # axis 0 matches the row-wise stacking asserted below
assert combined.shape == (100+300+253, 50)

For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this on the numpy level, but I don't find any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
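
The copying behaviour is easy to confirm at the numpy level: np.concatenate always allocates a fresh buffer, so the result never shares memory with its inputs. A quick check using np.shares_memory:

```python
import numpy as np

a = np.random.random((100, 50))
b = np.random.random((300, 50))

# concatenate allocates a new buffer; neither input is shared with the result
combined = np.concatenate([a, b], axis=0)
assert combined.shape == (400, 50)
assert not np.shares_memory(combined, a)
assert not np.shares_memory(combined, b)
```

This is why a view-like solution has to come from a wrapper (or from HDF5 itself), not from numpy.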

Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?

Solution

First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.

Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.

As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.

class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # Normalise to a list so the indices can be modified;
        # a bare integer index arrives as a scalar, not a tuple
        if not isinstance(items, tuple):
            items = (items,)
        items = list(items)

        for item in items:
            if hasattr(item, 'start'):
                # items is a slice object
                raise ValueError('Slices not implemented')

        for ref in self.references:
            size = self.file[ref].shape[self.axis]

            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size

        return self.file[item_ref][tuple(items)]

Here's how you use it:

with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a',data=np.random.random((100, 50)))
    b = f.create_dataset('b',data=np.random.random((300, 50)))
    c = f.create_dataset('c',data=np.random.random((253, 50)))

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)

    for i, key in enumerate([a, b, c]):
        ref_dataset[i] = key.ref

with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    # Global row 104 falls in 'b' (104 - 100 = 4), so both prints show the same value
    print(foo[104, 4])
    print(f['b'][4, 4])
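
The index arithmetic inside __getitem__ can be checked in isolation with plain Python lists. This is a minimal sketch; locate is a hypothetical helper for illustration, not part of the class above:

```python
def locate(index, sizes):
    """Map a global index along the stacking axis to
    (subarray_number, local_index), mirroring the
    subtract-and-break loop in MagicArray.__getitem__."""
    for i, size in enumerate(sizes):
        if index < size:
            return i, index
        index -= size
    raise IndexError('index out of range')

# Row 104 of the (100, 300, 253) stack is row 4 of subarray 1 ('b')
assert locate(104, [100, 300, 253]) == (1, 4)
assert locate(0, [100, 300, 253]) == (0, 0)
assert locate(652, [100, 300, 253]) == (2, 252)
```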


This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.

You might be able to subclass from numpy.ndarray and get all the usual methods as well.
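
As an aside, newer h5py releases (2.9 and later) expose HDF5 Virtual Datasets, which address this exact stitching problem natively: the combined dataset is a view whose reads are forwarded to the source datasets, with no array data duplicated on disk. A sketch under that version assumption; the filename test_vds.h5 is illustrative:

```python
import numpy as np
import h5py

# Recreate the three source datasets from the answer above
with h5py.File('test_vds.h5', 'w') as f:
    for name, rows in [('a', 100), ('b', 300), ('c', 253)]:
        f.create_dataset(name, data=np.random.random((rows, 50)))

# Describe how the sources stack into one (653, 50) virtual dataset;
# this records a mapping only -- no array data is copied
layout = h5py.VirtualLayout(shape=(653, 50), dtype='f8')
start = 0
for name, rows in [('a', 100), ('b', 300), ('c', 253)]:
    layout[start:start + rows] = h5py.VirtualSource('test_vds.h5', name, shape=(rows, 50))
    start += rows

with h5py.File('test_vds.h5', 'a') as f:
    f.create_virtual_dataset('combined', layout)

with h5py.File('test_vds.h5', 'r') as f:
    assert f['combined'].shape == (653, 50)
    # Global row 104 is row 4 of 'b', exactly as with MagicArray
    assert f['combined'][104, 4] == f['b'][4, 4]
```

Unlike the reference-array wrapper, the virtual dataset supports full slicing for free, since HDF5 itself resolves the mapping.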
