Incremental writes to hdf5 with h5py


Problem description

I have got a question about how best to write to hdf5 files with python / h5py.

I have data that looks like this:

-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
| 178       | 10       | 12       | ...
-----------------------------------------
| 179       | 12       | 11       | ...
-----------------------------------------
| 185       | 9        | 12       | ...
-----------------------------------------
| 187       | 15       | 12       | ...
                    ...

with about 10^4 columns, and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100GB with 1 byte ints).

With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.

I think a good hdf5 structure would thus be to have each column in the table above be a hdf5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The hdf5 structure isn't yet defined though, so it can be anything.
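That per-column layout can be sketched in a few lines. This is illustrative only (the file path, dataset names, and sizes are hypothetical, and it uses one resizable 1-D dataset per column rather than groups, since a plain dataset is enough to hold a single column):

```python
import h5py

# Hypothetical per-column layout: one resizable 1-D dataset per column,
# so a read only ever touches the columns it needs.
with h5py.File("/tmp/layout_demo.h5", "w") as f:
    for name in ("timepoint", "voltage1", "voltage2"):
        f.create_dataset(name, shape=(0,), maxshape=(None,),
                         dtype="i1", chunks=(10**4,))
    print(sorted(f.keys()))  # ['timepoint', 'voltage1', 'voltage2']
```

With this layout, the typical read case (grab two columns and compute) would open the file and slice just those two datasets into memory, leaving the other ~10^4 columns on disk.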

Now the question: I receive the data ~10^4 rows at a time (and not exactly the same numbers of rows each time), and need to write it incrementally to the hdf5 file. How do I write that file?

I'm considering Python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.

dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))

and then when another block of 10^4 rows arrives, replace the dataset?

Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).

I can bail on hdf5 if it's not the right tool for the job too, though I think once the awkward writes are done, it'll be wonderful.

Recommended answer

Per the FAQ, you can expand the dataset using dset.resize. For example,

import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)  # start from a clean file on repeated runs

with h5py.File(path, "a") as f:
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)
    # (100000,)

    # Grow the dataset by 10**4 rows at a time and fill the new tail.
    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)
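The resize pattern above also handles the question's variable-sized blocks directly: grow by however many rows arrived, then write them at the tail. A sketch of that (the `append_rows` helper and the file path are hypothetical, not part of the answer):

```python
import h5py
import numpy as np

def append_rows(dset, block):
    """Grow a resizable 1-D dataset by len(block) and write the new rows."""
    n = dset.shape[0]
    dset.resize(n + len(block), axis=0)
    dset[n:] = block

with h5py.File("/tmp/append_demo.h5", "w") as f:
    dset = f.create_dataset("voltage284", shape=(0,), maxshape=(None,),
                            dtype="i8", chunks=(10**4,))
    # Blocks of differing sizes, matching "not exactly the same
    # number of rows each time" from the question.
    for size in (10**4, 8000, 12000):
        append_rows(dset, np.arange(size))
    print(dset.shape)  # (30000,)
```

Starting from `shape=(0,)` also sidesteps needing the final row count up front: the dataset simply grows to whatever total arrives.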
