Incremental writes to hdf5 with h5py


Problem description

I have got a question about how best to write to hdf5 files with python / h5py.

I have data that looks like this:

-----------------------------------------
| timepoint | voltage1 | voltage2 | ...
-----------------------------------------
| 178       | 10       | 12       | ...
-----------------------------------------
| 179       | 12       | 11       | ...
-----------------------------------------
| 185       | 9        | 12       | ...
-----------------------------------------
| 187       | 15       | 12       | ...
                    ...

with about 10^4 columns, and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100GB with 1 byte ints).

With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.

I think a good hdf5 structure would thus be to have each column in the table above be a hdf5 group, resulting in 10^4 groups. That way we won't need to read all the data into memory, yes? The hdf5 structure isn't yet defined though, so it can be anything.
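That per-column layout can be sketched in a few lines. This is illustrative only (the file path, dataset names, and sizes are hypothetical, and it uses one resizable 1-D dataset per column rather than groups, since a plain dataset is enough to hold a single column):

```python
import h5py

# Hypothetical per-column layout: one resizable 1-D dataset per column,
# so a read only ever touches the columns it needs.
with h5py.File("/tmp/layout_demo.h5", "w") as f:
    for name in ("timepoint", "voltage1", "voltage2"):
        f.create_dataset(name, shape=(0,), maxshape=(None,),
                         dtype="i1", chunks=(10**4,))
    print(sorted(f.keys()))  # ['timepoint', 'voltage1', 'voltage2']
```

With this layout, the typical read case (grab two columns and compute) would open the file and slice just those two datasets into memory, leaving the other ~10^4 columns on disk.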

Now the question: I receive the data ~10^4 rows at a time (and not exactly the same numbers of rows each time), and need to write it incrementally to the hdf5 file. How do I write that file?

I'm considering Python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.

dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))

and then when another block of 10^4 rows arrives, replace the dataset?

Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).

I can bail on hdf5 if it's not the right tool for the job too, though I think once the awkward writes are done, it'll be wonderful.

Recommended answer

Per the FAQ, you can expand the dataset using dset.resize. For example,

import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):
    os.remove(path)  # start from a clean file on repeated runs

with h5py.File(path, "a") as f:
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    dset[:] = np.random.random(dset.shape)
    print(dset.shape)
    # (100000,)

    # Grow the dataset by 10**4 rows at a time and fill the new tail.
    for i in range(3):
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.random(10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)
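The resize pattern above also handles the question's variable-sized blocks directly: grow by however many rows arrived, then write them at the tail. A sketch of that (the `append_rows` helper and the file path are hypothetical, not part of the answer):

```python
import h5py
import numpy as np

def append_rows(dset, block):
    """Grow a resizable 1-D dataset by len(block) and write the new rows."""
    n = dset.shape[0]
    dset.resize(n + len(block), axis=0)
    dset[n:] = block

with h5py.File("/tmp/append_demo.h5", "w") as f:
    dset = f.create_dataset("voltage284", shape=(0,), maxshape=(None,),
                            dtype="i8", chunks=(10**4,))
    # Blocks of differing sizes, matching "not exactly the same
    # number of rows each time" from the question.
    for size in (10**4, 8000, 12000):
        append_rows(dset, np.arange(size))
    print(dset.shape)  # (30000,)
```

Starting from `shape=(0,)` also sidesteps needing the final row count up front: the dataset simply grows to whatever total arrives.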
