Concatenate a large number of HDF5 files
Question
I have about 500 HDF5 files, each of about 1.5 GB.
Each file has exactly the same structure: 7 compound (int, double, double) datasets and a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
- creates an HDF5 file with the right datasets of unlimited maximum size
- opens all the files in sequence
- checks the number of samples (as it is variable)
- resizes the global file
- appends the data
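The steps above can be sketched with h5py as follows. The dataset names and the field names of the compound dtype are placeholders, since the post does not give the real ones:

```python
import h5py
import numpy as np

# Compound dtype matching the post's (int, double, double) records;
# the field names "i", "a", "b" are placeholders.
sample_dtype = np.dtype([("i", "i4"), ("a", "f8"), ("b", "f8")])

def concat_naive(in_paths, out_path, dataset_names):
    """Concatenate datasets by resizing the output once per input file."""
    with h5py.File(out_path, "w") as out:
        # Unlimited max shape so each dataset can keep growing.
        for name in dataset_names:
            out.create_dataset(name, shape=(0,), maxshape=(None,),
                               dtype=sample_dtype, chunks=True)
        for path in in_paths:
            with h5py.File(path, "r") as src:
                for name in dataset_names:
                    data = src[name][...]            # read the whole dataset
                    ds = out[name]
                    n = ds.shape[0]
                    ds.resize((n + data.shape[0],))  # one resize per file
                    ds[n:] = data
```

The repeated `resize` calls on a growing file are the expensive part of this loop.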
This obviously takes many hours; would you have any suggestions for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
Answer
I found that most of the time was spent resizing the file, since I was resizing at each step, so now I first go through all my files and get their lengths (which are variable).
Then I create the global h5file, setting its total length to the sum over all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, whereas before it was taking much longer.
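The two-pass scheme described above can be sketched as follows; the dataset names and the compound dtype fields are again placeholders, not the poster's actual code:

```python
import h5py
import numpy as np

# Placeholder compound dtype for the (int, double, double) records.
sample_dtype = np.dtype([("i", "i4"), ("a", "f8"), ("b", "f8")])

def concat_two_pass(in_paths, out_path, dataset_names):
    """Pass 1: read only the lengths. Pass 2: copy each file's data
    into a fully preallocated output, so no resize is ever needed."""
    totals = {name: 0 for name in dataset_names}
    for path in in_paths:
        with h5py.File(path, "r") as src:
            for name in dataset_names:
                totals[name] += src[name].shape[0]
    with h5py.File(out_path, "w") as out:
        # Allocate every dataset at its final size up front.
        for name in dataset_names:
            out.create_dataset(name, shape=(totals[name],),
                               dtype=sample_dtype, chunks=True)
        offsets = {name: 0 for name in dataset_names}
        for path in in_paths:
            with h5py.File(path, "r") as src:
                for name in dataset_names:
                    data = src[name][...]
                    off = offsets[name]
                    out[name][off:off + data.shape[0]] = data
                    offsets[name] = off + data.shape[0]
```

The first pass only touches metadata (dataset shapes), so it is cheap even over hundreds of files; all the I/O cost is in the single copy pass.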