Inserting Many HDF5 Datasets Very Slow

Problem description

There is a dramatic slowdown when inserting many datasets into a group.

I have found that the slowdown point is proportional to the length of the name and the number of datasets. A larger dataset does take a bit longer to insert, but it does not affect when the slowdown occurs.

The following example exaggerates the length of the name just to illustrate the point without waiting a long time.

  • Python 3
  • HDF5 version 1.8.15 (1.10.1 is even slower)
  • h5py version: 2.6.0

Example:

import numpy as np
import h5py
import time

# mode 'w' is given explicitly because newer h5py releases default to read-only mode
hdf = h5py.File('dummy.h5', 'w', driver='core', backing_store=False)
group = hdf.create_group('some_group')

dtype = np.dtype([
    ('name', 'S20'),  # same fixed-length bytes type as 'a20'; the 'a' code is deprecated in NumPy 2
    ('x', 'f8'),
    ('y', 'f8'),
    ('count', 'u8'),
])
ds = np.array([('something', 123.4, 567.8, 20)], dtype=dtype)

long_name = 'abcdefghijklmnopqrstuvwxyz'*50

t = time.time()
size = 1000*25
for i in range(1, size + 1):
    group.create_dataset(
        long_name+str(i),
        (len(ds),),
        maxshape=(None,),
        chunks=True,
        compression='gzip',
        compression_opts=9,
        shuffle=True,
        fletcher32=True,
        dtype=dtype,
        data=ds
    )
    if i % 1000 == 0:
        dt = time.time() - t
        t = time.time()
        print('{0} / {1} -  Rate: {2:.1f} inserts per second'.format(i, size, 1000/dt))

hdf.close()

Output:

1000 / 25000 -  Rate: 1590.9 inserts per second
2000 / 25000 -  Rate: 1770.0 inserts per second
...
17000 / 25000 -  Rate: 1724.7 inserts per second
18000 / 25000 -  Rate: 106.3 inserts per second
19000 / 25000 -  Rate: 66.9 inserts per second
20000 / 25000 -  Rate: 66.9 inserts per second
21000 / 25000 -  Rate: 67.5 inserts per second
22000 / 25000 -  Rate: 68.4 inserts per second
23000 / 25000 -  Rate: 47.7 inserts per second
24000 / 25000 -  Rate: 42.0 inserts per second
25000 / 25000 -  Rate: 39.8 inserts per second

Again, I exaggerated the length of the name just to reproduce the issue quickly. In my problem the length of the name is about 25 characters and the slowdown point occurs after ~700k datasets are in a group. After ~1.4M datasets it gets even slower.

Why does this happen?

Is there a solution/remedy?

Recommended answer

Try using libver='latest' when you open the file. Recent versions of the library vastly improved the speed for adding items to a group, but for compatibility reasons this is only enabled with the above option.
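
A minimal sketch of that suggestion, assuming h5py and NumPy are installed. The helper name create_many_datasets, the file name libver_demo.h5, and the tiny float array are illustrative choices, not part of the original post. The only change relative to the question's setup is the libver='latest' argument passed to h5py.File. Files written this way may not be readable by older HDF5 releases, which is the compatibility trade-off mentioned above.

```python
import h5py
import numpy as np


def create_many_datasets(n, name_prefix="dataset_", libver="latest"):
    """Create n small datasets in one group of an in-memory file.

    Returns the number of members in the group so the result can be checked.
    """
    data = np.arange(4, dtype="f8")
    # driver='core' with backing_store=False keeps the file in memory only,
    # as in the question; libver='latest' opts in to the newer file format.
    with h5py.File("libver_demo.h5", "w", driver="core",
                   backing_store=False, libver=libver) as hdf:
        group = hdf.create_group("some_group")
        for i in range(n):
            group.create_dataset(name_prefix + str(i), data=data)
        return len(group)


count = create_many_datasets(2000)
print(count)
```

To see whether the option helps for a particular workload, the timing loop from the question can be rerun unchanged with libver='latest' added to the h5py.File call.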
