HDF5 adding numpy arrays slow


Question

This is my first time using HDF5, so could you help me figure out what is wrong and why adding 3D numpy arrays is slow? Preprocessing takes 3 s, while adding one 3D numpy array (100x512x512) takes 30 s and keeps rising with each sample.

First I create the HDF5 file with:

import h5py
import numpy as np

def create_h5(fname_):
  """
  Run only once
  to create h5 file for dicom images
  """
  f = h5py.File(fname_, 'w', libver='latest') 

  dtype_ = h5py.special_dtype(vlen=bytes)  # variable-length string dtype for the patient id datasets


  num_samples_train = 1397
  num_samples_test = 1595 - 1397
  num_slices = 100

  f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), 
    chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('y_train', (num_samples_train,), dtype=np.int16, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('i_train', (num_samples_train,), dtype=dtype_, 
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)          
  f.create_dataset('X_test', (num_samples_test, num_slices, 512, 512), 
    dtype=np.int16, maxshape=(None, None, 512, 512), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('y_test', (num_samples_test,), dtype=np.int16, maxshape=(None, ), chunks=True, 
    compression="gzip", compression_opts=4)
  f.create_dataset('i_test', (num_samples_test,), dtype=dtype_, 
    maxshape=(None, ), 
    chunks=True, compression="gzip", compression_opts=4)

  f.flush()
  f.close()
  print('HDF5 file created')

Then I run the code that updates the HDF5 file:

import os
import time

import h5py
import pandas as pd

# load_scan, get_pixels_hu and select_slices are my preprocessing helpers;
# dicom_fldr, lbl_fldr and h5_fname are paths defined earlier.

num_samples_train = 1397
num_samples_test = 1595 - 1397

lbl = pd.read_csv(lbl_fldr + 'stage1_labels.csv')

patients = os.listdir(dicom_fldr)
patients.sort()

f = h5py.File(h5_fname, 'a') #r+ tried

train_counter = -1
test_counter = -1

for sample in range(0, len(patients)):    

    sw_start = time.time()

    pat_id = patients[sample]
    print('id: %s sample: %d \t train_counter: %d test_counter: %d' %(pat_id, sample, train_counter+1, test_counter+1), flush=True)

    sw_1 = time.time()
    patient = load_scan(dicom_fldr + patients[sample])        
    patient_pixels = get_pixels_hu(patient)       
    patient_pixels = select_slices(patient_pixels)

    if patient_pixels.shape[0] != 100:
        raise ValueError('Slices != 100: ', patient_pixels.shape[0])



    row = lbl.loc[lbl['id'] == pat_id]

    if row.shape[0] > 1:
        raise ValueError('Found duplicate ids: ', row.shape[0])

    print('Time preprocessing: %0.2f' %(time.time() - sw_1), flush=True)



    sw_2 = time.time()
    #found test patient
    if row.shape[0] == 0:
        test_counter += 1

        f['X_test'][test_counter] = patient_pixels
        f['i_test'][test_counter] = pat_id
        f['y_test'][test_counter] = -1


    #found train
    else: 
        train_counter += 1

        f['X_train'][train_counter] = patient_pixels
        f['i_train'][train_counter] = pat_id
        f['y_train'][train_counter] = row.cancer

    print('Time saving: %0.2f' %(time.time() - sw_2), flush=True)

    sw_el = time.time() - sw_start
    sw_rem = sw_el* (len(patients) - sample)
    print('Elapsed: %0.2fs \t rem: %0.2fm %0.2fh ' %(sw_el, sw_rem/60, sw_rem/3600), flush=True)


f.flush()
f.close()

Answer

The slowness is almost certainly due to the compression and chunking. It's hard to get this right. In my past projects I often had to turn off compression because it was too slow, although I have not given up on the idea of compression in HDF5 in general.

First you should try to confirm that compression and chunking are the cause of the performance issues. Turn off chunking and compression (i.e. leave out the chunks=True, compression="gzip", compression_opts=4 parameters) and try again. I suspect it will be a lot faster.
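
A minimal sketch of this test could look like the following (the file name is just a placeholder, and maxshape is dropped because resizable datasets require chunking):

import h5py
import numpy as np

# Create the two image datasets with contiguous layout: no chunking, no filters.
with h5py.File('test_uncompressed.h5', 'w') as f:  # placeholder file name
    f.create_dataset('X_train', (1397, 100, 512, 512), dtype=np.int16)
    f.create_dataset('X_test', (198, 100, 512, 512), dtype=np.int16)

If plain writes like f['X_train'][i] = patient_pixels are fast with this layout, you know chunking and the gzip filter are the bottleneck.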

If you want to use compression you must understand how chunking works, because HDF compresses the data chunk-by-chunk. Google it, but at least read the section on chunking from the h5py docs. The following quote is crucial:

Chunking has performance implications. It's recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.
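
To see how that guideline relates to the data in the question: a single sample is already about 50 MiB of raw int16 pixels, so any chunk that holds whole samples is far above the recommended range.

import numpy as np

# One sample in the question: 100 slices of 512x512 int16 pixels
bytes_per_sample = 100 * 512 * 512 * np.dtype(np.int16).itemsize
print(bytes_per_sample / 2**20)  # 50.0 -> ~50 MiB, well above the 10 KiB - 1 MiB guideline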

By setting chunks=True you let h5py determine the chunk sizes for you automatically (print the chunks property of the dataset to see what they are). Let's say the chunk size in the first dimension (your sample dimension) is 5. This would mean that when you add one sample, the underlying HDF library will read all the chunks that contain that sample from disk (so in total it will read those 5 samples completely). For every chunk, HDF will read it, uncompress it, add the new data, compress it, and write it back to disk. Needless to say, this is slow. This is mitigated by the fact that HDF has a chunk cache, so that uncompressed chunks can reside in memory. However, the chunk cache seems to be rather small (see here), so I think all the chunks are swapped in and out of the cache in every iteration of your for-loop. I couldn't find any setting in h5py to alter the chunk cache size.
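
You can inspect what h5py picked with a small sketch like this (the path is assumed to be the file created by create_h5 in the question):

import h5py

h5_fname = 'stage1.h5'  # hypothetical path; use the file from create_h5
with h5py.File(h5_fname, 'r') as f:
    for name in ('X_train', 'X_test'):
        print(name, 'chunks:', f[name].chunks, 'compression:', f[name].compression)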

You can explicitly set the chunk size by assigning a tuple to the chunks keyword parameter. With all this in mind you can experiment with different chunk sizes. My first experiment would be to set the chunk size in the first (sample) dimension to 1, so that individual samples can be accessed without reading other samples into the cache. Let me know if this helped, I'm curious to know.
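
A sketch of that experiment might look like this (the file name and the one-slice variant are illustrative choices, not something from the question):

import h5py
import numpy as np

num_samples_train = 1397
num_slices = 100

with h5py.File('chunk_experiment.h5', 'w') as f:  # placeholder file name
    # Chunk size 1 in the sample dimension: writing sample i only rewrites
    # the chunks belonging to that sample.
    f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512),
                     dtype=np.int16, maxshape=(None, None, 512, 512),
                     chunks=(1, num_slices, 512, 512),
                     compression="gzip", compression_opts=4)
    # A variant closer to the ~1 MiB guideline would be one slice per chunk:
    # chunks=(1, 1, 512, 512)  -> 512 * 512 * 2 bytes = 512 KiB per chunk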

Even if you find a chunk size that works well for writing the data, it may still be slow when reading, depending on which slices you read. When choosing the chunk size, keep in mind how your application typically reads the data. You may have to adapt your file-creation routines to these chunk sizes (e.g. fill your datasets chunk by chunk). Or you can decide that it's simply not worth the effort and create uncompressed HDF5 files.
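
As a rough illustration of why the read pattern matters, assuming the (1, num_slices, 512, 512) chunks sketched above:

import h5py

with h5py.File('chunk_experiment.h5', 'r') as f:  # placeholder file name
    X = f['X_train']
    one_patient = X[0]      # decompresses only that patient's chunk
    first_slices = X[:, 0]  # touches one chunk per patient, i.e. every chunk in the dataset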

Finally, I would set shuffle=True in the create_dataset calls. This may get you a better compression ratio. It shouldn't influence the performance, however.
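
If you keep gzip, a small sketch of adding the shuffle filter (again with a placeholder file name and the chunk shape from the experiment above) would be:

import h5py
import numpy as np

with h5py.File('shuffled.h5', 'w') as f:  # placeholder file name
    f.create_dataset('X_train', (1397, 100, 512, 512),
                     dtype=np.int16, maxshape=(None, None, 512, 512),
                     chunks=(1, 100, 512, 512),
                     shuffle=True,  # byte-shuffle the data before gzip
                     compression="gzip", compression_opts=4)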
