Saving to hdf5 is very slow (Python freezing)

Problem description

I'm trying to save bottleneck values to a newly created hdf5 file. The bottleneck values come in batches of shape (120, 10, 10, 2048). Saving a single batch alone takes up more than 16 gigs, and Python seems to freeze on that one batch. Based on recent findings (see the update below), it seems that hdf5 taking up a lot of memory is okay, but the freezing part seems to be a glitch.

I'm only trying to save the first 2 batches for test purposes, and only the training data set (once again, this is a test run), but I can't even get past the first batch. It just stalls at the first batch and doesn't loop to the next iteration. If I try to check the hdf5 file, Explorer gets sluggish and Python freezes. If I try to kill Python (even without checking the hdf5 file), Python doesn't close properly and it forces a restart.

Here is the relevant code and data:

There are about 90,000 data points in total, delivered in batches of 120.

Bottleneck shape is (120,10,10,2048)

So the first batch I'm trying to save is (120,10,10,2048)

Here is how I tried to save the dataset:

with h5py.File(hdf5_path, mode='w') as hdf5:
    hdf5.create_dataset("train_bottle", train_shape, np.float32)
    hdf5.create_dataset("train_labels", (len(train.filenames), params['bottle_labels']), np.uint8)
    hdf5.create_dataset("validation_bottle", validation_shape, np.float32)
    hdf5.create_dataset("validation_labels", (len(valid.filenames), params['bottle_labels']), np.uint8)

    # this first part above works fine

    current_iteration = 0
    print('created_datasets')
    for x, y in train:

        number_of_examples = len(train.filenames)  # number of images
        prediction = model.predict(x)
        labels = y
        print(prediction.shape)  # (120, 10, 10, 2048)
        print(y.shape)           # (120, 12)
        print('start', current_iteration * params['batch_size'])       # 0
        print('end', (current_iteration + 1) * params['batch_size'])   # 120

        hdf5["train_bottle"][current_iteration * params['batch_size']: (current_iteration + 1) * params['batch_size'], ...] = prediction
        hdf5["train_labels"][current_iteration * params['batch_size']: (current_iteration + 1) * params['batch_size'], ...] = labels
        current_iteration += 1
        print(current_iteration)
        if current_iteration == 3:
            break

Here is the output of the print statements:

(90827, 10, 10, 2048) # print(train_shape)

(6831, 10, 10, 2048)  # print(validation_shape)
created_datasets
(120, 10, 10, 2048)  # print(prediction.shape)
(120, 12)           #label.shape
start 0             #start of batch
end 120             #end of batch

# Just stalls here instead of printing `print(current_iteration)`

It just stalls here for a while (20+ mins), and the hdf5 file slowly grows in size (around 20 gigs now, before I force kill it). Actually I can't even force kill it with Task Manager; I have to restart the OS to actually kill Python in this case.

After playing around with my code for a bit, there seems to be a strange bug/behavior.

The relevant part is here:

          hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
                hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels

If I run either of these lines, my script will go through the iterations, and automatically break as expected. So there is no freeze if I run either-or. It happens fairly quickly as well -- less than one min.

If I run only the first line ('train_bottle'), my memory usage goes up to about 69-72 gigs, even if it's only a couple of batches. If I try more batches, the memory usage is the same. So I'm assuming train_bottle decides its storage based on the size parameters I'm assigning to the dataset, not on when it actually gets filled. So despite the 72 gigs, it runs fairly quickly (about one minute).

If I run only the second line, train_labels, my memory takes up just a few megabytes. There is no problem with the iterations, and the break statement is executed.

However, here is the problem: if I try to run both lines (which in my case is necessary, as I need to save both 'train_bottle' and 'train_labels'), I experience a freeze on the first iteration, and it doesn't continue to the second iteration, even after 20 mins. The hdf5 file slowly grows, but if I try to access it, Windows Explorer slows down to a crawl and I can't close Python -- I have to restart the OS.

So I'm not sure what the problem is when trying to run both lines -- if I run just the memory-hungry train_bottle line, it works perfectly and finishes within a minute.

Recommended answer

Writing data to HDF5

If you write to a chunked dataset without specifying a chunk shape, h5py will choose one automatically for you. Since h5py can't know how you want to write to or read from the dataset, this will often end up in bad performance.
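As a side note (not from the original answer), you can inspect what h5py picked and override it with the chunks argument; a minimal sketch, with made-up file and dataset names, assuming a batch-wise write pattern along the first axis:

import numpy as np
import h5py

# Hypothetical sketch: compare h5py's automatic chunk shape with an explicit one.
# Chunked storage allocates space lazily, so creating these large datasets is cheap.
with h5py.File('chunks_demo.h5', 'w') as f:
    auto = f.create_dataset('auto_chunked', shape=(90827, 10, 10, 2048),
                            dtype=np.float32, chunks=True)   # let h5py guess
    manual = f.create_dataset('manual_chunked', shape=(90827, 10, 10, 2048),
                              dtype=np.float32, chunks=(10, 10, 10, 2048))
    print(auto.chunks)    # whatever h5py guessed; rarely matches your access pattern
    print(manual.chunks)  # (10, 10, 10, 2048)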

You also use the default chunk-cache size of 1 MB. If you only write to part of a chunk and the chunk doesn't fit in the cache (which is very likely with a 1 MB chunk-cache size), the whole chunk will be read into memory, modified, and written back to disk. If that happens multiple times, you will see performance far below the sequential IO speed of your HDD/SSD.
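(A hedged aside, not part of the original answer: with a recent h5py, version 2.9 or newer, you can raise the chunk cache directly when opening the file instead of using the h5py_cache package shown below.)

import h5py

# Assumption: h5py >= 2.9, which accepts the raw chunk-cache parameters directly.
f = h5py.File('Test.h5', 'w',
              rdcc_nbytes=200 * 1024**2,  # 200 MB chunk cache
              rdcc_nslots=100003)         # hash-table slots for the cache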

In the following example, I assume that you only read or write along the first dimension. If not, this has to be adapted to your needs.

import numpy as np
import tables  # imported only to register the blosc compression filter with HDF5
import h5py as h5
import h5py_cache as h5c
import time

batch_size = 120
train_shape = (90827, 10, 10, 2048)
hdf5_path = 'Test.h5'

# As we are writing whole chunks here, a large chunk cache isn't really needed,
# but if you forget to set a large enough chunk-cache size when NOT writing or
# reading whole chunks, the performance will be extremely bad.
# (Chunks can only be read or written as a whole.)
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2 * 200)  # 200 MB cache size
dset_train_bottle = f.create_dataset(
    "train_bottle", shape=train_shape, dtype=np.float32,
    chunks=(10, 10, 10, 2048),
    compression=32001,                       # Blosc (registered HDF5 filter ID)
    compression_opts=(0, 0, 0, 0, 9, 1, 1),  # level 9, shuffle on, compressor code 1
    shuffle=False)
prediction = np.array(np.arange(120 * 10 * 10 * 2048), np.float32).reshape(120, 10, 10, 2048)

t1 = time.time()
# Testing with 2 GB of data
for i in range(20):
    #prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
    dset_train_bottle[i * batch_size:(i + 1) * batch_size, :, :, :] = prediction

f.close()
print(time.time() - t1)
print("MB/s: " + str(2000 / (time.time() - t1)))

Edit: The data creation in the loop took quite a lot of time, so I now create the data before the time measurement.

This should give at least 900 MB/s throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO speed of your hard disk.

Opening an HDF5 file with the with statement can also lead to bad performance if you make the mistake of entering that block multiple times. Each time, the file is closed and reopened, which throws away the chunk cache.
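A minimal sketch of that mistake versus the fix (write_batches_* and batches are placeholder names, not from the original code):

import h5py

def write_batches_badly(batches):
    # Anti-pattern: re-entering the `with` block for every batch closes and
    # reopens the file each time, throwing away the chunk cache.
    for i, batch in enumerate(batches):
        with h5py.File('Test.h5', 'a') as f:
            f["train_bottle"][i * 120:(i + 1) * 120, ...] = batch

def write_batches_once(batches):
    # Better: open the file once and keep the chunk cache alive for all writes.
    with h5py.File('Test.h5', 'a') as f:
        for i, batch in enumerate(batches):
            f["train_bottle"][i * 120:(i + 1) * 120, ...] = batch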

For determining the right chunk size, I would also recommend these answers: https://stackoverflow.com/a/48405220/4045774 and https://stackoverflow.com/a/44961222/4045774
