Saving to hdf5 is very slow (Python freezing)

Problem description

I'm trying to save bottleneck values to a newly created hdf5 file. The bottleneck values come in batches of shape (120, 10, 10, 2048). Saving a single batch alone takes up more than 16 gigs, and Python seems to freeze on that one batch. Based on recent findings (see the update below), it seems that hdf5 taking up a lot of memory is okay, but the freezing part appears to be a glitch.

I'm only trying to save the first 2 batches for test purposes, and only for the training data set (once again, this is a test run), but I can't even get past the first batch. It just stalls at the first batch and doesn't loop to the next iteration. If I try to inspect the hdf5 file, Explorer gets sluggish and Python freezes. If I try to kill Python (even without inspecting the hdf5 file), Python doesn't close properly and forces a restart.

Here is the relevant code and data:

There are about 90,000 data points in total, delivered in batches of 120.

Bottleneck shape is (120, 10, 10, 2048).

So the first batch I'm trying to save is (120, 10, 10, 2048).

Here is how I tried to save the dataset:

with h5py.File(hdf5_path, mode='w') as hdf5:
    hdf5.create_dataset("train_bottle", train_shape, np.float32)
    hdf5.create_dataset("train_labels", (len(train.filenames), params['bottle_labels']), np.uint8)
    hdf5.create_dataset("validation_bottle", validation_shape, np.float32)
    hdf5.create_dataset("validation_labels", (len(valid.filenames), params['bottle_labels']), np.uint8)

    # this first part above works fine

    current_iteration = 0
    print('created_datasets')
    for x, y in train:

        number_of_examples = len(train.filenames)  # number of images
        prediction = model.predict(x)
        labels = y
        print(prediction.shape)  # (120, 10, 10, 2048)
        print(y.shape)           # (120, 12)
        print('start', current_iteration*params['batch_size'])      # 0
        print('end', (current_iteration+1)*params['batch_size'])    # 120

        hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1)*params['batch_size'], ...] = prediction
        hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1)*params['batch_size'], ...] = labels
        current_iteration += 1
        print(current_iteration)
        if current_iteration == 3:
            break

Here is the output of the print statements:

(90827, 10, 10, 2048) # print(train_shape)

(6831, 10, 10, 2048)  # print(validation_shape)
created_datasets
(120, 10, 10, 2048)  # print(prediction.shape)
(120, 12)           #label.shape
start 0             #start of batch
end 120             #end of batch

# Just stalls here instead of printing `print(current_iteration)`

It just stalls here for a while (20+ minutes), and the hdf5 file slowly grows in size (around 20 gigs now, before I force kill). Actually, I can't even force kill it with Task Manager; I have to restart the OS to actually kill Python in this case.

Update: After playing around with my code for a bit, there seems to be a strange bug/behavior.

The relevant part is here:

          hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
                hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels

If I run either one of these lines on its own, my script goes through the iterations and breaks as expected, so there is no freeze when only one of them runs. It also finishes fairly quickly -- less than a minute.

If I run only the first line ('train_bottle'), my memory usage goes up to about 69-72 gigs, even if it's only a couple of batches. If I try more batches, the memory usage stays the same. So I'm assuming the storage for train_bottle is sized according to the shape parameters I assign to the dataset, not according to how much of it actually gets filled. Despite the 72 gigs, it still runs fairly quickly (about one minute).

If I run only the second line ('train_labels'), my memory takes up just a few megabytes. There is no problem with the iterations, and the break statement is executed.

However, here is the problem: if I try to run both lines (which in my case is necessary, as I need to save both 'train_bottle' and 'train_labels'), I get a freeze on the first iteration, and it doesn't continue to the second iteration even after 20 minutes. The hdf5 file slowly grows, but if I try to access it, Windows Explorer slows to a crawl and I can't close Python -- I have to restart the OS.

So I'm not sure what the problem is when running both lines together, given that running just the memory-hungry 'train_bottle' line works perfectly and finishes within a minute.

Recommended answer

Writing data to HDF5

If you write to a chunked dataset without specifying a chunk shape, h5py will choose one for you automatically. Since h5py can't know how you want to write or read the data, this often ends up giving bad performance.
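
For illustration only, here is a minimal sketch (the file name Test.h5 is mine; the dataset name and shape are taken from the question) of specifying the chunk shape yourself; the layout (10, 10, 10, 2048) matches the full example further below, so that each batch of 120 rows covers whole chunks along the first axis:

import numpy as np
import h5py

train_shape = (90827, 10, 10, 2048)

with h5py.File('Test.h5', 'w') as f:
    # Explicit chunk shape: a batch of 120 rows covers 12 whole chunks along
    # the first axis, so no partial-chunk read-modify-write cycles are needed.
    dset = f.create_dataset("train_bottle", shape=train_shape,
                            dtype=np.float32, chunks=(10, 10, 10, 2048))
    print(dset.chunks)  # (10, 10, 10, 2048)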

You are also using the default chunk-cache size of 1 MB. If you write to only part of a chunk and the chunk doesn't fit into the cache (which is very likely with a 1 MB chunk-cache size), the whole chunk has to be read into memory, modified, and written back to disk. If that happens many times, you will see performance far below the sequential IO speed of your HDD/SSD.
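
As a side note (my own addition, not part of the original answer): recent h5py versions (2.9 and later) expose the chunk-cache settings directly on h5py.File, so the separate h5py_cache package used in the example below is only needed on older versions. A minimal sketch:

import h5py

# Assumption: h5py >= 2.9, which accepts the chunk-cache parameters directly.
# Older h5py versions need the h5py_cache package used in the answer below.
f = h5py.File('Test.h5', 'w',
              rdcc_nbytes=200 * 1024**2,  # 200 MB chunk cache (per dataset)
              rdcc_nslots=1000003)        # number of cache slots; ideally a prime
f.close()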

In the following example I assume that you only read or write along the first dimension. If that's not the case, the code has to be adapted to your needs.

import numpy as np
import tables  # imported only to register the blosc compression filter
import h5py as h5
import h5py_cache as h5c
import time

batch_size = 120
train_shape = (90827, 10, 10, 2048)
hdf5_path = 'Test.h5'
# As we are writing whole chunks here, the large chunk cache isn't really needed.
# But if you forget to set a large enough chunk-cache size when you are NOT
# writing or reading whole chunks, performance will be extremely bad
# (chunks can only be read or written as a whole).
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2 * 200)  # 200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32,
                                     chunks=(10, 10, 10, 2048),
                                     compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1),
                                     shuffle=False)
prediction = np.array(np.arange(120*10*10*2048), np.float32).reshape(120, 10, 10, 2048)
t1 = time.time()
# Testing with 2 GB of data
for i in range(20):
    # prediction = np.array(np.arange(120*10*10*2048), np.float32).reshape(120, 10, 10, 2048)
    dset_train_bottle[i*batch_size:(i+1)*batch_size, :, :, :] = prediction

f.close()
print(time.time() - t1)
print("MB/s: " + str(2000/(time.time() - t1)))

Edit: The data creation in the loop took quite a lot of time, so I now create the data before the time measurement.

This should give at least 900 MB/s of throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO speed of your hard disk.
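
For completeness, a short sketch (my own addition, not from the original answer) of reading the data back batch-wise along the first axis, which is the access pattern the chunk layout above was chosen for; it assumes Test.h5 was written by the script above:

import tables  # registers the blosc filter needed to decompress the chunks
import h5py

batch_size = 120

with h5py.File('Test.h5', 'r') as f:
    dset = f["train_bottle"]
    for start in range(0, dset.shape[0], batch_size):
        batch = dset[start:start + batch_size]  # chunk-aligned read along axis 0
        # ... feed `batch` to the next processing step ...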

Opening the HDF5 file with a with statement can also lead to bad performance if you mistakenly enter that block multiple times. Each entry closes and reopens the file, which throws away the chunk cache.
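
To make that concrete, here is a small self-contained sketch (toy sizes and the file name demo.h5 are mine, not from the original post) contrasting the re-opening anti-pattern with keeping the file open for the whole loop:

import numpy as np
import h5py

batch_size = 10  # toy sizes purely for illustration
batches = [np.random.rand(batch_size, 4).astype(np.float32) for _ in range(5)]

with h5py.File('demo.h5', 'w') as f:
    f.create_dataset("train_bottle", (len(batches) * batch_size, 4),
                     dtype=np.float32, chunks=(batch_size, 4))

# Anti-pattern: re-entering the with block for every batch closes and reopens
# the file, throwing away the chunk cache on each iteration.
for i, batch in enumerate(batches):
    with h5py.File('demo.h5', 'a') as f:
        f["train_bottle"][i * batch_size:(i + 1) * batch_size] = batch

# Better: open once and keep the chunk cache alive for the whole loop.
with h5py.File('demo.h5', 'a') as f:
    dset = f["train_bottle"]
    for i, batch in enumerate(batches):
        dset[i * batch_size:(i + 1) * batch_size] = batch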

For determining the right chunk size, I would also recommend: https://stackoverflow.com/a/48405220/4045774 and https://stackoverflow.com/a/44961222/4045774
