Writing Data to h5py on SSD disk appears slow: What can I do to speed it up


Problem Description


I'm trying to write data to an h5py dataset, using a high-memory 12-core GCE instance to write to an SSD disk, but the job has been running for 13 hours with no end in sight. I'm running a Jupyter Notebook on the GCE instance to unpickle a large number of small files (stored on a second, non-SSD disk) before adding them to an h5py dataset in a file stored on the SSD disk. The dataset has the following properties:

  • Max shape = (29914, 251328)
  • Chunks = (59, 982)
  • Compression = gzip
  • dtype = float64

My code is listed below:

import os
import pickle
import random
import h5py

#Get a sample
minsample = 13300
sampleWithOutReplacement = random.sample(ListOfPickles, minsample)

print(h5pyfile)
with h5py.File(h5pyfile, 'r+') as hf:
    GroupToStore = hf.get('group')
    DatasetToStore = GroupToStore.get('ds1')
    #Unpickle the contents and add them to the h5py dataset
    for idx, files in enumerate(sampleWithOutReplacement):
        #Sample the minimum number of examples
        %time FilePath = os.path.join(SourceOfPickles, files)
        #Use a with-block to auto-close the file
        with open(FilePath, "rb") as f:
            %time DatasetToStore[idx:] = pickle.load(f)
            #print("Processing file ", idx)

print("File Closed")


The h5py file on disk appears to grow by about 1.4 GB for each dataset I populate using the code above. Below is the code I use to create the dataset in the h5py file:

group.create_dataset(labels, dtype='float64', shape=(maxSize, 251328), maxshape=(maxSize, 251328), compression="gzip")
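For illustration only, here is a minimal sketch of creating such a dataset with an explicit chunk shape aligned to row-wise writes, rather than relying on the auto-chosen (59, 982) chunks. Opening h5pyfile in 'w' mode, the chunk shape (1, 251328), and maxSize = 29914 are assumptions, not part of the original code.

import h5py

# Hypothetical sketch: one full row per chunk, so each row-wise write touches a single chunk.
maxSize = 29914          # assumed from the reported max shape
with h5py.File(h5pyfile, 'w') as hf:
    group = hf.create_group('group')
    group.create_dataset('ds1', dtype='float64',
                         shape=(maxSize, 251328),
                         maxshape=(maxSize, 251328),
                         chunks=(1, 251328),
                         compression="gzip")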


What improvements can I make to my configuration, my code, or both to reduce the time needed to populate the h5py file?


Update 1: I added some timing magics to the Jupyter notebook to time the process. I'd welcome any advice on speeding up the loading into the datastore, which was reported as taking 8 hours:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 14.1 µs
CPU times: user 8h 4min 11s, sys: 1min 18s, total: 8h 5min 30s
Wall time: 8h 5min 29s      

Recommended Answer


  1. JRoose is right; something in the code seems to be wrong.


By default, h5py uses a chunk cache of only 1 MB, which isn't enough for your problem. You could change the cache settings through the low-level API or use h5py_cache instead: https://pypi.python.org/pypi/h5py-cache/1.0

Change the line

with h5py.File(h5pyfile, 'r+') as hf

to

with h5py_cache.File(h5pyfile, 'r+', chunk_cache_mem_size=500*1024**2) as hf

to increase the chunk cache, for example, to 500 MB.
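As a side note beyond the original answer: on h5py 2.9 or newer, the chunk cache can also be set directly with keyword arguments on h5py.File, so the extra package is not required. A minimal sketch, reusing h5pyfile from the question (the rdcc_nslots value is an illustrative assumption):

import h5py

# Sketch, assuming h5py >= 2.9: request a 500 MB raw-data chunk cache when opening the file.
with h5py.File(h5pyfile, 'r+', rdcc_nbytes=500*1024**2, rdcc_nslots=100003) as hf:
    DatasetToStore = hf['group/ds1']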


I assume pickle.load(f) results in a 1D array; your dataset is 2D. In that case there is nothing wrong with writing

%time DatasetToStore[idx,:] = pickle.load(f)


but in my experience it is rather slow. To increase the speed, build a 2D array before passing the data to the dataset:

%time DatasetToStore[idx:idx+1,:] = np.expand_dims(pickle.load(f), axis=0)


I don't really know why this is faster, but in my scripts this version is about 20 times faster than the version above. The same goes for reading from an HDF5 file.
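Building on that, here is a further sketch (my addition, not from the original answer) that buffers a block of unpickled rows in memory and issues one 2D write per block instead of one write per row. The rows_per_write value and the buffering loop are illustrative assumptions; minsample, sampleWithOutReplacement, SourceOfPickles, and DatasetToStore are the names from the question, and the loop is assumed to run inside the question's with h5py.File(...) block.

import os
import pickle
import numpy as np

# Assumption: DatasetToStore is the open h5py dataset from the question's with-block.
rows_per_write = 59                      # assumed to match the chunk's first dimension
buffer = np.empty((rows_per_write, 251328), dtype='float64')

for start in range(0, minsample, rows_per_write):
    batch = sampleWithOutReplacement[start:start + rows_per_write]
    # Unpickle a block of rows into the in-memory buffer
    for i, name in enumerate(batch):
        with open(os.path.join(SourceOfPickles, name), 'rb') as f:
            buffer[i, :] = pickle.load(f)
    # One chunk-friendly 2D write per block instead of one write per row
    DatasetToStore[start:start + len(batch), :] = buffer[:len(batch)]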

