h5py not sticking to chunking specification?

Problem description

Problem: I have existing netCDF4 files (about 5000 of them), typically of shape 96x3712x3712 datapoints (float32). The first dimension of these files is time (1 file per day), the second and third are spatial dimensions. Currently, making a slice over the first dimension (even a partial slice) takes a lot of time, for the following reasons:

  • The netCDF files are chunked with a chunk size of 1x3712x3712. Slicing over the time dimension basically means reading the whole file (a minimal sketch for inspecting the chunking of an existing file follows this list).
  • Looping over all of the smaller files (even in multiple processes) also takes a lot of time.
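
As a quick check, the chunking of an existing daily file can be inspected with the netCDF4-python API. A minimal sketch; the file path is only a placeholder and the variable name 'variable' is taken from the code further below:

from netCDF4 import Dataset

nc = Dataset('/data/some_daily_file.nc')   # placeholder path
var = nc.variables['variable']
print(var.shape)       # e.g. (96, 3712, 3712)
print(var.chunking())  # 'contiguous' or a chunk-size list such as [1, 3712, 3712]
nc.close()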

My goal:

  • Create monthly files (about 2900x3712x3712 datapoints)
  • Optimize them for slicing in the time dimension (chunk size of 2900x1x1, or slightly larger in the spatial dimensions)

Other requirements:

  • A file should be appendable with single timestamps (1x3712x3712), and this update process should take less than 15 minutes
  • Queries should be fast enough: a full slice over time (i.e. 2900x1x1) in under one second ==> there is actually not that much data...
  • Preferably the file should be readable by multiple processes while it is being updated
  • Processing the historical data (the other 5000 daily files) should preferably take less than two weeks.

I already tried multiple approaches:

  • Concatenating the netCDF files and repacking them ==> takes too much memory and too much time...
  • Writing them from pandas to an HDF file (using pytables) ==> creates a wide table with a huge index. Because of metadata limitations this ends up taking far too long to read as well, and the dataset needs to be tiled over the spatial dimensions.
  • My last approach is writing them to an HDF5 file using h5py:

Here's the code to create a single monthly file:

import os
import logging

import h5py
import pandas as pd
import numpy as np
from netCDF4 import Dataset, num2date

logger = logging.getLogger(__name__)

def create_h5(fps):
    nodays = 31                                                            # days covered by this monthly file
    timestamps = pd.date_range("20050101", periods=nodays*96, freq='15T')  # reference time axis, 15-minute steps
    output_fp = r'/data/test.h5'
    try:
        f = h5py.File(output_fp, 'a', libver='latest')
        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape=(1,3712,3712), maxshape=(None,3712,3712),
                             dtype='f', compression='gzip', compression_opts=9, chunks=(1,29,29))
        f.swmr_mode = True
        for fp in fps:
            try:
                nc = Dataset(fp)
                times = num2date(nc.variables['time'][:], nc.variables['time'].units)
                indices = np.searchsorted(timestamps, times)            # position of each timestamp on the monthly axis
                for j, time in enumerate(times):
                    logger.debug("File: {}, timestamp: {:%Y%m%d %H:%M}, pos: {}, new_pos: {}".format(
                        os.path.basename(fp), time, j, indices[j]))
                    d.resize((indices[j]+1, shape[1], shape[2]))         # grow the time dimension as needed
                    d[indices[j]] = nc.variables['variable'][j:j+1]      # copy one timestamp (1x3712x3712)
                    f.flush()
            finally:
                nc.close()
    finally:
        f.close()
    return output_fp

I'm using the latest version of HDF5 in order to have the SWMR option. The fps argument is a list of file paths of the daily netCDF4 files. It creates the file in about 2 hours (on an SSD, but I see that creating the file is mainly CPU bound), which is acceptable.
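
For the "readable while being updated" requirement, a reader process can open the same file in SWMR read mode. A minimal sketch, assuming the dataset name 'variable' from the writer above and an HDF5/h5py build with SWMR support:

import h5py

f = h5py.File('/data/test.h5', 'r', libver='latest', swmr=True)
d = f['variable']
d.refresh()      # pick up anything the writer has flushed since the file was opened
print(d.shape)
f.close()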

I have compression set up to keep the file size within limits. I did earlier tests without it and saw that creation is a bit faster without compression, but slicing does not take much longer with it. H5py automatically chunks the dataset in 1x116x116 chunks.
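
The chunk layout that actually ended up in the file, and its uncompressed size in bytes, can be read back from the dataset itself. A small sketch against the file created above:

import numpy as np
import h5py

f = h5py.File('/data/test.h5', 'r')
d = f['variable']
print(d.chunks)                              # the chunk shape HDF5 is using, e.g. (1, 29, 29) here
print(np.prod(d.chunks) * d.dtype.itemsize)  # uncompressed bytes per chunk
f.close()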

Now the problem: slicing on a NAS with a RAID 6 setup takes about 20 seconds to slice the time dimension, even though it is in a single chunk...

I figure that even though the slice is in a single chunk in the file, because I wrote all of the values in a loop, it must be fragmented somehow (I don't know how this process works, though). This is why I tried an h5repack with the HDF5 command-line tools into a new file, with the same chunks, hoping that reordering the values would let the query read them in a more sequential order, but no luck. Even though this process took 6 hours to run, it didn't do a thing for the query speed.
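
As a side note, h5repack can also change the chunk layout while it rewrites the file (its --layout option), rather than only copying it. A sketch, driven from Python for consistency with the rest of the code, with a purely illustrative target chunk shape:

import subprocess

# rechunk the dataset 'variable' while repacking (chunk shape chosen only as an illustration)
subprocess.run(['h5repack', '-l', 'variable:CHUNK=2900x29x29',
                '/data/test.h5', '/data/test_rechunked.h5'], check=True)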

If I do my calculations right, one chunk (2976x32x32) is only a few MB (11 MB uncompressed, and I think only a bit more than 1 MB compressed). How can reading it take so long? What am I doing wrong? I would be glad if someone could shine a light on what is actually going on behind the scenes...
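
For reference, the arithmetic behind that estimate for float32 data:

# 2976 timestamps x 32 x 32 spatial points, 4 bytes per float32 value
print(2976 * 32 * 32 * 4 / 1024.0**2)   # ~11.6 MB uncompressed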

Recommended answer

Influence of chunk size

In a worst-case scenario, reading and writing one chunk can be considered a random read/write operation. The main advantage of an SSD is the speed of reading or writing small chunks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be even much slower than an HDD.

So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690):

Example_1 (chunk size (1,29,29) = 3.4 kB):

import numpy as np
import tables #needed to register the blosc filter (compression=32001)
import h5py as h5
import time
import h5py_cache as h5c

def original_chunk_size():
    File_Name_HDF5='some_Path'
    #Array=np.zeros((1,3712,3712),dtype=np.float32)
    Array=np.random.rand(1,3712,3712)

    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    nodays=1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(1,29,29),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True #SWMR mode has to be enabled after the dataset is created

    #Writing
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']

    t1=time.time() #restart the timer so that only the reading is measured
    for i in range(0,3712,29):
        for j in range(0,3712,29):
            A=np.copy(d[:,i:i+29,j:j+29])

    print(time.time()-t1)

Results (write/read):

SSD: 38s/54s

HDD: 40s/57s

NAS: 252s/823s

In the second example I will use h5py_cache because I want to keep providing the data in (1,3712,3712) slabs, which no longer line up with whole chunks. The default chunk cache size is only 1 MB, so it has to be increased to avoid multiple read/write operations per chunk. https://pypi.python.org/pypi/h5py-cache/1.0
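
(As an aside: if I remember correctly, h5py 2.9 and later can set the chunk cache directly on h5py.File, which removes the need for the separate h5py_cache package. A sketch under that assumption:)

import h5py

# 6 GB raw chunk cache for datasets opened through this file handle
f = h5py.File('some_Path', 'a', libver='latest',
              rdcc_nbytes=6*1024**3, rdcc_nslots=1000003)  # nslots: a large prime, well above the chunk count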

Example_2 (chunk size (96,58,58) = 1.3 MB):

import numpy as np
import tables #needed to register the blosc filter (compression=32001)
import h5py as h5
import time
import h5py_cache as h5c

def modified_chunk_size():
    File_Name_HDF5='some_Path'
    Array=np.random.rand(1,3712,3712)

    f = h5c.File(File_Name_HDF5, 'a',libver='latest', chunk_cache_mem_size=6*1024**3) #6 GB chunk cache
    nodays=1

    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True #SWMR mode has to be enabled after the dataset is created

    #Writing
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5c.File(File_Name_HDF5, 'a',libver='latest', chunk_cache_mem_size=6*1024**3) #6 GB chunk cache
    f.swmr_mode = True
    d=f['variable']

    t1=time.time() #restart the timer so that only the reading is measured
    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])

    print(time.time()-t1)

Results (write/read):

SSD: 10s/16s

HDD: 10s/16s

NAS: 13s/20s

The read/write speed can be further improved by minimizing the API calls (reading and writing larger blocks of chunks).
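
A small sketch of what that means for the read loops above: fetch a block of several chunks per call instead of a single chunk column (the factor of 8 is an arbitrary choice; d is the open dataset from the examples above):

import numpy as np

block = 58*8   # 8x8 chunks of (96,58,58) per read call; 3712 is divisible by 464
for i in range(0, 3712, block):
    for j in range(0, 3712, block):
        A = np.copy(d[:, i:i+block, j:j+block])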

I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (CPU bound); gzip is slower, but provides better compression ratios.

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)

20s/30s file size: 101 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)

50s/58s file size: 87 MB

d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)

50s/60s file size: 64 MB

And now a benchmark for a whole month (30 days). The writing is optimized a bit and is done in slabs of (96,3712,3712).

def modified_chunk_size():
    File_Name_HDF5='some_Path'

    #one day of data: 96 identical (3712,3712) timestamps
    Array_R=np.random.rand(1,3712,3712)
    Array=np.zeros((96,3712,3712),dtype=np.float32)
    for j in range(0,96):
        Array[j,:,:]=Array_R[0]

    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    nodays=30

    shape = 96, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression=32001,compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
    f.swmr_mode = True #SWMR mode has to be enabled after the dataset is created

    #Writing
    t1=time.time()
    for i in range(0,96*nodays,96):
        if d.shape[0] < i+96: #grow the time dimension before writing the next day
            d.resize((i+96,shape[1],shape[2]))
        d[i:i+96,:,:]=Array

    f.close()
    print(time.time()-t1)

    #Reading
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']

    t1=time.time() #restart the timer so that only the reading is measured
    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])

    print(time.time()-t1)

133s/301s with blosc

432s/684s with gzip compression_opts=3

I had the same problems when accessing data on a NAS. I hope this helps...
