Fastest way to write HDF5 files with Python?

Question

Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an HDF5 file with the same content, while keeping the memory usage reasonable?

I'd like to use the h5py module if possible.

In the toy example below, I've found an incredibly slow and incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

import h5py

n = 10000000
f = h5py.File('foo.h5', 'w')
dset = f.create_dataset('int', (n,), 'i')

# this is terribly slow: one HDF5 write call per element
for i in range(n):
    dset[i] = i

# instantaneous: a single broadcast write
dset[...] = 42

f.close()
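For reference, below is a minimal sketch of the middle ground the question asks about: writing in slabs of 10,000 rows via slice assignment, assuming NumPy is available (the step size of 10,000 and the file name are illustrative, not recommendations). Each slice assignment hands h5py a whole array per call instead of a single scalar, which is what separates the two extremes above.

import numpy as np
import h5py

n = 10000000
step = 10000  # rows per write; illustrative only

with h5py.File('foo_chunks.h5', 'w') as f:
    dset = f.create_dataset('int', (n,), 'i')
    # One slice assignment per block: h5py receives a whole
    # NumPy array per call rather than one scalar at a time.
    for start in range(0, n, step):
        stop = min(start + step, n)
        dset[start:stop] = np.arange(start, stop, dtype='i')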

Answer

I would avoid chunking the data and would store the data as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I've been working on into HDF5, and was able to pack about 4.5 billion compound datatypes into 450,000 datasets, each containing an array of 10,000 values. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.

Just a thought!

Update:

These are a couple of snippets lifted from my actual code (I'm coding in C, not Python, but you should get the idea of what I'm doing), modified for clarity. I'm just writing long unsigned integers into arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I'm simply writing a long unsigned integer sequence into a sequence of arrays and loading each array into hdf5 as it is created.

//Our dummy data: a rolling count of long unsigned integers
long unsigned int k = 0UL;
//We'll use this to store our dummy data, 10,000 at a time
long unsigned int kValues[NUMPERDATASET];
//Create the SS data file.
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
//NUMPERDATASET = 10,000, so we get a 1 x 10,000 array
hsize_t dsDim[1] = {NUMPERDATASET};
//Create the data space.
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
//NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i < NUMDATASETS; i++){
    for (unsigned long int j = 0UL; j < NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    //Create the data set, named by its block index
    //(the g_strdup_printf string is leaked for brevity).
    hid_t dssSet = H5Dcreate2(ssdb, g_strdup_printf("%lu", i), H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    //Write data to the data set.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    //Close the data set.
    H5Dclose(dssSet);
}
//Release the data space.
H5Sclose(dSpace);
//Close the data file.
H5Fclose(ssdb);

This is a slightly modified version of my reader code. There are more elegant ways of doing this (e.g., I could use hyperslab selections to read just the value I need), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

unsigned long int getValueByIndex(unsigned long int nnValue){
    //NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    //MAXSSVALUE = 4,500,000,000; i takes the smaller of MAXSSVALUE - 1 and nnValue
    //to avoid an index-out-of-range error
    unsigned long int i = MIN(MAXSSVALUE-1,nnValue);
    //Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    //Open the data set. Each dataset consists of an array of 10,000
    //unsigned long ints and is named by the integer division of i
    //by the number per data set.
    hid_t dSet = H5Dopen(db, g_strdup_printf("%lu", i / NUMPERDATASET), H5P_DEFAULT);
    //Read the data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    //Close the data set.
    H5Dclose(dSet);
    //Close the data file.
    H5Fclose(db);
    //Return the indexed value: the modulus of i by the number per dataset.
    return ssValue[i % NUMPERDATASET];
}

The main take-away is the inner loop in the writing code and the integer division and mod operations that yield the index of the dataset and the index of the desired value within that dataset's array. Let me know if this is clear enough for you to put together something similar or better in h5py. In C, this is dead simple and gives me significantly better read/write times than a chunked-dataset solution did. Plus, since I can't use compression with compound datasets anyway, the apparent upside of chunking is a moot point, so all my compounds are stored the same way.
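For readers who want to stay in Python, here is a minimal h5py sketch of the same scheme; this is my own translation rather than the answerer's code. NUMPERDATASET and the unsigned 64-bit element type are carried over from the C snippets, and the function names are assumptions. One fixed-size dataset is written per block, named by its block index; on read, integer division picks the dataset and the modulus picks the slot.

import numpy as np
import h5py

NUMPERDATASET = 10000  # values per dataset, matching the C code

def write_blocks(path, total):
    """Write `total` consecutive uint64s as total // NUMPERDATASET datasets."""
    with h5py.File(path, 'w') as f:
        for i in range(total // NUMPERDATASET):
            start = i * NUMPERDATASET
            # One contiguous 10,000-value dataset per block,
            # named by its block index, mirroring H5Dcreate2 above.
            f.create_dataset(str(i),
                             data=np.arange(start, start + NUMPERDATASET,
                                            dtype=np.uint64))

def get_value_by_index(path, n):
    """Mirror of getValueByIndex: div picks the dataset, mod picks the slot."""
    with h5py.File(path, 'r') as f:
        return f[str(n // NUMPERDATASET)][n % NUMPERDATASET]

As in the C version, no chunking or compression is involved; every dataset is one contiguous write and one contiguous read.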
