hdf5 and ndarray append / time-efficient approach for large data-sets


Question

Background

I have k n-dimensional time series, each represented as an m x (n+1) array holding float values (n columns plus one that represents the date).

Example:

k (around 4 million) time series that look like

20100101    0.12    0.34    0.45    ...
20100105    0.45    0.43    0.21    ...
...         ...     ...     ... 

Each day, I want to add an additional row for a subset (< k) of the data sets. All data sets are stored in groups in one HDF5 file.

Problem

What is the most time-efficient approach to append the rows to the data sets?

The input is a file that looks like

key1, key2, key3, key4, date, value1, value2, ... 

whereby the date is unique for the particular file and can be ignored. I have around 4 million data sets. The issue is that I have to look up the key, fetch the complete NumPy array, resize the array, add the row, and store the array again. The total size of the HDF5 file is around 100 GB. Any idea how to speed this up? I think we can agree that using SQLite or something similar won't work - once I have all the data, an average data set will have over 1 million elements, times 4 million data sets.
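For clarity, the slow update path described above looks roughly like the following sketch using h5py (the file, group, and key names here are hypothetical): the complete array is read into memory, grown by one row, and written back in full.

```python
# Sketch of the read-resize-rewrite cycle for a single key, using h5py.
# All names ("series.h5", "grp/key1") are made up for illustration.
import numpy as np
import h5py

# Build a toy file holding one small time series.
with h5py.File("series.h5", "w") as f:
    f.create_dataset("grp/key1", data=np.array([[20100101, 0.12, 0.34]]))

# The costly daily update: load everything, append one row, store again.
with h5py.File("series.h5", "a") as f:
    old = f["grp/key1"][...]                       # read the complete array
    new = np.vstack([old, [20100105, 0.45, 0.43]]) # append a single row
    del f["grp/key1"]                              # delete the old dataset
    f.create_dataset("grp/key1", data=new)         # re-store the whole array
```

Doing this per key, per day, across millions of keys is exactly the overhead the question is about.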

Thanks!

Answer

Have you looked at PyTables? It's a hierarchical database built on top of the HDF5 library.

It has several array types, but the "table" type sounds like it would work for your data format. It's basically an on-disk version of a NumPy record array, where each column can have its own data type. Tables have an append method that makes adding rows easy.
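A minimal sketch of this approach with PyTables (assuming the `tables` package is installed; the file name, table name, and column layout are made up to match the example data):

```python
# Sketch: one PyTables Table per series; Table.append() adds rows
# without rewriting the existing data on disk.
import tables

class SeriesRow(tables.IsDescription):
    date = tables.Int32Col()      # e.g. 20100105
    value1 = tables.Float64Col()
    value2 = tables.Float64Col()

with tables.open_file("ts.h5", "w") as f:
    tbl = f.create_table("/", "key1", SeriesRow)
    # Append the daily update as a list of tuples (column order: date, value1, value2).
    tbl.append([(20100105, 0.45, 0.43)])
    tbl.flush()

with tables.open_file("ts.h5", "r") as f:
    nrows = f.root.key1.nrows
```

Because the table grows in place, the daily update no longer requires loading, resizing, and re-storing each array.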

As far as loading the data from CSV files goes, numpy.loadtxt is quite fast. It loads the file into memory as a NumPy record array.
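For example, a daily update file in the "key, date, value1, value2, ..." format from the question can be parsed with a structured dtype (the field names and widths here are assumptions):

```python
# Sketch: parse CSV rows into a NumPy record array with loadtxt.
import io
import numpy as np

csv = io.StringIO("k1,20100105,0.45,0.43\nk2,20100105,0.21,0.11\n")
arr = np.loadtxt(
    csv,
    delimiter=",",
    dtype=[("key", "U8"), ("date", "i4"),
           ("value1", "f8"), ("value2", "f8")],
)
# Columns are then addressable by name, e.g. arr["key"], arr["date"].
```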

