hdf5 and ndarray append / time-efficient approach for large data-sets


Question

Background

I have k n-dimensional time series, each represented as an m x (n+1) array holding float values (n columns plus one that represents the date).

Example:

k (around 4 million) time series that look like

20100101    0.12    0.34    0.45    ...
20100105    0.45    0.43    0.21    ...
...         ...     ...     ... 

Each day, I want to add an additional row for a subset (< k) of the data sets. All data sets are stored in groups in one HDF5 file.

Problem

What is the most time-efficient approach to append the rows to the data sets?

The input is a file that looks like

key1, key2, key3, key4, date, value1, value2, ... 

whereby the date is unique for the particular file and can be ignored. I have around 4 million data sets. The issue is that I have to look up the key, fetch the complete NumPy array, resize the array, add the row, and store the array again. The total size of the HDF5 file is around 100 GB. Any idea how to speed this up? I think we can agree that using SQLite or something similar won't work - once I have all the data, an average data set will have over 1 million elements, times 4 million data sets.
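For clarity, the slow update path described above looks roughly like the following sketch using h5py (the file, group, and key names here are hypothetical): the complete array is read into memory, grown by one row, and written back in full.

```python
# Sketch of the read-resize-rewrite cycle for a single key, using h5py.
# All names ("series.h5", "grp/key1") are made up for illustration.
import numpy as np
import h5py

# Build a toy file holding one small time series.
with h5py.File("series.h5", "w") as f:
    f.create_dataset("grp/key1", data=np.array([[20100101, 0.12, 0.34]]))

# The costly daily update: load everything, append one row, store again.
with h5py.File("series.h5", "a") as f:
    old = f["grp/key1"][...]                       # read the complete array
    new = np.vstack([old, [20100105, 0.45, 0.43]]) # append a single row
    del f["grp/key1"]                              # delete the old dataset
    f.create_dataset("grp/key1", data=new)         # re-store the whole array
```

Doing this per key, per day, across millions of keys is exactly the overhead the question is about.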

Thanks!

Answer

Have you looked at PyTables? It's a hierarchical database built on top of the HDF5 library.

It has several array types, but the "table" type sounds like it would work for your data format. It's basically an on-disk version of a NumPy record array, where each column can have its own data type. Tables have an append method that makes adding rows easy.
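A minimal sketch of this approach with PyTables (assuming the `tables` package is installed; the file name, table name, and column layout are made up to match the example data):

```python
# Sketch: one PyTables Table per series; Table.append() adds rows
# without rewriting the existing data on disk.
import tables

class SeriesRow(tables.IsDescription):
    date = tables.Int32Col()      # e.g. 20100105
    value1 = tables.Float64Col()
    value2 = tables.Float64Col()

with tables.open_file("ts.h5", "w") as f:
    tbl = f.create_table("/", "key1", SeriesRow)
    # Append the daily update as a list of tuples (column order: date, value1, value2).
    tbl.append([(20100105, 0.45, 0.43)])
    tbl.flush()

with tables.open_file("ts.h5", "r") as f:
    nrows = f.root.key1.nrows
```

Because the table grows in place, the daily update no longer requires loading, resizing, and re-storing each array.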

As far as loading the data from CSV files goes, numpy.loadtxt is quite fast. It loads the file into memory as a NumPy record array.
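For example, a daily update file in the "key, date, value1, value2, ..." format from the question can be parsed with a structured dtype (the field names and widths here are assumptions):

```python
# Sketch: parse CSV rows into a NumPy record array with loadtxt.
import io
import numpy as np

csv = io.StringIO("k1,20100105,0.45,0.43\nk2,20100105,0.21,0.11\n")
arr = np.loadtxt(
    csv,
    delimiter=",",
    dtype=[("key", "U8"), ("date", "i4"),
           ("value1", "f8"), ("value2", "f8")],
)
# Columns are then addressable by name, e.g. arr["key"], arr["date"].
```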

