Database or Table Solution for Temporary Numpy Arrays

Question

I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.

In an attempt to be efficient, I would like to perform this fit and the subsequent drawing of numbers only once for a given region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. NoSQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume-building perspective.

After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like the one below where, instead of Float32Col, I use createTimeSeriesTable() from the scikits time series class? Or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below that holds the data? Or is there a better way to approach this problem?

from tables import IsDescription, UInt16Col, Float32Col

class Yield(IsDescription):
    geography_id = UInt16Col()  # identifier of the region
    data = Float32Col(shape=(50, 1))  # for 50 years of data

Any help on the matter would be greatly appreciated.

Recommended answer

What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?

If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.

On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.

Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, you just need fast, simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables stores your data in HDF5 files, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster at loading large chunks of data into memory as numpy arrays.
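A rough sketch of what storing and querying the yield data with PyTables might look like follows. It assumes the tables package is installed and reuses the Yield description from the question, with the data column flattened to shape (50,) for simplicity. The file name, table name, geography ids and filled-in values are all made up for the example.

```python
import os
import tempfile

import numpy as np
import tables


class Yield(tables.IsDescription):
    geography_id = tables.UInt16Col()
    data = tables.Float32Col(shape=(50,))  # 50 years of data per region


with tempfile.TemporaryDirectory() as tmp_dir:
    h5_path = os.path.join(tmp_dir, "yields.h5")  # hypothetical file name

    # Write one row per region; the values here are placeholders.
    with tables.open_file(h5_path, mode="w") as h5:
        table = h5.create_table("/", "yields", Yield)
        row = table.row
        for geo_id in (17, 18, 19):
            row["geography_id"] = geo_id
            row["data"] = np.full(50, float(geo_id), dtype=np.float32)
            row.append()
        table.flush()

    # A simple in-kernel query; the result is a NumPy structured array.
    with tables.open_file(h5_path, mode="r") as h5:
        matches = h5.root.yields.read_where("geography_id == 17")

    region_17 = matches["data"][0]  # plain float32 array of length 50
```

The query returns rows as a structured array, so the selected column can be handed straight to the fitting function without any conversion step.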
