Exporting from/importing to numpy, scipy in SQLite and HDF5 formats


Problem Description

There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables) -- I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these most seamlessly integrates with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).

Solution

Most of it depends on your use case.

I have a lot more experience dealing with the various HDF5-based methods than with traditional relational databases, so I can't comment too much on SQLite libraries for Python...

At least as far as h5py vs pyTables, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
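
To make the "seamless access via numpy arrays" concrete, here is a minimal sketch of the pyTables side (the file and node names are hypothetical): a numpy structured array can be written and read back without any conversion step.

```python
import numpy as np
import tables

# Build a numpy structured (record) array.
records = np.zeros(5, dtype=[("time", "f8"), ("value", "f4")])
records["time"] = np.arange(5.0)
records["value"] = [0.1, 0.5, 0.9, 0.2, 0.7]

with tables.open_file("demo_table.h5", "w") as f:
    # create_table accepts a structured array directly via obj=
    f.create_table("/", "data", obj=records)

with tables.open_file("demo_table.h5", "r") as f:
    back = f.root.data.read()     # comes back as a numpy structured array
    print(back["value"].mean())
```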

If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.

h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
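
One way to see the "vanilla wrapper" point: a file written with nothing but h5py is plain HDF5, so any other HDF5 consumer can open it. A sketch with a hypothetical file name, using pyTables to stand in for "another reader" (h5dump or a C/Fortran HDF5 binding would work the same way):

```python
import numpy as np
import h5py
import tables

# Write a dataset using only h5py -- no library-specific metadata.
with h5py.File("plain.h5", "w") as f:
    f.create_dataset("grid", data=np.arange(12.0).reshape(3, 4))

# Any HDF5 reader can open the result; here pyTables reads it back.
# (pyTables may warn that the file lacks its own metadata, but the
# dataset itself is fully accessible.)
with tables.open_file("plain.h5", "r") as f:
    arr = f.root.grid.read()      # plain numpy array, shape (3, 4)
```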

pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.

To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
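
A sketch of that workflow at toy scale (file and dataset names are hypothetical); passing chunks=True asks h5py to auto-guess the chunk shape, as described above:

```python
import numpy as np
import h5py

with h5py.File("stack.h5", "w") as f:
    # chunks=True lets h5py pick a reasonable chunk shape automatically.
    dset = f.create_dataset("volume", shape=(200, 256, 256),
                            dtype="f4", chunks=True)
    for i in range(200):                  # write one plane at a time
        dset[i] = np.random.rand(256, 256).astype("f4")
    print(dset.chunks)                    # the auto-chosen chunk shape

with h5py.File("stack.h5", "r") as f:
    # An "arbitrary" slice cutting across the fast axes: with a C-ordered
    # memmap these bytes are scattered all over the file, but chunked
    # storage keeps each read reasonably local.
    slab = f["volume"][:, 100, :]         # shape (200, 256)
```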

As a counterexample, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (index-based) chunk of data can be read into memory.)
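
A sketch of that sensor-log pattern (the schema and file name are hypothetical), using pyTables' in-kernel queries plus a column index, two of the features that make this use case fast:

```python
import numpy as np
import tables

class Reading(tables.IsDescription):
    timestamp = tables.Float64Col()   # seconds since epoch
    sensor_id = tables.UInt16Col()
    value     = tables.Float32Col()

with tables.open_file("sensors.h5", "w") as f:
    table = f.create_table(
        "/", "readings", Reading,
        filters=tables.Filters(complevel=5, complib="blosc"))  # compressed on disk
    row = table.row
    for t in range(100_000):              # fake a long run of samples
        row["timestamp"] = 1.0e9 + t
        row["sensor_id"] = t % 12
        row["value"] = np.sin(t / 300.0)
        row.append()
    table.flush()
    table.cols.timestamp.create_index()   # speeds up time-range queries

with tables.open_file("sensors.h5", "r") as f:
    table = f.root.readings
    # In-kernel query: evaluated chunk by chunk without loading the
    # whole table into memory; returns a numpy structured array.
    hits = table.read_where("(sensor_id == 3) & (value > 0.9)")
```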
