Pandas read_hdf query by date and time range


Problem Description

I have a question regarding how to filter results in the pd.read_hdf function. Here's the setup: I have a pandas DataFrame (with an np.datetime64 index) which I put into an HDF5 file. There's nothing fancy going on here, so no use of hierarchy or anything (though maybe I could incorporate it?). Here's an example:

                              Foo          Bar
TIME                                         
2014-07-14 12:02:00            0            0
2014-07-14 12:03:00            0            0
2014-07-14 12:04:00            0            0
2014-07-14 12:05:00            0            0
2014-07-14 12:06:00            0            0
2014-07-15 12:02:00            0            0
2014-07-15 12:03:00            0            0
2014-07-15 12:04:00            0            0
2014-07-15 12:05:00            0            0
2014-07-15 12:06:00            0            0
2014-07-16 12:02:00            0            0
2014-07-16 12:03:00            0            0
2014-07-16 12:04:00            0            0
2014-07-16 12:05:00            0            0
2014-07-16 12:06:00            0            0

Now I store this in a .h5 file with the following:

store = pd.HDFStore('qux.h5')
# generate df (a sketch of this step follows below)
store.append('data', df)  # 'append' writes a table format, which supports 'where' queries
store.close()
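
For reference, a minimal sketch of what the generate df step might look like for the sample frame above; the construction is hypothetical and only reproduces the shape and zero values shown:

import pandas as pd

# Reconstruct the sample frame: three days, 12:02-12:06 each, all zeros
idx = pd.DatetimeIndex(
    ['2014-07-%d 12:%02d:00' % (day, minute)
     for day in (14, 15, 16) for minute in range(2, 7)],
    name='TIME',
)
df = pd.DataFrame(0, index=idx, columns=['Foo', 'Bar'])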

Next, I'll have another process that accesses this data, and I would like to take date/time slices of it. So suppose I want dates between 2014-07-14 and 2014-07-15, and only times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:

pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))

As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. In other words:

This:

pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')

is not the same as this:

pd.read_hdf('qux.h5', 'data')['20140714':'20140715']

While the end result is exactly the same, what's being done in the background is not. So my question is: is there a way to incorporate that time-range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my hdf5 file? Maybe store a table for each day?

Thanks!

Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again I may be mistaken here, but using my example of querying a date/time range, I'd probably incur a performance penalty, since I'd need to read each table and then merge them to get a consolidated output, right? (A sketch of that per-day approach follows.)
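
A hedged sketch of that per-day layout; the key names are hypothetical and follow the naming above:

import pandas as pd

# Hypothetical per-day layout: one table per date, keyed like 'df/date_YYYYMMDD'
keys = ['df/date_20140714', 'df/date_20140715']

with pd.HDFStore('qux.h5') as store:
    pieces = [store.select(k) for k in keys]   # one read per day

# Merge the per-day results, then apply the time-of-day filter
result = pd.concat(pieces).between_time('12:02', '12:04')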

Answer

Use select_column to read just the index, build a time-of-day indexer with indexer_between_time, and pass the resulting positions as where. Here's an example:

In [50]: pd.set_option('max_rows',10)

In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))

In [52]: df
Out[52]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 11:00:00  0.554420  0.777484
2013-02-11 12:00:00 -0.558041  1.833465
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[1000 rows x 2 columns]

In [53]: store = pd.HDFStore('test.h5',mode='w')

In [54]: store.append('df',df)

In [55]: c = store.select_column('df','index')

In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')

In [57]: store.select('df',where=where)
Out[57]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 03:00:00  0.902023  1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[664 rows x 2 columns]

In [58]: store.close()
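
The question also asks for a date restriction on top of the time-of-day window. A minimal sketch combining both into one positional where, assuming the question's 'qux.h5'/'data' table:

import numpy as np
import pandas as pd

with pd.HDFStore('qux.h5') as store:
    # Read only the index column, not the whole table
    idx = pd.DatetimeIndex(store.select_column('data', 'index'))

    # Positions inside the time-of-day window
    time_mask = np.zeros(len(idx), dtype=bool)
    time_mask[idx.indexer_between_time('12:02', '12:04')] = True

    # Positions inside the date range (end bound spelled out to cover the whole day)
    date_mask = (idx >= '2014-07-14') & (idx <= '2014-07-15 23:59:59')

    # select accepts an array of row coordinates as 'where'
    result = store.select('data', where=np.flatnonzero(time_mask & date_mask))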

A couple of points to note. This reads in the entire index to start; usually that is not a burden. If it is, you can read it in chunks by providing start/stop, though it's a bit manual to do this ATM (a sketch follows below). I also don't believe the current select_column can accept a query.
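
A sketch of that chunked index read using start/stop; the chunk size is arbitrary:

import pandas as pd

with pd.HDFStore('qux.h5') as store:
    nrows = store.get_storer('data').nrows   # total rows in the table
    chunksize = 1000000
    # Pull the index column in pieces rather than all at once
    chunks = [store.select_column('data', 'index', start=i, stop=i + chunksize)
              for i in range(0, nrows, chunksize)]

idx = pd.DatetimeIndex(pd.concat(chunks))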

If you have a gargantuan amount of data (tens of millions of rows, and wide), it might be more efficient to iterate over the days and issue individual queries.

Recombining data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance as well); see the sketch below.
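
A minimal sketch of those per-day sub-queries against the single table, recombined with concat, using the dates from the question:

import pandas as pd

# One 'where' query per day against the single table, then recombine
pieces = []
for day in ['2014-07-14', '2014-07-15']:
    pieces.append(pd.read_hdf('qux.h5', 'data',
                              where='index >= "%s" & index <= "%s 23:59:59"' % (day, day)))

result = pd.concat(pieces).between_time('12:02', '12:04')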
