pandas read_hdf with 'where' condition limitation?


Question

I need to query an HDF5 file with a where clause containing three conditions, one of which is a list of length 30:

myList = list(xrange(30))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & date=dateString & time=timeString')

The query above gives me ValueError: too many inputs, and the error is reproducible.

If I reduce the length of the list to 29 (keeping all three conditions):

myList = list(xrange(29))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & date=dateString & time=timeString')

or reduce the number of conditions to only two (keeping the list length of 30), then it executes fine:

myList = list(xrange(30))

h5DF   = pd.read_hdf(h5Filename, 'df', where='index=myList & time=timeString')

Is this a known limitation? The pandas documentation at http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.pytables.read_hdf.html doesn't mention this limitation, and after searching this forum it seems nobody has encountered it yet.

The pandas version is 0.15.2. Any help is appreciated.

Answer


This is a defect: numpy/numexpr cannot handle more than 31 operands in the expression tree. An expression like foo=[1,2,3,4] in the where of an HDFStore generates an expression like (foo==1) | (foo==2) | ..., so list conditions are expanded, and with too many elements the expanded expression can fail.

HDFStore handles a single list operand on its own (in other words, foo=range(31) by itself is OK), but because you happen to have a nested sub-expression whose sub-nodes are themselves too long, it errors.
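The expansion described above can be sketched in plain Python (illustrative only, not pandas' internal code): each list element becomes one equality term, OR-ed together, and those terms count toward numexpr's operand ceiling along with the other & conditions.

```python
# Illustrative sketch of how a list condition in `where` is expanded:
# 'index=myList' becomes one '(index == v)' term per element, OR-ed together.
myList = list(range(30))
expanded = ' | '.join('(index == %d)' % v for v in myList)

# The 30 OR-ed terms from the list, plus the `date` and `time` conditions,
# push the total operand count past numexpr's ~31-operand limit.
n_operands = len(myList) + 2
print(n_operands)  # 32
```

This is why shortening the list to 29 or dropping one of the scalar conditions makes the same query succeed: either change brings the operand count back under the limit.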

Generally a better way to do this is to select a bigger range (e.g. the end-points of the selection for each operand), then do an in-memory .isin. It may even be faster, because HDF5 tends to be more efficient (IMHO) when selecting larger ranges, even though you bring more data into memory, than when making many individual selections.
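Applied to the original query, the workaround might look like the sketch below. The file and variable names (h5Filename, dateString, timeString) come from the question; the stand-in DataFrame merely simulates the result of the two-condition read so the sketch is self-contained.

```python
import pandas as pd
import numpy as np

myList = list(range(30))

# Query with only the two scalar conditions, which keeps the operand count
# small. With the real file this would be:
#   h5DF = pd.read_hdf(h5Filename, 'df',
#                      where='date=dateString & time=timeString')
# Here a stand-in DataFrame simulates that intermediate result:
h5DF = pd.DataFrame({'value': np.arange(100)}, index=np.arange(100))

# Apply the long list condition in memory instead of in the where clause.
result = h5DF[h5DF.index.isin(myList)]
print(len(result))  # 30
```

The in-memory .isin filter is not subject to the numexpr operand limit, so the list can be arbitrarily long.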
