Selecting columns from pandas.HDFStore table


Question

How can I retrieve specific columns from a pandas HDFStore? I regularly work with very large data sets that are too big to manipulate in memory. I would like to read in a CSV file iteratively, append each chunk to an HDFStore object, and then work with subsets of the data. I have read in a simple CSV file and loaded it into an HDFStore with the following code:

import pandas as pd

tmp = pd.HDFStore('test.h5')
chunker = pd.read_csv('cars.csv', iterator=True, chunksize=10, names=['make', 'model', 'drop'])
tmp.append('df', pd.concat([chunk for chunk in chunker], ignore_index=True))

And the output:

In [97]: tmp
Out[97]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df     frame_table (typ->appendable,nrows->1930,indexers->[index])

My question is: how do I access specific columns from tmp['df']? The documentation mentions a select() method and some Term objects. The examples provided are applied to Panel data, however, and I'm too much of a novice to extend them to the simpler data frame case. My guess is that I have to create an index of the columns somehow. Thanks!

Solution

HDFStore stores a table's columns grouped by type, each group as a single numpy array. A read always brings back all of the columns; you can then filter on them, so that only the columns you asked for are returned. In 0.10.0 you can pass a Term that involves columns:

store.select('df', [ Term('index', '>', Timestamp('20010105')), 
                     Term('columns', '=', ['A','B']) ])

Or you can reindex afterwards:

df = store.select('df', [ Term('index', '>', Timestamp('20010105')) ])
df = df.reindex(columns = ['A','B'])
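
On a recent pandas release, the same row filter and column selection can be written without Term objects. The following is only a sketch under stated assumptions, not part of the original answer: it assumes PyTables is installed, uses a made-up frame with columns A, B, C, writes to a temporary file with the hypothetical name demo.h5, and relies on the `where` string and `columns` keyword of `HDFStore.select`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up frame: 8 daily rows, columns A, B, C (names borrowed from the answer).
df = pd.DataFrame(
    np.arange(24, dtype="float64").reshape(8, 3),
    index=pd.date_range("2001-01-01", periods=8),
    columns=["A", "B", "C"],
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # append() writes the queryable table format.
        store.append("df", df)
        # Filter rows by index and keep only columns A and B in the result.
        subset = store.select(
            "df", where="index > '2001-01-05'", columns=["A", "B"]
        )

print(subset)
```

Note that all columns are still stored together on disk here; the `columns` argument only trims what is handed back, which matches the behaviour described above.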

The axes parameter is not really the solution here (what you actually created was, in effect, a stored transposed frame). This parameter allows you to re-order the storage of axes to enable data alignment in different ways. For a dataframe it really doesn't mean much; for 3d or 4d structures, on-disk data alignment is crucial for really fast queries.

0.10.1 will allow a more elegant solution, namely data columns: you can elect certain columns to be represented as their own columns in the table store, so you really can select just them. Here is a taste of what is coming.

store.append('df', df, columns = ['A','B','C'])
 store.select('df', [ 'A > 0', Term('index', '>', Timestamp(2000105)) ])
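
In released versions of pandas, data columns are requested through the `data_columns` keyword of `append`, and the query can be given as a single string. The sketch below is an illustration under assumptions rather than the answer's own code: it assumes PyTables is installed, and the frame contents and the file name demo.h5 are invented.

```python
import os
import tempfile

import pandas as pd

# Made-up frame: column A mixes positive and negative values so the filter does something.
df = pd.DataFrame(
    {
        "A": [3.0, -1.0, 2.0, -4.0, 5.0, 6.0, -7.0, 8.0],
        "B": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "C": [0.5] * 8,
    },
    index=pd.date_range("2000-01-01", periods=8),
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # A and B become individually queryable columns in the on-disk table.
        store.append("df", df, data_columns=["A", "B"])
        # Combine a condition on data column A with a condition on the index.
        hits = store.select("df", where="A > 0 & index > '2000-01-05'")

print(hits)
```

Only columns listed in `data_columns` can appear in the `where` expression; a condition on C would be rejected in this setup.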

Another way to go about this is to store separate tables in different nodes of the file; then you can select only what you need.
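
A minimal sketch of the separate-nodes idea, assuming PyTables is installed. The node names wide/ab and wide/c, the frame, and the file name are all hypothetical and chosen only for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up frame with three numeric columns.
df = pd.DataFrame(
    np.arange(15, dtype="float64").reshape(5, 3),
    index=pd.date_range("2001-01-01", periods=5),
    columns=["A", "B", "C"],
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # Write each group of columns to its own node; both share the same index.
        store.append("wide/ab", df[["A", "B"]])
        store.append("wide/c", df[["C"]])
        # Reading one node touches only the columns stored there.
        ab = store.select("wide/ab")
        # The pieces can be put back together on the shared index when needed.
        rejoined = pd.concat([ab, store.select("wide/c")], axis=1)

print(ab)
```

The trade-off is that you have to keep the nodes aligned yourself, for example by always appending the same rows to every node.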

In general, I recommend against really wide tables. hayden offers up the Panel solution, which might be a benefit for you now, as the actual data arrangement should reflect how you want to query the data.
