How to query an HDF store using Pandas/Python


Problem description

To manage the amount of RAM I consume in doing an analysis, I have a large dataset stored in hdf5 (.h5) and I need to query this dataset efficiently using Pandas.

The dataset contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 available, and then filter the resulting dataframe to only those users who use one of the few apps that interest me.

import pandas as pd

# list of apps I want to analyze
apps = ['a','d','f']

# Users.h5 contains only one field_table called 'df'
store = pd.HDFStore('Users.h5')

# the following query works fine
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1'])

# the following pseudo-query fails
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1', 'app in apps'])

I realize that the string 'app in apps' is not what I want. It is simply a SQL-like representation of what I hope to achieve. I can't seem to pass a list of strings in any way that I try, but there must be a way.

For now I simply run the query without this condition and then filter out the apps I don't want in a subsequent step:

df = df[df['app'].isin(apps)]

But this is much less efficient, since the rows for ALL of the apps must first be loaded into memory before I can remove the ones I don't want. In some cases this is a big problem, because I don't have enough memory to hold the whole unfiltered df.

Solution

You are pretty close.

In [1]: df = pd.DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                           'B' : [1,2,1,2,1],
                           'C' : np.random.randn(5) })

In [2]: df
Out[2]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

Write the frame to the store as a table (note that in 0.12 you would use table=True rather than format='table'). Remember to specify the data_columns that you want to query when creating the table (or pass data_columns=True to make every column queryable).

In [3]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])

In [4]: pd.read_hdf('test.h5','df')
Out[4]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

In master/0.13, isin is accomplished via query_column=list_of_values, passed to where as a string.

In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]

In 0.12, where must be a list of terms (the conditions in the list are ANDed together).

In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]
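
Applied to the question's store, the same idea could be sketched as follows. This is a minimal sketch, not part of the original answer: the helper names isin_clause and and_clauses are hypothetical, and it assumes that Users.h5 was written in table format with Month, IsMessager and app declared as data_columns. It only builds the where string in the 0.13-style syntax shown above; the actual store.select call is left as a comment.

```python
def isin_clause(column, values):
    """Build an isin-style condition: column == [v1, v2, ...].

    Hypothetical helper. repr() of a list of plain strings or ints is a
    valid list literal, which is the form the where string expects.
    """
    return f"({column} == {list(values)!r})"


def and_clauses(*clauses):
    """Join individual conditions with '&' so that all of them must hold."""
    return " & ".join(clauses)


# list of apps to analyze (taken from the question)
apps = ['a', 'd', 'f']

where = and_clauses("(Month == 10)", "(IsMessager == 1)", isin_clause("app", apps))
print(where)

# Assumed usage against the question's store (requires PyTables and a
# table-format store whose data_columns include Month, IsMessager and app):
# df = store.select('df', columns=['account', 'metric1', 'metric2'], where=where)
```

Because the filter is part of the query, only the matching rows are read from disk, which is the point of avoiding the in-memory isin step.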
