Read HDF5 file to pandas DataFrame with conditions


Question

I have a huge HDF5 file and want to load part of it into a pandas DataFrame to perform some operations, but I only need certain rows.

An example explains it better:

The original HDF5 file looks something like this:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

I want to load this, exactly as it is, into a pandas DataFrame, but only the rows where A is 1, 3 or 4.

So far I can only load the whole HDF5 file, using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

Answer

The HDF5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.
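The difference between the two formats can be seen in a small sketch. It assumes PyTables is installed and that pandas' default format is still 'fixed' (the `io.hdf.default_format` option has not been changed). The file names and the tiny DataFrame are made up for illustration, and the query uses the index, which a table-format store can filter on even without data columns.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [0, 0, 1, 1]})

tmpdir = tempfile.mkdtemp()
fixed_path = os.path.join(tmpdir, 'fixed.h5')
table_path = os.path.join(tmpdir, 'table.h5')

# Without format='table' the store is written in fixed format,
# which has to be read in its entirety.
df.to_hdf(fixed_path, key='results_table', mode='w')
try:
    pd.read_hdf(fixed_path, key='results_table', where='index > 1')
    fixed_queryable = True
except (TypeError, ValueError) as exc:
    fixed_queryable = False
    print('fixed format rejected the query:', exc)

# The same data written in table format accepts a where condition.
df.to_hdf(table_path, key='results_table', mode='w', format='table')
subset = pd.read_hdf(table_path, key='results_table', where='index > 1')
print(subset)
```

If the existing file was written in fixed format, it most likely has to be read once in full and rewritten in table format before a where query will work.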

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=True,
          format='table')

Then you can use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value of column A is 1, 3 or 4. For example,

import pandas as pd

# Sample data matching the table in the question.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

# Write in table format and declare A as a data column so it can be queried.
df.to_hdf('/tmp/out.h5', key='results_table', mode='w', data_columns=['A'],
          format='table')

# Read back only the rows where A is 1, 3 or 4.
print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13
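Since the question opens the file through pd.HDFStore, the same query can probably also be expressed with HDFStore.select, which accepts the same where strings. The following is a minimal sketch, not the answer's own code. It assumes PyTables is installed and writes a small made-up DataFrame to a temporary file rather than the asker's real file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the asker's real file is not used here.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 1], 'B': [0, 0, 1, 1, 1, 0]})
path = os.path.join(tempfile.mkdtemp(), 'out.h5')

# Table format with A as a data column, as described in the answer.
df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
          format='table')

# HDFStore works as a context manager, so the file is closed afterwards.
with pd.HDFStore(path, mode='r') as store:
    selected = store.select('results_table', where='A in [1, 3, 4]')

print(selected)
```

Keeping the store open this way can be convenient when several different queries are run against the same file.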


If you have a very long list of values, vals, you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
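As a sketch of how this fits together, the snippet below builds the where string from a Python list and compares the result with an in-memory filter. The temporary file, the sample data and the short vals list are assumptions for illustration, and PyTables must be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data and file location.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 1, 2, 3],
                   'B': [0, 0, 1, 1, 1, 0, 1, 0]})
path = os.path.join(tempfile.mkdtemp(), 'out.h5')
df.to_hdf(path, key='results_table', mode='w', data_columns=['A'],
          format='table')

vals = [1, 3, 4]

# str.format inserts the list's repr, giving a valid where expression.
where = 'A in {}'.format(vals)
print(where)  # A in [1, 3, 4]

result = pd.read_hdf(path, key='results_table', where=where)

# The same selection done in memory, for comparison.
expected = df[df['A'].isin(vals)]
print(result['A'].tolist() == expected['A'].tolist())
```

This works as long as the repr of each value is a literal the query parser accepts, which holds for plain integers and strings.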
