PyTables read random subset

Views: 115 | Published: 2020/11/22 19:10:16 | Tags: python pandas hdf5 pytables

Question

Is it possible to read a random subset of rows from HDF5 (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of a few thousand for analysis. And what about reading from a compressed HDF file?

Recommended answer

Use HDFStore: the HDFStore docs are here, and the compression docs are here.

Random access via a constructed index is supported as of pandas 0.13:

In [26]: df = pd.DataFrame(np.random.randn(100,2),columns=['A','B'])

In [27]: df.to_hdf('test.h5','df',mode='w',format='table')

In [28]: store = pd.HDFStore('test.h5')

In [29]: nrows = store.get_storer('df').nrows

In [30]: nrows
Out[30]: 100

In [32]: r = np.random.randint(0,nrows,size=10)

In [33]: r
Out[33]: array([69, 28,  8,  2, 14, 51, 92, 25, 82, 64])

In [34]: pd.read_hdf('test.h5','df',where=pd.Index(r))
Out[34]: 
           A         B
69 -0.370739 -0.325433
28  0.155775  0.961421
8   0.101041 -0.047499
2   0.204417  0.470805
14  0.599348  1.174012
51  0.634044 -0.769770
92  0.240077 -0.154110
25  0.367211 -1.027087
82 -0.698825 -0.084713
64 -1.029897 -0.796999

[10 rows x 2 columns]
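
The session above can be condensed into a small script. This is only a sketch under a few assumptions: `sample_rows` is a hypothetical helper name (not part of pandas), `np.random.default_rng(...).choice(..., replace=False)` is swapped in for `np.random.randint` so that no row number is drawn twice, and the file is written with `complib="zlib"` purely to illustrate that a compressed table is read the same way as an uncompressed one.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def sample_rows(path, key, n, seed=None):
    """Hypothetical helper: read n distinct random rows from a table-format node."""
    with pd.HDFStore(path, mode="r") as store:
        # Number of rows stored on disk (available for format='table' nodes)
        nrows = store.get_storer(key).nrows
        rng = np.random.default_rng(seed)
        # replace=False avoids duplicate row numbers; sorting gives file order
        rows = np.sort(rng.choice(nrows, size=n, replace=False))
        # An integer Index passed as `where` is treated as row coordinates
        return store.select(key, where=pd.Index(rows))


# Demo on a throwaway file; compression settings are illustrative assumptions
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.h5")
df = pd.DataFrame(np.random.randn(100, 2), columns=["A", "B"])
df.to_hdf(path, key="df", mode="w", format="table", complib="zlib", complevel=5)

sample = sample_rows(path, "df", n=10, seed=0)
print(sample)
```

Only the selected rows are returned as a DataFrame; the compression is handled transparently on read, so no extra argument is needed when selecting.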

To include an additional condition, you would do this:

# make sure that we have indexable columns
df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)

# select where the index (an integer index) matches r and A > 0
In [14]: r
Out[14]: array([33, 51, 33, 95, 69, 21, 43, 58, 58, 58])

In [13]: pd.read_hdf('test.h5','df',where='index=r & A>0')
Out[13]: 
           A         B
21  1.456244  0.173443
43  0.174464 -0.444029

[2 rows x 2 columns]
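
A related variant, offered as a sketch rather than as the answer's own method: if the goal is a random sample drawn only from rows that satisfy the condition, one option is to ask the store for the matching row coordinates first with `select_as_coordinates` and then sample from those. The sample size of 5 and the use of `np.random.default_rng` are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.h5")
df = pd.DataFrame(np.random.randn(100, 2), columns=["A", "B"])
# data_columns=True makes columns A and B queryable on disk
df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

rng = np.random.default_rng(0)
with pd.HDFStore(path, mode="r") as store:
    # Row numbers of every row where A > 0, evaluated on disk
    coords = store.select_as_coordinates("df", "A > 0")
    # Illustrative sample size, capped by the number of matching rows
    k = min(5, len(coords))
    picked = np.sort(rng.choice(coords.to_numpy(), size=k, replace=False))
    # Read only the sampled rows
    subset = store.select("df", where=pd.Index(picked))

print(subset)
```

Unlike `where='index=r & A>0'`, which filters an already drawn sample and may return fewer rows than requested, this draws the sample after filtering, so it returns exactly `k` rows whenever at least `k` rows match.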
