Indexing and Data Columns in Pandas/PyTables
Question
http://pandas.pydata.org/pandas-docs/stable/io.html#indexing
I'm really confused about the concept of data columns in Pandas HDF5 IO, and googling turns up little to no information about it. Since I'm diving into Pandas for a large project that involves HDF5 storage, I'd like to be clear about such concepts.
The documentation says:
You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns
This is confusing:
- "other than the indexable columns, which you can always query": What are 'indexable' columns? Aren't all columns 'indexable'? What does this term mean?
- "For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query.": How is that different from normal querying on a PyTables table, with or without any indexes on data_columns?
- What is the fundamental difference between a non-indexed, an indexed, and a data_column column?
Recommended answer
You should just try it.
In [20]: import numpy as np
In [21]: import pandas as pd
In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])
In [23]: store = pd.HDFStore('test.h5',mode='w')
In [24]: store.append('df_only_indexables',df)
In [25]: store.append('df_with_data_columns',df,data_columns=True)
In [26]: store.append('df_no_index',df,data_columns=True,index=False)
In [27]: store
Out[27]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/df_with_data_columns frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
In [28]: store.close()
- You automatically get the index of the stored frame as a queryable column. By default, NO other columns can be queried.
- If you specify data_columns=True or data_columns=list_of_columns, then these columns are stored separately and can subsequently be queried.
- If you specify index=False, then a PyTables index is not automatically created for the queryable columns (e.g. the index and/or the data_columns).
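The three points above can be tried out with a short script. This is only a sketch: the file name, the key df_partial, and the sample values are made up for illustration, and it assumes PyTables is installed. In the pandas versions I am aware of, querying a column that is not a data column is reported as a ValueError.

```python
import os
import tempfile

import pandas as pd

# Small deterministic frame so the results are easy to check by eye.
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Hypothetical scratch file, not the test.h5 used in the answer.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)
    store.append("df_partial", df, data_columns=["A"])  # only A becomes queryable

    # The frame's index is always queryable.
    by_index = store.select("df_only_indexables", where="index >= 3")

    # A is a data column here, so the filter is evaluated on disk.
    by_a = store.select("df_with_data_columns", where="A > 0")

    # A data column and the index can be combined in one expression.
    combined = store.select("df_partial", where="(A > 0) & (index < 4)")

    # B was not declared a data column in df_partial, so it cannot be queried.
    try:
        store.select("df_partial", where="B > 25")
        b_query_failed = False
    except ValueError:
        b_query_failed = True

print(by_index.index.tolist(), by_a.index.tolist(), combined.index.tolist())
print("query on non-data column failed:", b_query_failed)
```

Note that every select still returns all columns of the matching rows; data_columns only controls which columns may appear in the where expression.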
To see the actual indexes being created (the PyTables indexes), see the output below. colindexes defines which columns have an actual PyTables index created. (I have truncated it somewhat.)

/df_no_index/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
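The same information can also be read from Python instead of from a dump. The sketch below is one way to do it, not necessarily how the output above was produced: it goes through HDFStore.get_storer to reach the underlying PyTables table and asks each on-disk column whether it has an index. The helper name describe_table and the scratch file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def describe_table(store, key):
    """Return the on-disk column names and the subset that has a PyTables index."""
    table = store.get_storer(key).table  # the underlying PyTables Table node
    columns = list(table.colnames)
    # _f_col is PyTables' accessor for a column by name; is_indexed reports
    # whether an index node exists for that column.
    indexed = sorted(name for name in columns if table.cols._f_col(name).is_indexed)
    return columns, indexed


df = pd.DataFrame(np.random.randn(5, 2), columns=["A", "B"])
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")  # hypothetical scratch file

layout = {}
with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)
    store.append("df_no_index", df, data_columns=True, index=False)

    for key in ("df_only_indexables", "df_with_data_columns", "df_no_index"):
        layout[key] = describe_table(store, key)

for key, (columns, indexed) in layout.items():
    print(f"{key}: columns={columns} indexed={indexed}")
```

Without data columns, A and B are packed into a single values_block_0 column, which matches the description shown in the dump for /df_only_indexables.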
So if you want to query a column, make it a data_column. If you don't, the columns are stored in blocks by dtype (faster / less space).

You normally always want to index a column for retrieval, BUT if you are creating a store and then appending multiple files to it, you usually turn off index creation and do it at the end (as the index is pretty expensive to create as you go).
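A rough sketch of that append-then-index workflow follows. The chunks here are randomly generated stand-ins for the "multiple files", and the key name, chunk sizes, and the optlevel/kind settings are illustrative choices rather than recommendations from the answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-ins for several input files, each loaded as one DataFrame.
chunks = [pd.DataFrame(rng.standard_normal((1000, 2)), columns=["A", "B"])
          for _ in range(3)]

path = os.path.join(tempfile.mkdtemp(), "bulk_demo.h5")  # hypothetical scratch file


def indexed_columns(store, key):
    """Names of the on-disk columns that currently have a PyTables index."""
    table = store.get_storer(key).table
    return sorted(n for n in table.colnames if table.cols._f_col(n).is_indexed)


with pd.HDFStore(path, mode="w") as store:
    # Skip index maintenance while the data is being loaded.
    for chunk in chunks:
        store.append("df", chunk, data_columns=["A"], index=False)
    indexed_before = indexed_columns(store, "df")

    # Build the indexes once, after the last append.
    store.create_table_index("df", optlevel=9, kind="full")
    indexed_after = indexed_columns(store, "df")

    hits = store.select("df", where="A > 0")

print(indexed_before, indexed_after, len(hits))
```

Queries work with or without the PyTables index; the index only affects how quickly the matching rows are found.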
See the cookbook for a menagerie of questions.