Indexing and Data Columns in Pandas/PyTables

Question

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I'm really confused about the concept of data columns in Pandas HDF5 IO, and googling turns up very little information about it. Since I'm diving into Pandas for a large project that involves HDF5 storage, I'd like to be clear about such concepts.

The docs say:

You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns

This is confusing:

  1. other than the indexable columns, which you can always query: What are 'indexable' columns? Aren't all columns 'indexable'? What does this term mean?

  2. For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query: How is that different from normal querying on a PyTables table, with or without any indexes of data_columns?

  3. What is the fundamental difference between a non-indexed, an indexed, and a data_column column?

Recommended answer

You should just try it.

In [22]: df = pd.DataFrame(np.random.randn(5,2),columns=['A','B'])

In [23]: store = pd.HDFStore('test.h5',mode='w')

In [24]: store.append('df_only_indexables',df)

In [25]: store.append('df_with_data_columns',df,data_columns=True)

In [26]: store.append('df_no_index',df,data_columns=True,index=False)

In [27]: store
Out[27]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
/df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

In [28]: store.close()

    • You automatically get the index of the stored frame as a queryable column. By default, NO other columns can be queried.

    • If you specify data_columns=True or data_columns=list_of_columns, then these columns are stored separately and can subsequently be queried.

    • If you specify index=False, then a PyTables index is not automatically created for the queryable columns (e.g. the index and/or the data_columns).
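A minimal sketch of what "queryable" means in practice, assuming pandas with PyTables installed. The file name and the small deterministic frame are made up for illustration, and the exact exception raised for a non-queryable column may differ between pandas versions, so the sketch catches it broadly:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame: A = 0, 2, 4, 6, 8 and B = 1, 3, 5, 7, 9.
df = pd.DataFrame(np.arange(10, dtype="float64").reshape(5, 2), columns=["A", "B"])
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")  # hypothetical file name

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)

    # The frame's index can always be queried on disk.
    by_index = store.select("df_only_indexables", where="index >= 3")

    # A is a data column in this table, so it can be queried as well.
    by_column = store.select("df_with_data_columns", where="A > 2")

    # In the first table A is packed inside a values block, so the same
    # query is rejected.
    try:
        store.select("df_only_indexables", where="A > 2")
        query_failed = False
    except Exception:
        query_failed = True
```

Here by_index holds the rows with index 3 and 4, by_column holds the rows where A is greater than 2, and query_failed records that the query on the table without data columns was refused.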

      To see the actual indexes being created (the PyTables indexes), see the output below. colindexes lists the columns that have an actual PyTables index. (I have truncated it somewhat.)

      /df_no_index/table (Table(5,)) ''
        description := {
        "index": Int64Col(shape=(), dflt=0, pos=0),
        "A": Float64Col(shape=(), dflt=0.0, pos=1),
        "B": Float64Col(shape=(), dflt=0.0, pos=2)}
        byteorder := 'little'
        chunkshape := (2730,)
        /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
         [A_dtype := 'float64',
          A_kind := ['A'],
          B_dtype := 'float64',
          B_kind := ['B'],
          CLASS := 'TABLE',
          FIELD_0_FILL := 0,
          FIELD_0_NAME := 'index',
          FIELD_1_FILL := 0.0,
          FIELD_1_NAME := 'A',
          FIELD_2_FILL := 0.0,
          FIELD_2_NAME := 'B',
          NROWS := 5,
          TITLE := '',
          VERSION := '2.7',
          index_kind := 'integer']
      /df_only_indexables/table (Table(5,)) ''
        description := {
        "index": Int64Col(shape=(), dflt=0, pos=0),
        "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
        byteorder := 'little'
        chunkshape := (2730,)
        autoindex := True
        colindexes := {
          "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
        /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
         [CLASS := 'TABLE',
          FIELD_0_FILL := 0,
          FIELD_0_NAME := 'index',
          FIELD_1_FILL := 0.0,
          FIELD_1_NAME := 'values_block_0',
          NROWS := 5,
          TITLE := '',
          VERSION := '2.7',
          index_kind := 'integer',
          values_block_0_dtype := 'float64',
          values_block_0_kind := ['A', 'B']]
      /df_with_data_columns/table (Table(5,)) ''
        description := {
        "index": Int64Col(shape=(), dflt=0, pos=0),
        "A": Float64Col(shape=(), dflt=0.0, pos=1),
        "B": Float64Col(shape=(), dflt=0.0, pos=2)}
        byteorder := 'little'
        chunkshape := (2730,)
        autoindex := True
        colindexes := {
          "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
          "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
          "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
        /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
         [A_dtype := 'float64',
          A_kind := ['A'],
          B_dtype := 'float64',
          B_kind := ['B'],
          CLASS := 'TABLE',
          FIELD_0_FILL := 0,
          FIELD_0_NAME := 'index',
          FIELD_1_FILL := 0.0,
          FIELD_1_NAME := 'A',
          FIELD_2_FILL := 0.0,
          FIELD_2_NAME := 'B',
          NROWS := 5,
          TITLE := '',
          VERSION := '2.7',
          index_kind := 'integer']
      

      So if you want to query a column, make it a data_column. If you don't, the columns will be stored in blocks by dtype (faster / less space).
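The layout difference can be sketched as follows, assuming pandas with PyTables installed. The file name, key, and three-column frame are invented for illustration. Passing a list to data_columns splits out only the named columns, while the rest stay packed in a dtype block:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative frame: A = 0, 3, 6, 9, 12; B and C are the neighbouring values.
df = pd.DataFrame(
    np.arange(15, dtype="float64").reshape(5, 3), columns=["A", "B", "C"]
)
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")  # hypothetical file name

with pd.HDFStore(path, mode="w") as store:
    # Only A becomes a data column; B and C are not stored as their own columns.
    store.append("df", df, data_columns=["A"])

    # Column names of the underlying PyTables table.
    colnames = set(store.get_storer("df").table.colnames)

    # A is queryable, and the result still contains every column of the frame.
    result = store.select("df", where="A >= 6")
```

In colnames, A appears next to index as a column of its own, whereas B and C do not appear by name because they live inside a values block.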

      You normally always want to index a column for retrieval, BUT if you are creating a store and then appending multiple files to it, you usually turn off index creation and do it at the end (as indexes are pretty expensive to create as you go).
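One way to sketch that append-then-index pattern, assuming pandas with PyTables installed. The chunks below stand in for data read from separate files, and the file name and key are hypothetical:

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "deferred_index_demo.h5")  # hypothetical

# Three made-up chunks standing in for data loaded from separate files.
chunks = [
    pd.DataFrame(
        {"A": np.arange(start, start + 4, dtype="float64"), "B": 1.0},
        index=range(start, start + 4),
    )
    for start in (0, 4, 8)
]

with pd.HDFStore(path, mode="w") as store:
    for chunk in chunks:
        # index=False skips PyTables index creation on every append.
        store.append("df", chunk, data_columns=["A"], index=False)

    indexed_before = store.get_storer("df").table.cols.A.is_indexed

    # Create the PyTables indexes once, after all the data has been appended.
    store.create_table_index("df")

    indexed_after = store.get_storer("df").table.cols.A.is_indexed
    total_rows = len(store.select("df"))
    matches = store.select("df", where="A >= 10")
```

Queries on A work both before and after create_table_index is called. The index only affects how quickly PyTables can locate the matching rows, which matters on large tables much more than in this toy example.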

      See the cookbook for a menagerie of questions.
