How can I generalize my pandas data grouping to more than 3 dimensions?


Question

I'm using the excellent pandas package to deal with a large amount of varied meteorological diagnostic data, and I'm quickly running out of dimensions as I stitch the data together. Looking at the documentation, it seems that a MultiIndex may solve my problem, but I'm not sure how to apply it to my situation - the documentation shows examples of creating MultiIndexes with random data and DataFrames, but not with Series that hold pre-existing timeseries data.

Background

The basic data structure I'm using contains two main fields:

  • metadata, which is a dictionary of key-value pairs describing what the numbers are
  • data, which is a pandas data structure containing the numbers themselves.

The lowest common denominator is timeseries data, so the basic structure has a pandas Series object as the data entry, and the metadata field describes what those numbers actually are (e.g. vector RMS error for 10-meter wind over the Eastern Pacific for a 24-hour forecast from experiment Test1).

I'm looking at taking that lowest common denominator and gluing the various timeseries together to make the results more useful and allow for easy combinations. For instance, I may want to look at all the different lead times - I have a filter routine that takes my timeseries that share the same metadata entries except for lead time (e.g. experiment, region, etc.) and returns a new object where the metadata field consists of only the common entries (i.e. Lead Time has been removed) and the data field is now a pandas DataFrame with the column labels given by the Lead Time value. I can extend this again and take the resulting frames and group them together with only one other entry varying (e.g. the Experiment) to give me a pandas Panel, where the item index is given by the Experiment metadata values from the constituent frames and the object's new metadata contains neither Lead Time nor Experiment.

When I iterate over these composite objects, I have an iterseries routine for the frame and an iterframes routine for the panel that reconstruct the appropriate metadata/data pairing as I drop one dimension (i.e. a series taken from the frame with lead time varying across the columns will have all the metadata of its parent, plus the Lead Time field restored with the value taken from the column label). This works great.

Problem

I've run out of dimensions (a Panel only goes up to 3-D), and I'm also not able to use things like dropna to remove empty columns once everything is aligned in the Panel (this has led to several bugs when plotting summary statistics). Reading about using pandas with higher-dimensional data led me to the MultiIndex and its use. I've tried the examples given in the documentation, but I'm still a little unclear on how to apply it to my situation. Any direction would be useful. I'd like to be able to:

  • Combine my Series-based data into a multi-indexed DataFrame along an arbitrary number of dimensions (this would be great - it would eliminate one call to create the frames from the series, and then another to create the panels from the frames)
  • Iterate over the resulting multi-indexed DataFrame, dropping a single dimension so I can reset the component metadata.

Edit - adding a code example

Wes McKinney's answer below is almost exactly what I need - the issue is in the initial translation from the Series-backed storage objects I have to work with to my DataFrame-backed objects once I start grouping elements together. The DataFrame-backed class has the following method, which takes a list of the Series-based objects and the metadata field that will vary across the columns.

@classmethod
def from_list(cls, results_list, column_key):
    """
    Populate object from a list of results that all share the metadata except
    for the field `column_key`.

    """
    # Need two copies of the input results - one for building the object
    # data and one for building the object metadata
    for_data, for_metadata = itertools.tee(results_list)

    self             = cls()
    self.column_key  = column_key
    self.metadata    = next(for_metadata).metadata.copy()
    if column_key in self.metadata:
        del self.metadata[column_key]
    self.data = pandas.DataFrame(dict(((transform(r[column_key]), r.data)
                                        for r in for_data)))
    return self

Once I have the frame given by this routine, I can easily apply the various operations suggested below. Of particular utility is being able to use the names field when I call concat - this eliminates the need to store the name of the column key internally, since it's stored in the MultiIndex as the name of that index dimension.
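A minimal sketch of that use of names, assuming two small frames whose columns are lead times in hours. The frames, their values, and the level names "Experiment" and "Lead Time" are illustrative stand-ins, not the actual data or classes from this post:

```python
import pandas as pd

# Hypothetical stand-ins for two DataFrames whose columns are lead times (hours).
index = pd.date_range("2012-04-16", periods=3, freq="D")
test1 = pd.DataFrame({24: [0.1, 0.2, 0.3], 48: [0.4, 0.5, 0.6]}, index=index)
test2 = pd.DataFrame({24: [0.2, 0.3, 0.4], 48: [0.5, 0.6, 0.7]}, index=index)

# `keys` labels each input frame; `names` labels the column levels themselves.
combined = pd.concat(
    [test1, test2],
    axis=1,
    keys=["Test1", "Test2"],
    names=["Experiment", "Lead Time"],
)

# The level names travel with the data, so a separate column_key attribute
# is not needed to remember what each level means.
print(combined.columns.names)
```

Selecting combined["Test2"] then gives back a frame whose remaining column level is still named "Lead Time".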

I'd like to be able to implement the solution below by just taking in the list of matching Series-backed classes and a list of keys, and doing the grouping sequentially. However, I don't know ahead of time what the columns will represent, so:

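  • Storing the Series data as a 1-column DataFrame doesn't really make sense to me
  • I don't see how to set the names of the index and the columns from the initial Series -> Frame grouping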

Recommended answer

I might suggest using pandas.concat along with its keys argument to glue together Series or DataFrames and create a MultiIndex in the columns:

In [20]: data
Out[20]: 
{'a': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D,
 'b': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D,
 'c': 2012-04-16    0
2012-04-17    1
2012-04-18    2
2012-04-19    3
2012-04-20    4
2012-04-21    5
2012-04-22    6
2012-04-23    7
2012-04-24    8
2012-04-25    9
Freq: D}

In [21]: df = pd.concat(data, axis=1, keys=['a', 'b', 'c'])

In [22]: df
Out[22]: 
            a  b  c
2012-04-16  0  0  0
2012-04-17  1  1  1
2012-04-18  2  2  2
2012-04-19  3  3  3
2012-04-20  4  4  4
2012-04-21  5  5  5
2012-04-22  6  6  6
2012-04-23  7  7  7
2012-04-24  8  8  8
2012-04-25  9  9  9

In [23]: df2 = pd.concat([df, df], axis=1, keys=['group1', 'group2'])

In [24]: df2
Out[24]: 
            group1        group2      
                 a  b  c       a  b  c
2012-04-16       0  0  0       0  0  0
2012-04-17       1  1  1       1  1  1
2012-04-18       2  2  2       2  2  2
2012-04-19       3  3  3       3  3  3
2012-04-20       4  4  4       4  4  4
2012-04-21       5  5  5       5  5  5
2012-04-22       6  6  6       6  6  6
2012-04-23       7  7  7       7  7  7
2012-04-24       8  8  8       8  8  8
2012-04-25       9  9  9       9  9  9

You then have:

In [25]: df2['group2']
Out[25]: 
            a  b  c
2012-04-16  0  0  0
2012-04-17  1  1  1
2012-04-18  2  2  2
2012-04-19  3  3  3
2012-04-20  4  4  4
2012-04-21  5  5  5
2012-04-22  6  6  6
2012-04-23  7  7  7
2012-04-24  8  8  8
2012-04-25  9  9  9

or even

In [27]: df2.xs('b', axis=1, level=1)
Out[27]: 
            group1  group2
2012-04-16       0       0
2012-04-17       1       1
2012-04-18       2       2
2012-04-19       3       3
2012-04-20       4       4
2012-04-21       5       5
2012-04-22       6       6
2012-04-23       7       7
2012-04-24       8       8
2012-04-25       9       9
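The same xs call can be used to walk over one level and drop it, which is roughly what the iterseries/iterframes routines in the question do. This is a sketch only: the helper iter_level and the level names "group" and "letter" are hypothetical, introduced here for illustration.

```python
import pandas as pd

index = pd.date_range("2012-04-16", periods=3, freq="D")
df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 1, 2], "c": [0, 1, 2]}, index=index)
df2 = pd.concat([df, df], axis=1, keys=["group1", "group2"],
                names=["group", "letter"])


def iter_level(frame, level):
    """Yield (label, sub-frame) pairs, dropping `level` from the columns."""
    for label in frame.columns.get_level_values(level).unique():
        # xs removes the selected level, leaving the remaining ones intact.
        yield label, frame.xs(label, axis=1, level=level)


for label, sub in iter_level(df2, "letter"):
    # `label` is the value that would be restored into the metadata
    # for this slice; `sub` has one fewer column level than df2.
    print(label, list(sub.columns))
```

Because the level is addressed by name, the same helper works no matter how many levels the columns carry.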

You can have arbitrarily many levels:

In [29]: pd.concat([df2, df2], axis=1, keys=['tier1', 'tier2'])
Out[29]: 
             tier1                       tier2                    
            group1        group2        group1        group2      
                 a  b  c       a  b  c       a  b  c       a  b  c
2012-04-16       0  0  0       0  0  0       0  0  0       0  0  0
2012-04-17       1  1  1       1  1  1       1  1  1       1  1  1
2012-04-18       2  2  2       2  2  2       2  2  2       2  2  2
2012-04-19       3  3  3       3  3  3       3  3  3       3  3  3
2012-04-20       4  4  4       4  4  4       4  4  4       4  4  4
2012-04-21       5  5  5       5  5  5       5  5  5       5  5  5
2012-04-22       6  6  6       6  6  6       6  6  6       6  6  6
2012-04-23       7  7  7       7  7  7       7  7  7       7  7  7
2012-04-24       8  8  8       8  8  8       8  8  8       8  8  8
2012-04-25       9  9  9       9  9  9       9  9  9       9  9  9
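When the varying keys are only known at run time, one alternative to nesting concat calls is to concatenate the Series once and assign the column MultiIndex directly with MultiIndex.from_tuples. The sketch below assumes the Series-backed objects can be reduced to (metadata dict, Series) pairs; the records, their values, and the combine helper are hypothetical. It also shows dropna removing columns that are entirely missing, which the question reports as a problem with Panel.

```python
import pandas as pd

index = pd.date_range("2012-04-16", periods=3, freq="D")
nan = float("nan")

# Hypothetical (metadata, Series) pairs standing in for the Series-backed objects.
records = [
    ({"Experiment": "Test1", "Lead Time": 24}, pd.Series([1.0, 2.0, 3.0], index=index)),
    ({"Experiment": "Test1", "Lead Time": 48}, pd.Series([2.0, 3.0, 4.0], index=index)),
    ({"Experiment": "Test2", "Lead Time": 24}, pd.Series([1.5, 2.5, 3.5], index=index)),
    ({"Experiment": "Test2", "Lead Time": 48}, pd.Series([nan, nan, nan], index=index)),
]


def combine(records, column_keys):
    """Build a frame with one column level per entry in `column_keys`."""
    tuples = [tuple(meta[key] for key in column_keys) for meta, _ in records]
    frame = pd.concat([series for _, series in records], axis=1)
    # Each column is labelled by its metadata values; the level names
    # record which metadata field each level came from.
    frame.columns = pd.MultiIndex.from_tuples(tuples, names=list(column_keys))
    return frame


frame = combine(records, ["Experiment", "Lead Time"])

# Drop columns that contain no data at all after alignment.
cleaned = frame.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Adding a third varying field (a region, say) only means passing a longer column_keys list; no additional container type is required.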
