Multiindexed Pandas groupby, ignore a level?

Problem description

I'm running a groupby operation on a multiindexed DataFrame similar to this one:

                                        0         1    ...
categories features subfeatures                    
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895    ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235    ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721    ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417

...                    ...               ...

And it can be generated using the following code:

import pandas as pd, numpy as np
np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"], 
                                    ["subfeature1", "subfeature2"]], 
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)

I am trying to select only the groups in which the maximum difference between the two subfeature arrays exceeds a certain threshold, but I'm having trouble with groupby:

df.groupby(level=['categories','features'])

This gives me the following groups:

{('cat1', 'feature1'): [('cat1', 'feature1', 'subfeature1'),
  ('cat1', 'feature1', 'subfeature2')],
 ('cat1', 'feature2'): [('cat1', 'feature2', 'subfeature1'),
  ('cat1', 'feature2', 'subfeature2')],
 ('cat2', 'feature1'): [('cat2', 'feature1', 'subfeature1'),
  ('cat2', 'feature1', 'subfeature2')],
 ('cat2', 'feature2'): [('cat2', 'feature2', 'subfeature1'),
  ('cat2', 'feature2', 'subfeature2')],
 ('cat3', 'feature1'): [('cat3', 'feature1', 'subfeature1'),
  ('cat3', 'feature1', 'subfeature2')],
 ('cat3', 'feature2'): [('cat3', 'feature2', 'subfeature1'),
  ('cat3', 'feature2', 'subfeature2')]}

Is there any way to group so that the subfeatures level is ignored by the groupby function? I need subfeature1 and subfeature2 together; in separate groups they're worthless.

So ideally I would want the groupby to return something like this:

{('cat1', 'feature1'): [('cat1', 'feature1')],
 ('cat1', 'feature2'): [('cat1', 'feature2')],
 ('cat2', 'feature1'): [('cat2', 'feature1')],
 ('cat2', 'feature2'): [('cat2', 'feature2')],
 ('cat3', 'feature1'): [('cat3', 'feature1')],
 ('cat3', 'feature2'): [('cat3', 'feature2')]}

How could I do this?

Solution

In [20]: df.reset_index(level='subfeatures').groupby(level=['categories','features']).groups
Out[20]: 
{('cat1', 'feature1'): [('cat1', 'feature1'), ('cat1', 'feature1')],
 ('cat1', 'feature2'): [('cat1', 'feature2'), ('cat1', 'feature2')],
 ('cat2', 'feature1'): [('cat2', 'feature1'), ('cat2', 'feature1')],
 ('cat2', 'feature2'): [('cat2', 'feature2'), ('cat2', 'feature2')],
 ('cat3', 'feature1'): [('cat3', 'feature1'), ('cat3', 'feature1')],
 ('cat3', 'feature2'): [('cat3', 'feature2'), ('cat3', 'feature2')]}
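
For the threshold selection described in the question, one possible sketch is shown below. It is not part of the original answer. It assumes that "maximum difference" means the largest absolute element-wise difference between the subfeature1 row and the subfeature2 row of each group, and THRESHOLD = 1.0 and the helper name exceeds_threshold are made up for illustration. It also checks that the group keys are the same with or without reset_index: grouping on the two outer levels already puts both subfeature rows in one group, and reset_index only changes the row labels listed inside .groups.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
results = results.reshape((-1, 2))
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(results, index=index)

LEVELS = ["categories", "features"]
THRESHOLD = 1.0  # hypothetical cutoff, not taken from the question

# The group keys are identical with or without the reset_index step.
keys_plain = set(df.groupby(level=LEVELS).groups)
keys_reset = set(
    df.reset_index(level="subfeatures").groupby(level=LEVELS).groups
)


def exceeds_threshold(group):
    """Return True if the two rows of a group differ by more than THRESHOLD."""
    # Each group holds exactly two rows: subfeature1, then subfeature2.
    diff = (group.iloc[0] - group.iloc[1]).abs()
    # max() skips NaN; an all-NaN difference compares False, so the group is dropped.
    return bool(diff.max() > THRESHOLD)


# filter() keeps every row of the groups for which the function returns True.
selected = df.groupby(level=LEVELS).filter(exceeds_threshold)
print(selected)
```

Because filter returns the original rows, selected keeps the full three-level index, so subfeature1 and subfeature2 stay paired in the result.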
