Multiindexed Pandas groupby, ignore a level?
Problem description
I'm running a groupby operation on a multiindexed DataFrame similar to this one:
                                        0         1  ...
categories features subfeatures
cat1       feature1 subfeature1 -0.224487 -0.227524
                    subfeature2 -0.591399 -0.799228
           feature2 subfeature1  1.190110 -1.365895  ...
                    subfeature2  0.720956 -1.325562
cat2       feature1 subfeature1  1.856932       NaN
                    subfeature2 -1.354258 -0.740473
           feature2 subfeature1  0.234075 -1.362235  ...
                    subfeature2  0.013875  1.309564
cat3       feature1 subfeature1       NaN       NaN
                    subfeature2 -1.260408  1.559721  ...
           feature2 subfeature1  0.419246  0.084386
                    subfeature2  0.969270  1.493417
...                                   ...       ...
And it can be generated using the following code:
import pandas as pd, numpy as np
np.random.seed(seed=90)
results = np.random.randn(3,2,2,2)
results[2,0,0,:] = np.nan
results[1,0,0,1] = np.nan
results = results.reshape((-1,2))
index = pd.MultiIndex.from_product([["cat1", "cat2", "cat3"],
                                    ["feature1", "feature2"],
                                    ["subfeature1", "subfeature2"]],
                                   names=["categories", "features", "subfeatures"])
df = pd.DataFrame(results, index=index)
I am trying to select only the groups in which the maximum difference between the two subfeature arrays is greater than a certain threshold, but I'm having trouble with groupby:
df.groupby(level=['categories','features']).groups
This gives me the following groups:
{('cat1', 'feature1'): [('cat1', 'feature1', 'subfeature1'),
('cat1', 'feature1', 'subfeature2')],
('cat1', 'feature2'): [('cat1', 'feature2', 'subfeature1'),
('cat1', 'feature2', 'subfeature2')],
('cat2', 'feature1'): [('cat2', 'feature1', 'subfeature1'),
('cat2', 'feature1', 'subfeature2')],
('cat2', 'feature2'): [('cat2', 'feature2', 'subfeature1'),
('cat2', 'feature2', 'subfeature2')],
('cat3', 'feature1'): [('cat3', 'feature1', 'subfeature1'),
('cat3', 'feature1', 'subfeature2')],
('cat3', 'feature2'): [('cat3', 'feature2', 'subfeature1'),
('cat3', 'feature2', 'subfeature2')]}
Is there any way to group so that the subfeatures level is ignored by the groupby function? The reason is that I need both subfeature1 and subfeature2 together; in separate groups they're worthless.
So ideally I would want the groupby to return something like this:
{('cat1', 'feature1'): [('cat1', 'feature1')],
('cat1', 'feature2'): [('cat1', 'feature2')],
('cat2', 'feature1'): [('cat2', 'feature1')],
('cat2', 'feature2'): [('cat2', 'feature2')],
('cat3', 'feature1'): [('cat3', 'feature1')],
('cat3', 'feature2'): [('cat3', 'feature2')]}
How could I do this?
In [20]: df.reset_index(level='subfeatures').groupby(level=['categories','features']).groups
Out[20]:
{('cat1', 'feature1'): [('cat1', 'feature1'), ('cat1', 'feature1')],
('cat1', 'feature2'): [('cat1', 'feature2'), ('cat1', 'feature2')],
('cat2', 'feature1'): [('cat2', 'feature1'), ('cat2', 'feature1')],
('cat2', 'feature2'): [('cat2', 'feature2'), ('cat2', 'feature2')],
('cat3', 'feature1'): [('cat3', 'feature1'), ('cat3', 'feature1')],
('cat3', 'feature2'): [('cat3', 'feature2'), ('cat3', 'feature2')]}
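For the stated end goal (keeping only the groups whose two subfeature rows differ by more than a threshold), a minimal sketch follows. It relies on the fact that grouping by the two outer levels already puts both subfeature rows into the same group; .groups only lists the row labels that belong to each group. The names THRESHOLD and max_abs_diff are hypothetical and not from the original post, the threshold value is arbitrary, and the sketch assumes every (categories, features) pair has exactly two rows, subfeature1 then subfeature2.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(seed=90)
results = np.random.randn(3, 2, 2, 2)
results[2, 0, 0, :] = np.nan
results[1, 0, 0, 1] = np.nan
results = results.reshape((-1, 2))
index = pd.MultiIndex.from_product(
    [["cat1", "cat2", "cat3"],
     ["feature1", "feature2"],
     ["subfeature1", "subfeature2"]],
    names=["categories", "features", "subfeatures"],
)
df = pd.DataFrame(results, index=index)

THRESHOLD = 1.0  # hypothetical value, chosen only for illustration


def max_abs_diff(group):
    """Largest absolute column-wise difference between the two rows of a group.

    Assumes the group holds exactly two rows (subfeature1, subfeature2).
    NaN differences are skipped; the result is NaN if every difference is NaN.
    """
    diff = group.iloc[0] - group.iloc[1]
    return diff.abs().max()


# Each group holds both subfeature rows of one (categories, features) pair.
grouped = df.groupby(level=["categories", "features"])

# Keep whole groups whose maximum difference exceeds the threshold.
# A NaN result compares as False, so all-NaN groups are dropped.
selected = grouped.filter(lambda g: bool(max_abs_diff(g) > THRESHOLD))
print(selected)
```

Because filter returns the original rows of the surviving groups, selected keeps the full three-level index, so both subfeature rows of each retained pair stay together.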