Selecting rows from pandas by subset of multiindex
Question
I have a multiindex dataframe in pandas, with 4 columns in the index, and some columns of data. An example is below:
import pandas as pd
import numpy as np
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))), columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sortlevel(inplace=True)
print(rdata)
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1
2  1  2  2    2   1
   2  1  2    1   1
         2    1   1

[8 rows x 2 columns]
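(Note for readers on newer pandas: sortlevel was deprecated and later removed. A sketch of the same setup on a modern install, with a seed added only so the example is reproducible; the original used no seed:)

```python
import numpy as np
import pandas as pd

cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
np.random.seed(0)  # added for reproducibility; not in the original
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))),
                     columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sort_index(inplace=True)  # modern replacement for sortlevel()
print(rdata)
```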
What I want to do is select the rows where there are exactly 2 values at the K3 level. Not 2 rows, but two distinct values. I've found how to generate a sort of mask for what I want:
filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)
print(mask)
K1  K2
1   1      True
    2      True
2   1     False
    2     False
dtype: bool
And I'd hoped that since rdata.loc[1, 2] allows you to match on just part of the index, it would be possible to do the same thing with a boolean vector like this. Unfortunately, rdata.loc[mask] fails with IndexingError: Unalignable boolean Series key provided.
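(An aside for reference: one way to make such a mask usable is to broadcast it to one entry per row by dropping the inner index levels, then index with the raw boolean array so pandas does not attempt alignment. This sketch assumes a reasonably recent pandas, where droplevel accepts a list of level names; the seed is added only for reproducibility.)

```python
import numpy as np
import pandas as pd

# Rebuild the example frame (seeded here only so the sketch is repeatable).
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
np.random.seed(0)
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))),
                     columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sort_index(inplace=True)

filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)

# Broadcast the (K1, K2)-indexed mask to one value per row of rdata,
# then index with .values to sidestep the alignment error.
row_mask = mask.reindex(rdata.index.droplevel(['K3', 'K4']))
selected = rdata[row_mask.values]
```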
This question seemed similar, but the answer given there doesn't work for anything other than the top level index, since index.get_level_values only works on a single level, not multiple ones.
Following the suggestion here I managed to get the selection I wanted with
rdata[[mask.loc[k1, k2] for k1, k2, k3, k4 in rdata.index]]
however, both getting the count of distinct values using len(set(index.get_level_values(...)))
and building the boolean vector afterwards by iterating over every row feels more like I'm fighting the framework to achieve something that seems like a simple task in a multiindex setup. Is there a better solution?
This is using pandas 0.13.1.
Answer
There might be something better, but you could at least bypass defining mask
by using groupby-filter:
rdata.groupby(level=cnames[:2]).filter(
    lambda grp: (grp.index.get_level_values('K3')
                 .unique().size) == 2)
Out[83]:
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1

[5 rows x 2 columns]
It is faster than my previous suggestions. It does really well for small DataFrames:
In [84]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
100 loops, best of 3: 3.84 ms per loop
In [76]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
100 loops, best of 3: 11.9 ms per loop
In [77]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
100 loops, best of 3: 13.4 ms per loop
and is still the fastest for large DataFrames, though not by as much:
In [78]: rdata2 = pd.concat([rdata]*100000)
In [85]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
1 loops, best of 3: 756 ms per loop
In [79]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
1 loops, best of 3: 772 ms per loop
In [80]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
1 loops, best of 3: 1 s per loop
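(On newer pandas a fully vectorized variant is also possible, avoiding a Python lambda per group: count the distinct K3 values per (K1, K2) group with transform('nunique') and select on the broadcast result. This is a sketch assuming a pandas version that accepts string aggregation names in transform; the seed is added only for reproducibility.)

```python
import numpy as np
import pandas as pd

# Same example frame, seeded only so the sketch is repeatable.
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
np.random.seed(0)
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))),
                     columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sort_index(inplace=True)

# Count distinct K3 values per (K1, K2) group, broadcast back to every row.
k3 = rdata.index.get_level_values('K3').to_series(index=rdata.index)
n_distinct = k3.groupby(level=['K1', 'K2']).transform('nunique')
result = rdata[(n_distinct == 2).values]
```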