在 Pandas MultiIndex DataFrame 中选择行 [英] Select rows in pandas MultiIndex DataFrame

查看:47
本文介绍了在 Pandas MultiIndex DataFrame 中选择行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

选择/过滤索引为 MultiIndex 的数据帧?

  • 基于单个值/标签进行切片
  • 基于一个或多个级别的多个标签进行切片
  • 过滤布尔条件和表达式
  • 哪些方法适用于哪些情况

简单假设:

  1. 输入数据框没有重复的索引键
  2. 下面的输入数据框只有两个级别.(此处显示的大多数解决方案都适用于 N 级)

<小时>

示例输入:

<块引用>

mux = pd.MultiIndex.from_arrays([列表('aaaabbbbbbccdddd'),列表('tuvwtuvwtuvwtuvw')], 名字=['一个', '两个'])df = pd.DataFrame({'col': np.arange(len(mux))}, mux)上校一二0你 123bt 45678九10d w 1112131415

问题 1:选择单个项目

如何选择级别一"中有a"的行?

 列一二0你 123

此外,我如何才能在输出中降低一"级?

 列二0你 123

问题 1b
如何在级别二"上切片所有值为t"的行?

 列一二0bt 48d t 12

问题 2:在一个关卡中选择多个值

如何在级别一"中选择与项目b"和d"对应的行?

 列一二bt 45678d w 1112131415

问题 2b
我如何获得二"级中t"和w"对应的所有值?

 列一二03bt 478d w 111215

问题 3:切割单个横截面 (x, y)

如何从 df 检索横截面,即具有特定索引值的单行?具体来说,我如何检索 ('c', 'u') 的横截面,由

给出

 列一二九

问题 4:切片多个横截面 [(a, b), (c, d), ...]

如何选择('c', 'u')('a', 'w')对应的两行?

 列一二九w 3

问题 5:每层切割一个项目

如何检索与级别一"中的a"或级别二"中的t"对应的所有行?

 列一二0你 123bt 48d t 12

问题 6:任意切片

如何切片特定的横截面?对于a"和b",我想选择子级别为u"和v"的所有行,对于d",我想选择子级别为w"的行.

 列一二你 1256d w 1115

<块引用>

问题 7 将使用由数字级别组成的独特设置:

np.random.seed(0)mux2 = pd.MultiIndex.from_arrays([列表('aaaabbbbbbccdddd'),np.random.choice(10, size=16)], 名字=['一个', '两个'])df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)上校一二一个 5 00 13 23 37 49 53 65 72 84 97 10d 6 118 128 131 146 15

问题 7:在多指标的各个级别上按数值不等式进行过滤

如何获取二"级值大于 5 的所有行?

 列一二7 49 57 10d 6 118 128 136 15

<小时>

注意:这篇文章将不会介绍如何创建多索引、如何对它们执行赋值操作或任何与性能相关的讨论(这些将在其他时间单独讨论).

解决方案

多索引/高级索引

<块引用>

注意
这篇文章的结构如下:

  1. OP 中提出的问题将一一解决
  2. 对于每个问题,将展示一种或多种适用于解决此问题并获得预期结果的方法.

注意(很像这个),供有兴趣了解附加功能、实现细节、和其他关于手头主题的粗略信息.这些笔记已通过搜索文档和发现各种晦涩的功能,以及我自己(公认有限)的经验.

所有代码示例均已在 pandas v0.23.4、python3.7 上创建和测试.如果有些事情不清楚,或者事实上不正确,或者如果你没有找到适用于您的用例的解决方案,请随时建议编辑,在评论中要求澄清,或打开一个新的问题,....如适用.

这里介绍一些我们会经常重温的常用习语(以下简称四大习语")

  1. DataFrame.loc - 按标签选择的通用解决方案 (+ pd.IndexSlice 用于涉及切片的更复杂的应用程序)

  2. DataFrame.xs - 从 Series/DataFrame 中提取特定的横截面.

  3. DataFrame.query - 动态指定切片和/或过滤操作(即作为动态评估的表达式.比其他场景更适用于某些场景.另见 文档的这一部分用于在 MultiIndexes 上进行查询.

  4. 使用MultiIndex.get_level_values(通常与 Index.isin,尤其是在过滤多个值时).这在某些情况下也非常有用.

根据四种习语来查看各种切片和过滤问题,以更好地了解可以应用于给定情况的内容,这将是有益的.了解并非所有习语在每种情况下都同样有效(如果有的话),了解这一点非常重要.如果某个习语没有被列为以下问题的潜在解决方案,则意味着该习语不能有效地应用于该问题.


<块引用>

问题 1

如何选择带有a"的行?在一级"?

 列一二0你 123

您可以使用 loc,作为适用于大多数情况的通用解决方案:

df.loc[['a']]

此时,如果你得到

TypeError: 预期的元组,得到 str

这意味着您使用的是旧版本的 Pandas.考虑升级!否则,使用 df.loc[('a', slice(None)), :].

或者,您可以在此处使用 xs,因为我们正在提取单个横截面.注意 levelsaxis 参数(这里可以假设合理的默认值).

df.xs('a', level=0,axis=0, drop_level=False)# df.xs('a', drop_level=False)

这里,需要 drop_level=False 参数来防止 xs 丢弃级别一".在结果中(我们切片的级别).

这里的另一个选择是使用 query:

df.query("one == 'a'")

如果索引没有名称,您需要将查询字符串更改为 "ilevel_0 == 'a'".

最后,使用get_level_values:

df[df.index.get_level_values('one') == 'a']# 如果您的级别未命名,或者您需要按位置(不是标签)选择,# df[df.index.get_level_values(0) == 'a']

<块引用>

此外,我如何才能降低一"级?在输出中?

 列二0你 123

这可以轻松使用任何一种

来完成

df.loc['a'] # 注意单个字符串参数而不是列表.

或者,

df.xs('a', level=0,axis=0, drop_level=True)#df.xs('a')

请注意,我们可以省略 drop_level 参数(默认情况下假定为 True).

<块引用>

注意
您可能会注意到过滤后的 DataFrame 可能仍然具有所有级别,即使它们在打印 DataFrame 时没有显示.例如,

v = df.loc[['a']]打印(五)上校一二0你 123打印(v.index)MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],标签=[[0, 0, 0, 0], [0, 1, 2, 3]],名称=['一个','两个'])

您可以使用 MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()

<块引用>

print(v.index)MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],标签=[[0, 0, 0, 0], [0, 1, 2, 3]],名称=['一个','两个'])


<块引用>

问题 1b

如何切片所有值为t"的行?在二"层?

 列一二0bt 48d t 12

直觉上,你会想要一些涉及slice():

df.loc[(slice(None), 't'), :]

它确实有效!™但它很笨重.我们可以在此处使用 pd.IndexSlice API 促进更自然的切片语法.

idx = pd.IndexSlicedf.loc[idx[:, 't'], :]

这干净多了.

<块引用>

注意
为什么需要跨列的尾随切片 :?这是因为,loc 可用于沿两个轴(axis=0axis=1).没有明确说明切片的轴是要完成的,操作就变得模棱两可了.请参阅关于切片的文档中的大红框.

如果你想消除任何歧义,loc 接受一个 axis参数:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

如果没有 axis 参数(即,仅通过执行 df.loc[pd.IndexSlice[:, 't']]),则假定切片处于开启状态列,并且在这种情况下会引发 KeyError.

这在 切片器 中有记录.但是,出于本文的目的,我们将明确指定所有轴.

xs,就是

df.xs('t', axis=0, level=1, drop_level=False)

使用query,就是

df.query("two == 't'")# 或者,如果第一层没有名字,# df.query("ilevel_1 == 't'")

最后,使用get_level_values,你可以做到

df[df.index.get_level_values('two') == 't']# 或者,要按位置/整数执行选择,# df[df.index.get_level_values(1) == 't']

效果都一样.


<块引用>

问题 2

如何选择与项目b"相对应的行?和d"在一级"?

 列一二bt 45678d w 1112131415

使用 loc,这是通过指定一个列表以类似的方式完成的.

df.loc[['b', 'd']]

解决上述选择b"的问题和d",你也可以使用query:

items = ['b', 'd']df.query(@items 中的一个")# df.query("one == @items", parser='pandas')# df.query("['b', 'd'] 中的一个")# df.query("one == ['b', 'd']", parser='pandas')

<块引用>

注意
是的,默认解析器是 'pandas',但重要的是要强调这个语法不是传统的 python.这Pandas 解析器生成的解析树与表达.这样做是为了让一些操作更直观指定.欲了解更多信息,请阅读我的帖子使用 pd.eval() 在 Pandas 中进行动态表达式评估.

并且,使用 get_level_values + Index.isin:

df[df.index.get_level_values(one").isin(['b', 'd'])]


<块引用>

问题 2b

我如何获得对应于t"的所有值?和w"在二"级?

 列一二03bt 478d w 111215

使用 loc,这可以与 pd.IndexSlice 结合使用.

df.loc[pd.IndexSlice[:, ['t', 'w']], :]

pd.IndexSlice[:, ['t', 'w']] 中的第一个冒号 : 表示跨第一层切片.随着被查询级别的深度增加,您将需要指定更多切片,每个切片一个.但是,您无需指定被切片之外的更多级别.

使用query,这是

items = ['t', 'w']df.query(@items 中有两个")# df.query("two == @items", parser='pandas')# df.query("['t', 'w']中的两个")# df.query("two == ['t', 'w']", parser='pandas')

使用 get_level_valuesIndex.isin(类似上面):

df[df.index.get_level_values('two').isin(['t', 'w'])]


<块引用>

问题 3

如何检索横截面,即具有特定值的单行对于 df 中的索引?具体来说,我如何检索十字架('c', 'u') 的部分,由

给出

 列一二九

通过指定键元组使用 loc:

df.loc[('c', 'u'), :]

或者,

df.loc[pd.IndexSlice[('c', 'u')]]

<块引用>

注意
此时,您可能会遇到 PerformanceWarning 看起来像这样:

PerformanceWarning:索引超过 lexsort 深度可能会影响性能.

这只是意味着您的索引没有排序.pandas 依赖于被排序的索引(在这种情况下,按字典顺序排列,因为我们正在处理字符串值)以获得最佳搜索和检索.一个快速的解决方法是对你的DataFrame 提前使用 DataFrame.sort_index.如果您打算这样做,从性能的角度来看,这是特别可取的多个这样的查询串联:

df_sort = df.sort_index()df_sort.loc[('c', 'u')]

您也可以使用MultiIndex.is_lexsorted() 检查索引是否是否排序.此函数相应地返回 TrueFalse.您可以调用此函数来确定是否附加排序是否需要步骤.

使用 xs,这再次简单地传递一个元组作为第一个参数,所有其他参数设置为其适当的默认值:

df.xs(('c', 'u'))

使用query,事情变得有点笨拙:

df.query("one == 'c' and two == 'u'")

您现在可以看到,这将相对难以概括.但是对于这个特殊问题还是可以的.

对于跨越多个级别的访问,仍然可以使用 get_level_values,但不建议使用:

m1 = (df.index.get_level_values('one') == 'c')m2 = (df.index.get_level_values('two') == 'u')df[m1 &平方米]


<块引用>

问题 4

如何选择('c', 'u')('a', 'w')对应的两行?

 列一二九w 3

使用loc,这仍然很简单:

df.loc[[('c', 'u'), ('a', 'w')]]# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

使用query,您将需要通过迭代您的横截面和级别来动态生成查询字符串:

cses = [('c', 'u'), ('a', 'w')]级别 = ['一','二']# 这是一个有用的提前检查.assert all(len(levels) == len(cs) for cs in cses)查询 = '(' + ') 或 ('.join([' 和 '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)])在 cses 中为 cs]) + ')'打印(查询)# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))df.query(查询)

100% 不推荐!但这是可能的.

如果我有多个级别怎么办?
这种情况下的一种选择是使用 droplevel 删除您没有检查的级别,然后使用 isin 测试成员资格,然后对最终结果进行布尔索引.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]


<块引用>

问题 5

如何检索与a"对应的所有行?在一级"或者t"在二"级?

 列一二0你 123bt 48d t 12

这实际上很难用 loc 做到,同时确保正确性并且仍然保持代码清晰.df.loc[pd.IndexSlice['a', 't']] 不正确,解释为 df.loc[pd.IndexSlice[('a', 't')]](即选择横截面).您可能会想到使用 pd.concat 来分别处理每个标签的解决方案:

pd.concat([df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]])上校一二0你 123t 0 # 这看起来对吗?不,不是!bt 48d t 12

但是您会注意到其中一行是重复的.这是因为该行满足两个切片条件,因此出现了两次.你将需要做

v = pd.concat([df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]])v[~v.index.duplicated()]

但是如果您的 DataFrame 固有地包含重复的索引(您想要的),那么这将不会保留它们.谨慎使用.

使用query,这非常简单:

df.query("one == 'a' or two == 't'")

使用get_level_values,这仍然很简单,但没有那么优雅:

m1 = (df.index.get_level_values('one') == 'a')m2 = (df.index.get_level_values('two') == 't')df[m1 |平方米]


<块引用>

问题 6

如何切片特定的横截面?对于a"和b",我想选择所有带有子级别u"的行;和v",和对于d",我想选择子级别为w"的行.

 列一二你 1256d w 1115

这是我添加的一个特殊情况,以帮助理解四种习语的适用性——这是一种它们都不会有效工作的情况,因为切片非常,并且不遵循任何实际模式.

通常,像这样的切片问题需要显式地将键列表传递给 loc.这样做的一种方法是:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]df.loc[keys, :]

如果您想节省一些输入,您会发现有一种模式可以将a"、b"切片和它的子级别,所以我们可以将切片任务分成两部分并concat结果:

pd.concat([df.loc[(('a', 'b'), ('u', 'v')), :],df.loc[('d', 'w'), :]],轴=0)

a"的切片规范和b"稍微干净一点 (('a', 'b'), ('u', 'v')) 因为被索引的相同子级别对于每个级别都是相同的.


<块引用>

问题 7

如何获取级别二"中的值的所有行?大于 5?

 列一二7 49 57 10d 6 118 128 136 15

这可以使用query来完成,

df2.query(二 > 5")

还有get_level_values.

df2[df2.index.get_level_values('two') >5]

<块引用>

注意
与此示例类似,我们可以使用这些构造基于任意条件进行过滤.通常,记住 locxs 专门用于基于标签的索引是很有用的,而 queryget_level_values 有助于构建通用条件掩码用于过滤.


<块引用>

奖励问题

如果我需要对 MultiIndex 进行切片怎么办?

实际上,这里的大多数解决方案也适用于列,只是稍作改动.考虑:

np.random.seed(0)mux3 = pd.MultiIndex.from_product([列表('ABCD'),列表('efgh')], 名字=['一个','两个'])df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3)打印(df3)一个 A B C D两个 e fg h e f g h e f g h e f g h0 5 0 3 3 7 9 3 5 2 4 7 6 8 8 1 61 7 7 8 1 5 9 8 9 4 3 0 3 5 0 2 32 8 1 3 3 3 7 0 1 9 9 0 4 7 3 2 7

这些是您需要对四个习语进行以下更改才能使它们与列一起使用.

  1. 要使用 loc 进行切片,请使用

     df3.loc[:, ....] # 注意我们如何用`:` 对索引进行切片.

    或者,

     df3.loc[:, pd.IndexSlice[...]]

  2. 要适当地使用 xs,只需传递一个参数 axis=1.

  3. 您可以使用 df.columns.get_level_values 直接访问列级值.然后你需要做类似的事情

     df.loc[:, {condition}]

    其中 {condition} 表示一些使用 columns.get_level_values 构建的条件.

  4. 要使用query,您唯一的选择是转置,查询索引,然后再次转置:

     df3.T.query(...).T

    不推荐,使用其他 3 个选项之一.

What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?

  • Slicing based on a single value/label
  • Slicing based on multiple labels from one or more levels
  • Filtering on boolean conditions and expressions
  • Which methods are applicable in what circumstances

Assumptions for simplicity:

  1. input dataframe does not have duplicate index keys
  2. input dataframe below only has two levels. (Most solutions shown here generalize to N levels)


Example input:

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 1: Selecting a Single Item

How do I select rows having "a" in level "one"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

Additionally, how would I be able to drop level "one" in the output?

     col
two     
t      0
u      1
v      2
w      3

Question 1b
How do I slice all rows with value "t" on level "two"?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Question 2: Selecting Multiple Values in a Level

How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

Question 3: Slicing a Single Cross Section (x, y)

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

Question 5: One Item Sliced per Level

How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

Question 6: Arbitrary Slicing

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

Question 7 will use a unique setup consisting of a numeric level:

np.random.seed(0)
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    np.random.choice(10, size=16)
], names=['one', 'two'])

df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2)

         col
one two     
a   5      0
    0      1
    3      2
    3      3
b   7      4
    9      5
    3      6
    5      7
    2      8
c   4      9
    7     10
d   6     11
    8     12
    8     13
    1     14
    6     15

Question 7: Filtering by numeric inequality on individual levels of the multiindex

How do I get all rows where values in level "two" are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15


Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).

解决方案

MultiIndex / Advanced Indexing

Note
This post will be structured in the following manner:

  1. The questions put forth in the OP will be addressed, one by one
  2. For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.

All code samples have created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, ....as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting

  1. DataFrame.loc - A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)

  2. DataFrame.xs - Extract a particular cross section from a Series/DataFrame.

  3. DataFrame.query - Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically. Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.

  4. Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.

It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.


Question 1

How do I select rows having "a" in level "one"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]

At this point, if you get

TypeError: Expected tuple, got str

That means you're using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).

df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)

Here, the drop_level=False argument is needed to prevent xs from dropping level "one" in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")

If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:

df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']

Additionally, how would I be able to drop level "one" in the output?

     col
two     
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')

Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two     
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()

print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])


Question 1b

How do I slice all rows with value "t" on level "two"?

         col
one two     
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]

This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because, loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.

If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.

This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)

With query, it is

df.query("two == 't'")
# Or, if the first level has no name, 
# df.query("ilevel_1 == 't'") 

And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']

All to the same effect.


Question 2

How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two     
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list.

df.loc[['b', 'd']]

To solve the above problem of selecting "b" and "d", you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')

Note
Yes, the default parser is 'pandas', but it is important to highlight this syntax isn't conventionally python. The Pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]


Question 2b

How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two     
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjuction with pd.IndexSlice.

df.loc[pd.IndexSlice[:, ['t', 'w']], :] 

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.

With query, this is

items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas') 
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')

With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]


Question 3

How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two     
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))

With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]


Question 4

How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two     
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]

With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']
# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses) 

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) 
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)

100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you're not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]


Question 5

How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two     
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect, it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])

         col
one two     
a   t      0
    u      1
    v      2
    w      3
    t      0   # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12

But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
        df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")

With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2] 


Question 6

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

         col
one two     
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15

This is a special case that I've added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.

Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and its sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
     df.loc[(('a', 'b'), ('u', 'v')), :], 
     df.loc[('d', 'w'), :]
   ], axis=0)

Slicing specification for "a" and "b" is slightly cleaner (('a', 'b'), ('u', 'v')) because the same sub-levels being indexed are the same for each level.


Question 7

How do I get all rows where values in level "two" are greater than 5?

         col
one two     
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values.

df2[df2.index.get_level_values('two') > 5]

Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.


Bonus Question

What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
        list('ABCD'), list('efgh')
], names=['one','two'])

df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3)
print(df3)

one  A           B           C           D         
two  e  f  g  h  e  f  g  h  e  f  g  h  e  f  g  h
0    5  0  3  3  7  9  3  5  2  4  7  6  8  8  1  6
1    7  7  8  1  5  9  8  9  4  3  0  3  5  0  2  3
2    8  1  3  3  3  7  0  1  9  9  0  4  7  3  2  7

These are the following changes you will need to make to the Four Idioms to have them working with columns.

  1. To slice with loc, use

     df3.loc[:, ....] # Notice how we slice across the index with `:`. 
    

    or,

     df3.loc[:, pd.IndexSlice[...]]
    

  2. To use xs as appropriate, just pass an argument axis=1.

  3. You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

     df.loc[:, {condition}] 
    

    Where {condition} represents some condition built using columns.get_level_values.

  4. To use query, your only option is to transpose, query on the index, and transpose again:

     df3.T.query(...).T
    

    Not recommended, use one of the other 3 options.

这篇关于在 Pandas MultiIndex DataFrame 中选择行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆