Python: Pandas Series - Why use loc?


Question

Why do we use loc for pandas DataFrames? It seems that the following code, with or without loc, runs at a similar speed:

%timeit df_user1 = df.loc[df.user_id=='5561']

100 loops, best of 3: 11.9 ms per loop

%timeit df_user1_noloc = df[df.user_id=='5561']

100 loops, best of 3: 12 ms per loop

So why use loc?

This has been flagged as a duplicate question. However, although "pandas iloc vs ix vs loc explanation?" does mention that

    you can do column retrieval just by using the data frame's getitem:

df['time']    # equivalent to df.loc[:, 'time']

it does not say why we use loc. It does explain many features of loc, but my specific question is "why not just omit loc altogether?", for which I have accepted a very detailed answer below.

Also, the answer in that other post (which I do not think is an answer) is buried in the discussion. Anyone searching for what I was looking for would find it hard to locate there, and would be much better served by the answer provided to my question.

Recommended answer

    • Explicit is better than implicit.

      df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case where you might not want it to: when df has boolean-valued column labels:

      In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
      Out[229]: 
         False  True 
      0      3      1
      1      4      2
      2      5      3
      

      You might want to use df[[True]] to select the True column. Instead, it raises a ValueError:

      In [230]: df[[True]]
      ValueError: Item wrong length 1 instead of 3.
      

      Compare this with using loc:

      In [231]: df.loc[[True]]
      Out[231]: 
         False  True 
      0      3      1
      

      In contrast, the following does not raise a ValueError, even though the structure of df2 is almost the same as that of df above:

      In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
      Out[258]: 
         A  B
      0  1  3
      1  2  4
      2  3  5
      
      In [259]: df2[['B']]
      Out[259]: 
         B
      0  3
      1  4
      2  5
      

      Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask], because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise a ValueError) without knowing details about indexer and df.
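
The ambiguity described above can be reproduced with a short script. This is a minimal sketch that reuses df2 from the answer; the variable names (mask, rows_getitem, cols, and so on) are illustrative assumptions and not part of the original post.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})
mask = df2['A'] > 1  # boolean Series aligned with df2's index

# With a full-length boolean mask, both spellings select the same rows.
rows_getitem = df2[mask]
rows_loc = df2.loc[mask]
print(rows_getitem.equals(rows_loc))

# The same square brackets change meaning with the contents of the list:
cols = df2[['B']]                # list of labels   -> selects a column
rows = df2[[True, False, True]]  # list of booleans -> selects rows
print(cols.shape, rows.shape)

# A boolean list of the wrong length is still read as a row mask,
# so df[...] raises ValueError instead of looking for a column.
try:
    df2[[True]]
    wrong_length_error = None
except ValueError as exc:
    wrong_length_error = exc
print(type(wrong_length_error).__name__)
```

Reading df2.loc[mask] tells you rows are being selected without inspecting mask, whereas df2[...] requires knowing what the list contains.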

      df.loc[row_indexer, column_index] can select rows and columns at the same time. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).

      In [237]: df2.loc[[True,False,True], 'B']
      Out[237]: 
      0    3
      2    5
      Name: B, dtype: int64
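
As a hedged illustration of the two-axis form, the sketch below selects a row/column block with a single df.loc call and then assigns through the same indexer. The assignment step is an addition that does not appear in the original answer, and the names mask and subset are assumptions made for the example.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})
mask = [True, False, True]

# Rows 0 and 2, column B, returned as a one-column DataFrame.
subset = df2.loc[mask, ['B']]
print(subset)

# The same (rows, columns) indexer can be used as an assignment target.
df2.loc[mask, 'B'] = 0
print(df2)
```

Because the rows and the column are named in one indexing operation, the assignment modifies df2 itself.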
      

    • When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

      In [239]: df2.loc[1:2]
      Out[239]: 
         A  B
      1  2  4
      2  3  5
      
      In [271]: df2[1:2]
      Out[271]: 
         A  B
      1  2  4
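
The slicing difference can be checked directly. This is a small sketch assuming df2 has the default integer index, as in the answer; the comparison with df2.iloc is an addition for illustration and is not part of the original post.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

by_label = df2.loc[1:2]      # label-based: both end-points are included
by_position = df2[1:2]       # positional: half-open, like a Python list slice
same_as_iloc = df2.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
print(by_position.equals(same_as_iloc))
```

With this default index, df2[1:2] behaves like df2.iloc[1:2], which is one more reason to spell out loc or iloc so the intended kind of slicing is visible.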
      
