Python: Pandas Series - Why use loc?
Question
Why do we use loc for pandas DataFrames? It seems the following code, with or without loc, runs at a similar speed:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
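As a quick check of the premise, the two timed expressions can be compared directly. This is a minimal sketch on a made-up frame (the user_id values and the amount column are hypothetical, not the asker's data); it suggests that, for a boolean Series mask, both spellings return the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data.
df = pd.DataFrame({
    "user_id": ["5561", "1234", "5561", "9999"],
    "amount": [10, 20, 30, 40],
})

mask = df.user_id == "5561"

with_loc = df.loc[mask]
without_loc = df[mask]

# Compare shape, labels, and values of the two results.
same_result = with_loc.equals(without_loc)
print(same_result)
```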
This has been flagged as a duplicate question. But although "pandas iloc vs ix vs loc explanation?" does mention that you can do column retrieval just by using the data frame's getitem:
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc. It does explain many features of loc, but my specific question is "why not just omit loc altogether?", for which I have accepted a very detailed answer below.
Also, in that other post the answer (which I do not think is an answer) is buried in the discussion. Anyone searching for what I was looking for would find it hard to locate the information and would be much better served by the answer provided to my question.
Recommended answer
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3     1
1      4     2
2      5     3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Compare this with using loc:
In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3     1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as that of df above:
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5
In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
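The corner case can also be reproduced in a script. This is a sketch using the same two frames as the answer; the names bool_cols, str_cols, error_message, and selected are introduced here for illustration only. It exercises only the plain-bracket spelling.

```python
import pandas as pd

bool_cols = pd.DataFrame({True: [1, 2, 3], False: [3, 4, 5]})
str_cols = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# A list of booleans is treated as a row mask, not as column labels,
# so a one-element list does not match the three rows.
error_message = None
try:
    bool_cols[[True]]
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# A list of strings is treated as column labels.
selected = str_cols[["B"]]
print(list(selected.columns))
```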
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear whether df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
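To make the ambiguity concrete, here is a small sketch reusing the df2 frame from the answer. The variable names are illustrative. The same bracket syntax ends up selecting a column or rows depending on what is passed in, while the first position of .loc refers to rows.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

# Same bracket syntax, three different meanings.
by_label = df2["A"]                  # a string picks a column (a Series)
by_mask = df2[[True, False, True]]   # a boolean list picks rows
by_slice = df2[0:2]                  # a slice picks rows by position

# With .loc the first position refers to rows.
rows_by_mask = df2.loc[[True, False, True]]
```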
df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can only select rows or columns, depending on the type of values in indexer and the type of column labels df has (again, are they boolean?).
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64
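For comparison, a sketch of the same selection written both ways, again on df2. The names one_step and two_steps are made up for this example. With plain brackets the row and column selections have to be chained, whereas .loc takes both in one call.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})
mask = [True, False, True]

# One .loc call: rows from the mask, column by label.
one_step = df2.loc[mask, "B"]

# Plain brackets need two chained selections for the same result.
two_steps = df2[mask]["B"]

print(one_step.tolist())
```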
When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5
In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
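The slice behavior can be checked with a few lines. This sketch reuses df2 with its default integer index. The .iloc line is an addition not present in the answer, included only to show that positional slicing follows the same half-open rule as plain brackets.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]})

inclusive = df2.loc[1:2]    # label slice: both end-points are included
half_open = df2[1:2]        # plain brackets: the stop value is excluded
positional = df2.iloc[1:2]  # .iloc is half-open as well

print(len(inclusive), len(half_open), len(positional))
```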