Python: Pandas Series - Why use loc?
Question
Why do we use loc for pandas DataFrames? It seems the following code, with or without loc, runs at a similar speed:
%timeit df_user1 = df.loc[df.user_id=='5561']
100 loops, best of 3: 11.9 ms per loop
or
%timeit df_user1_noloc = df[df.user_id=='5561']
100 loops, best of 3: 12 ms per loop
So why use loc?
Edit: This question has been flagged as a duplicate. But although "pandas iloc vs ix vs loc explanation?" does mention that

you can do column retrieval just by using the data frame's __getitem__:
df['time'] # equivalent to df.loc[:, 'time']
it does not say why we use loc. It explains many of loc's features, but my specific question is "why not just omit loc altogether?", for which I have accepted a very detailed answer below.

Also, in that other post the answer (which I do not think is an answer) is buried in the discussion. Anyone searching for what I was looking for would find it hard to locate, and would be much better served by the answer provided to my question.
Recommended answer
-
Explicit is better than implicit.

df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:

In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
   False  True
0      3      1
1      4      2
2      5      3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:

In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Compare this with using loc:

In [231]: df.loc[[True]]
Out[231]:
   False  True
0      3      1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as that of df above:

In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
   A  B
0  1  3
1  2  4
2  3  5

In [259]: df2[['B']]
Out[259]:
   B
0  3
1  4
2  5
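The two cases above can be put side by side in a short script. This is only a sketch: the variable names raised_value_error and selected are introduced here for illustration, and the behaviour shown is that of df[...] with a list of booleans versus a list of strings.

```python
import pandas as pd

# Frame whose column labels are the booleans True and False (the corner case above).
df = pd.DataFrame({True: [1, 2, 3], False: [3, 4, 5]})

# A list of booleans passed to df[...] is treated as a row mask, not as a list of
# column labels, so a one-element list does not match the three rows.
try:
    df[[True]]
    raised_value_error = False
except ValueError as exc:
    raised_value_error = True
    print("ValueError:", exc)

# With string labels the same syntax is read as a list of column labels.
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})
selected = df2[['B']]
print(selected)
```

The same bracket syntax therefore means "filter rows" in one call and "pick columns" in the other, depending only on the values inside the list.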
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
-
df.loc[row_indexer, column_index] can select rows and columns. df[indexer] can select only rows or only columns, depending on the type of values in indexer and the type of column values df has (again, are they boolean?).

In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0    3
2    5
Name: B, dtype: int64
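As a rough illustration of both points, the sketch below checks that the two spellings agree on an ordinary frame and then uses loc to pick rows and a column in one step. The condition df2['A'] > 1 and the names mask, same_rows and b_values are assumptions made for this example, not part of the original answer.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# Boolean Series aligned with df2's index (hypothetical condition for illustration).
mask = df2['A'] > 1

# On a frame with ordinary string column labels, both spellings return the same rows.
same_rows = df2[mask].equals(df2.loc[mask])
print(same_rows)

# Only loc takes a row indexer and a column indexer in a single call.
b_values = df2.loc[mask, 'B']
print(b_values)
```

Doing the same with plain brackets needs two chained lookups, such as df2[mask]['B'], which is one more reason the loc form tends to read more clearly.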
-
When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:

In [239]: df2.loc[1:2]
Out[239]:
   A  B
1  2  4
2  3  5

In [271]: df2[1:2]
Out[271]:
   A  B
1  2  4
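The slicing difference can be checked directly. The sketch below reuses df2 from above and adds df2.iloc[1:2] purely as a point of comparison; the names by_label, by_getitem and by_position are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

by_label = df2.loc[1:2]      # label slice: both end-points are included
by_getitem = df2[1:2]        # integer slice on df[...]: the stop is excluded
by_position = df2.iloc[1:2]  # explicit positional slice, for comparison

print(list(by_label.index))
print(list(by_getitem.index))
print(by_getitem.equals(by_position))
```

With the default integer index used here, df2[1:2] behaves like the positional df2.iloc[1:2], while df2.loc[1:2] treats 1 and 2 as labels and keeps both rows.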