Best way to subset a pandas dataframe
Question
Hey, I'm new to pandas and I just came across df.query().
Why would people use df.query() when you can filter your DataFrames directly using bracket notation? The official pandas tutorial also seems to prefer the latter approach.
With bracket notation:
df[df['age'] <= 21]
With the pandas query method:
df.query('age <= 21')
Besides some of the stylistic or flexibility differences that have been mentioned, is one canonically preferred, namely for performance of operations on large dataframes?
Answer
Consider the following sample DF:
In [307]: df
Out[307]:
  sex  age     name
0   M   40      Max
1   F   35     Anna
2   M   29      Joe
3   F   18    Maria
4   F   23  Natalie
There are quite a few good reasons to prefer the .query() method.
- it might be much shorter and cleaner compared to boolean indexing:
In [308]: df.query("20 <= age <= 30 and sex=='F'")
Out[308]:
  sex  age     name
4   F   23  Natalie
In [309]: df[(df['age']>=20) & (df['age']<=30) & (df['sex']=='F')]
Out[309]:
  sex  age     name
4   F   23  Natalie
- you can prepare conditions (queries) programmatically:
In [315]: conditions = {'name':'Joe', 'sex':'M'}
In [316]: q = ' and '.join(['{}=="{}"'.format(k,v) for k,v in conditions.items()])
In [317]: q
Out[317]: 'name=="Joe" and sex=="M"'
In [318]: df.query(q)
Out[318]:
  sex  age name
2   M   29  Joe
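The programmatic approach above wraps every value in double quotes, which only suits string columns. A minimal sketch of one possible variation follows, assuming the values are plain Python strings or numbers. The helper name build_query is hypothetical and not part of pandas. The sketch also shows the @ prefix, which lets a query string refer to a local variable.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "F"],
    "age": [40, 35, 29, 18, 23],
    "name": ["Max", "Anna", "Joe", "Maria", "Natalie"],
})

def build_query(conditions):
    """Hypothetical helper: join 'column == value' tests with 'and'."""
    # {!r} inserts repr(v): strings get quotes, plain numbers stay bare.
    return " and ".join("{} == {!r}".format(k, v) for k, v in conditions.items())

q = build_query({"sex": "F", "age": 23})
by_query = df.query(q)

# The same filter written as a boolean mask, for comparison.
by_mask = df[(df["sex"] == "F") & (df["age"] == 23)]

# Local variables can be referenced inside the query with the @ prefix.
max_age = 30
by_local = df.query("20 <= age <= @max_age and sex == 'F'")
```

Here by_query and by_mask select the same rows. Building query strings from untrusted input is still risky, so a sketch like this is best kept to values you control.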
PS there are also some disadvantages:
- we can't use the .query() method for columns containing spaces or columns that consist only of digits
- not all functions can be applied, or in some cases we have to use engine='python' instead of the default engine='numexpr' (which is faster)
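A short sketch of how these two limitations tend to show up in practice. It assumes a reasonably recent pandas (0.25 or later), where a column name containing spaces can be quoted with backticks inside the query string. The data frame and its "home town" values are made up for illustration.

```python
import pandas as pd

# Illustrative data only; the "home town" column is an assumption,
# chosen because its name contains a space.
people = pd.DataFrame({
    "name": ["Max", "Anna", "Joe"],
    "home town": ["Oslo", "Lima", "Oslo"],
})

# Backticks quote a column name that contains a space (pandas 0.25+).
from_oslo = people.query("`home town` == 'Oslo'")

# String accessor methods are not something numexpr evaluates,
# so the python engine is requested explicitly here.
starts_with_m = people.query("name.str.startswith('M')", engine="python")
```

The boolean-mask equivalents, people[people['home town'] == 'Oslo'] and people[people['name'].str.startswith('M')], need no special quoting or engine choice, which is one reason bracket notation remains the more general tool.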
NOTE: Jeff (one of the main pandas contributors and a member of the pandas core team) once said:
Note that in reality .query is just a nice-to-have interface, in fact it has very specific guarantees, meaning it's meant to parse like a query language, and not a fully general interface.