从 Pandas 中的公式动态计算表达式 [英] Dynamically evaluate an expression from a formula in Pandas

查看:15
本文介绍了从 Pandas 中的公式动态计算表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用 pd.eval 对一个或多个数据帧列执行算术运算.具体来说,我想移植以下计算公式的代码:

x = 5df2['D'] = df1['A'] + (df1['B'] * x)

...使用 pd.eval 编码.使用 pd.eval 的原因是我想自动化许多工作流,所以动态创建它们对我很有用.

我的两个输入数据帧是:

将pandas导入为pd将 numpy 导入为 npnp.random.seed(0)df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))df1A B C D0 5 0 3 31 7 9 3 52 2 4 7 63 8 8 1 64 7 7 8 1df2A B C D0 5 9 8 91 4 3 0 32 5 0 2 33 8 1 3 34 3 7 0 1

我试图更好地理解 pd.evalengineparser 参数,以确定如何最好地解决我的问题.我已经阅读了文档,但我并不清楚其中的区别.

  1. 应该使用哪些参数来确保我的代码以最高性能运行?
  2. 有没有办法将表达式的结果分配回 df2?
  3. 另外,为了让事情变得更复杂,我如何将 x 作为参数传递到字符串表达式中?

解决方案

您可以使用 1) pd.eval(), 2) df.query(),或 3) df.eval().下面讨论了它们的各种特性和功能.

示例将涉及这些数据帧(除非另有说明).

np.random.seed(0)df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))


1) pandas.eval

<块引用>

这是丢失的手册";大熊猫文档应该包含.注意: 在讨论的三个函数中,pd.eval 是最重要的.df.evaldf.query 调用pd.eval 在幕后.行为和用法或多或少在三个函数中保持一致,具有一些轻微的语义稍后将重点介绍的变化.本节将介绍所有三个函数通用的功能 - 这包括(但不限于)允许的语法、优先规则关键字参数.

pd.eval 可以计算由变量和/或文字组成的算术表达式.这些表达式必须作为字符串传递.因此,回答问题,如上所述,您可以

x = 5pd.eval(df1.A + (df1.B * x)")

这里需要注意的一些事情:

  1. 整个表达式是一个字符串
  2. df1df2x 指的是全局命名空间中的变量,这些是由 eval 解析表达式时
  3. 使用属性访问器索引访问特定列.您也可以使用 "df1['A'] + (df1['B'] * x)" 达到同样的效果.

我将在下面解释 target=... 属性的部分中解决重新分配的具体问题.但就目前而言,以下是使用 pd.eval 进行有效操作的更简单示例:

pd.eval("df1.A + df2.A") # 有效,返回一个 pd.Series 对象pd.eval("abs(df1) ** .5") # 有效,返回一个 pd.DataFrame 对象

...等等.也以同样的方式支持条件表达式.下面的语句都是有效的表达式,将由引擎评估.

pd.eval(df1 > df2")pd.eval(df1 > 5")pd.eval(df1 

可以在 文档.总之,

<块引用>
  • 除左移 (<<) 和右移 (>>) 运算符之外的算术运算,例如 df + 2 *pi/s ** 4 % 42 - the_golden_ratio
  • 比较操作,包括链式比较,例如 2
  • 布尔运算,例如,df <;df2 和 df3 not df_boollisttuple 文字,例如 [1, 2](1, 2)
  • 属性访问,例如,df.a
  • 下标表达式,例如,df[0]
  • 简单的变量评估,例如,pd.eval('df')(这不是很有用)
  • 数学函数:sin、cos、exp、log、expm1、log1p、sqrt、sinh、cosh、tanh、arcsin、arccos、arctan、arccosh、arcsinh、arctanh、abs和arctan2.

文档的这一部分还指定了不受支持的语法规则,包括 set/dict 文字、if-else 语句、循环和推导式以及生成器表达式.

从列表中,很明显你也可以传递涉及索引的表达式,比如

pd.eval('df1.A * (df1.index > 1)')

1a) 解析器选择:parser=... 参数

pd.eval 在解析表达式字符串以生成语法树时支持两种不同的解析器选项:pandaspython.两者之间的主要区别通过略有不同的优先规则突出显示.

使用默认解析器 pandas,重载的按位运算符 &| 实现了对 pandas 对象的向量化 AND 和 OR 操作将具有与 andor 相同的运算符优先级.所以,

pd.eval("(df1 > df2) & (df3 < df4)")

将与

相同

pd.eval("df1>df2&df3<df4")# pd.eval("df1 > df2 & df3 < df4", parser='pandas')

也一样

pd.eval("df1 > df2 and df3 < df4")

这里,括号是必要的.按照惯例,要做到这一点,需要括号来覆盖按位运算符的更高优先级:

(df1 > df2) &(df3 < df4)

没有那个,我们就结束了

df1 >df2 &df3<df4ValueError:DataFrame 的真值不明确.使用 a.empty、a.bool()、a.item()、a.any() 或 a.all().

如果您想在评估字符串时保持与 Python 的实际运算符优先级规则的一致性,请使用 parser='python'.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')

这两种解析器之间的另一个区别是==!= 运算符的语义与列表和元组节点的语义,它们的语义与<当使用 'pandas' 解析器时,分别为 code>innot in.例如,

pd.eval("df1 == [1, 2, 3]")

有效,并将以与

相同的语义运行

pd.eval(df1 in [1, 2, 3]")

OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') 将抛出 NotImplementedError 错误.

1b) 后端选择:engine=... 参数

有两个选项 - numexpr(默认)和 python.numexpr 选项使用针对性能进行了优化的 numexpr 后端.

使用 Python 后端,您的表达式的计算类似于将表达式传递给 Python 的 eval 函数.您可以灵活地执行更多内部表达式,例如字符串操作.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})pd.eval('df.A.str.contains('ab')', engine='python')0 真1 错误2 真名称:A,dtype:bool

不幸的是,与 numexpr 引擎相比,此方法没有性能优势,并且几乎没有安全措施来确保不评估危险表达式,因此 使用风险自负!通常不建议将此选项更改为 'python' 除非您知道自己在做什么.

1c) local_dictglobal_dict 参数

有时,为表达式中使用的变量提供值很有用,但当前未在您的命名空间中定义.您可以将字典传递给 local_dict

例如:

pd.eval(df1 > thresh")UndefinedVariableError: 名称 'thresh' 未定义

这会失败,因为 thresh 未定义.但是,这有效:

pd.eval(df1 > thresh", local_dict={'thresh': 10})

当您需要从字典中提供变量时,这很有用.或者,使用 Python 引擎,您可以简单地执行以下操作:

mydict = {'thresh': 5}# 没有 *string* 键的字典值不能被访问# 使用python"引擎.pd.eval('df1 > mydict['thresh']', engine='python')

但这可能比使用 'numexpr' 引擎和将字典传递给 local_dict 慢很多global_dict.希望这能为使用这些参数提供令人信服的论据.

1d) target (+ inplace) 参数和赋值表达式

这通常不是必需的,因为通常有更简单的方法可以做到这一点,但是您可以将 pd.eval 的结果分配给实现 __getitem__ 的对象,例如作为 dicts 和(你猜对了)DataFrames.

考虑问题中的例子

<块引用>

x = 5df2['D'] = df1['A'] + (df1['B'] * x)

分配一列D"df2,我们做

pd.eval('D = df1.A + (df1.B * x)', target=df2)A B C D0 5 9 8 51 4 3 0 522 5 0 2 223 8 1 3 484 3 7 0 42

这不是对 df2 的就地修改(但它可以……继续阅读).再看一个例子:

pd.eval('df1.A + df2.A')0 101 112 73 164 10数据类型:int32

如果您想(例如)将其分配回 DataFrame,您可以使用 target 参数,如下所示:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)dfF B G H0 南南南南南南1 南南南南南南2 南南南南南南3 南南南南南南4 南南南南南南df = pd.eval('B = df1.A + df2.A', target=df)# 相似# df = df.assign(B=pd.eval('df1.A + df2.A'))dfF B G H0 南 10 南 南1 南 11 南 南2 南 7 南 南3 南 16 南 南4 南 10 南 南

如果您想对 df 执行就地更改,请设置 inplace=True.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)# 相似# df['B'] = pd.eval('df1.A + df2.A')dfF B G H0 南 10 南 南1 南 11 南 南2 南 7 南 南3 南 16 南 南4 南 10 南 南

如果在没有目标的情况下设置 inplace,则会引发 ValueError.

虽然 target 参数玩起来很有趣,但您很少需要使用它.

如果你想用 df.eval 做到这一点,你可以使用一个涉及赋值的表达式:

df = df.eval("B = @df1.A + @df2.A")# df.eval("B = @df1.A + @df2.A", inplace=True)dfF B G H0 南 10 南 南1 南 11 南 南2 南 7 南 南3 南 16 南 南4 南 10 南 南

注意

pd.eval 的一个意外用途是以与 ast.literal_eval 非常相似的方式解析文字字符串:

pd.eval("[1, 2, 3]")数组([1, 2, 3], dtype=object)

它还可以使用 'python' 引擎解析嵌套列表:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')[[1, 2, 3], [4, 5], [10]]

和字符串列表:

pd.eval([[1, 2, 3]", [4, 5]", [10]"], engine='python')[[1, 2, 3], [4, 5], [10]]

然而,问题在于长度大于 100 的列表:

pd.eval(["[1]"] * 100, engine='python') # 有效pd.eval(["[1]"] * 101, engine='python')AttributeError: 'PandasExprVisitor' 对象没有属性 'visit_Ellipsis'

可以找到有关此错误、原因、修复和解决方法的更多信息 此处.


2) DataFrame.eval:

如上所述,df.eval 在幕后调用 pd.eval,并使用一些参数并列.v0.23 源代码 显示:

def eval(self, expr, inplace=False, **kwargs):从 pandas.core.computation.eval 导入 eval 为 _eval就地 = validate_bool_kwarg(就地,'就地')resolvers = kwargs.pop('resolvers', None)kwargs['level'] = kwargs.pop('level', 0) + 1如果解析器为无:index_resolvers = self._get_index_resolvers()解析器 = dict(self.iteritems()), index_resolvers如果目标"不在 kwargs 中:kwargs['target'] = selfkwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)返回 _eval(expr, inplace=inplace, **kwargs)

eval 创建参数,进行一些验证,然后将参数传递给 pd.eval.

有关更多信息,您可以阅读:何时使用 DataFrame.eval() 与 pandas.eval() 或 Python eval()


2a) 用法差异

2a1) DataFrames 的表达式与系列表达式

对于与整个 DataFrame 关联的动态查询,您应该首选 pd.eval.例如,当您调用 df1.evaldf2 时,没有简单的方法可以指定 pd.eval("df1 + df2") 的等效项.评估.

2a2) 指定列名

另一个主要区别是访问列的方式.例如,要添加两列A"和B"在 df1 中,您将使用以下表达式调用 pd.eval:

pd.eval(df1.A + df1.B")

使用 df.eval,您只需要提供列名:

df1.eval(A + B")

因为在 df1 的上下文中,很明显A"和B"参考列名.

您还可以使用 index 来引用索引和列(除非索引已命名,在这种情况下您将使用名称).

df1.eval(A + 索引")

或者,更一般地说,对于索引具有 1 个或多个级别的任何 DataFrame,您可以使用变量 ilevel_k"在表达式中引用索引的第 k 级; 代表i级别 k 的索引".IOW,上面的表达式可以写成df1.eval("A + ilevel_0").

这些规则也适用于 df.query.

2a3) 访问本地/全局命名空间中的变量

表达式中提供的变量必须以@"开头.符号,以避免与列名混淆.

A = 5df1.eval("A > @A")

query 也是如此.

不用说,您的列名必须遵循 Python 中有效标识符命名规则才能在 eval 中访问.有关命名标识符的规则列表,请参阅此处.

2a4) 多行查询和赋值

一个鲜为人知的事实是 eval 支持处理赋值的多行表达式(而 query 不支持).例如,要创建两个新列E"和F"在 df1 中基于一些列的一些算术运算,以及第三列G"基于先前创建的E"和F",我们可以做

df1.eval(""";E = A + BF = @df2.A + @df2.BG = E > = F""")A B C D E F G0 5 0 3 3 5 14 错误1 7 9 3 5 16 7 真2 2 4 7 6 6 5 真3 8 8 1 6 16 9 真4 7 7 8 1 14 10 真


3) eval vs query

df.query 视为使用 pd.eval 作为子例程的函数会有所帮助.

通常情况下,query(顾名思义)用于评估条件表达式(即结果为 True/False 值的表达式)并返回与 True 对应的行代码>结果.然后将表达式的结果传递给 loc(在大多数情况下)以返回满足表达式的行.根据文档,

<块引用>

这个表达式的求值结果首先传递给DataFrame.loc 如果由于多维键而失败(例如,DataFrame)然后结果将传递给DataFrame.__getitem__().

此方法使用顶级 pandas.eval() 函数来评估通过查询.

就相似性而言,querydf.eval 在访问列名和变量的方式上都相似.

两者之间的主要区别,如上所述,是它们如何处理表达式结果.当您通过这两个函数实际运行表达式时,这一点变得显而易见.例如,考虑

df1.A0 51 72 23 84 7名称:A,数据类型:int32DF1.B0 91 32 03 14 7名称:B,数据类型:int32

获取A"所在的所有行>=B"在 df1 中,我们会像这样使用 eval:

m = df1.eval("A >= B")米0 真1 错误2 错误3 真4 真数据类型:布尔

m 表示通过评估表达式A >= B"生成的中间结果.然后我们使用掩码过滤df1:

df1[m]#df1.loc[m]A B C D0 5 0 3 33 8 8 1 64 7 7 8 1

然而,对于query,中间结果m"不可用.直接传递给loc,所以使用query,你只需要做

df1.query("A >= B")A B C D0 5 0 3 33 8 8 1 64 7 7 8 1

性能方面,完全相同.

df1_big = pd.concat([df1] * 100000, ignore_index=True)%timeit df1_big[df1_big.eval("A >= B")]%timeit df1_big.query("A >= B")每个循环 14.7 ms ± 33.9 µs(7 次运行的平均值 ± 标准偏差,每次 100 次循环)每个循环 14.7 ms ± 24.3 µs(7 次运行的平均值 ± 标准偏差,每次 100 次循环)

但后者更简洁,在一个步骤中表达相同的操作.

请注意,您也可以像这样使用 query 做一些奇怪的事情(例如,返回由 df1.index 索引的所有行)

df1.query(索引")# 与 df1.loc[df1.index] 相同 # 毫无意义,...我知道A B C D0 5 0 3 31 7 9 3 52 2 4 7 63 8 8 1 64 7 7 8 1

但不要.

底线:根据条件表达式查询或过滤行时,请使用query.

I would like to perform arithmetic on one or more dataframes columns using pd.eval. Specifically, I would like to port the following code that evaluates a formula:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

...to code using pd.eval. The reason for using pd.eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.

My two input DataFrames are:

import pandas as pd
import numpy as np

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

df1
   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

df2
   A  B  C  D
0  5  9  8  9
1  4  3  0  3
2  5  0  2  3
3  8  1  3  3
4  3  7  0  1

I am trying to better understand pd.eval's engine and parser arguments to determine how best to solve my problem. I have gone through the documentation, but the difference was not made clear to me.

  1. What arguments should be used to ensure my code is working at the maximum performance?
  2. Is there a way to assign the result of the expression back to df2?
  3. Also, to make things more complicated, how do I pass x as an argument inside the string expression?

解决方案

You can use 1) pd.eval(), 2) df.query(), or 3) df.eval(). Their various features and functionality are discussed below.

Examples will involve these dataframes (unless otherwise specified).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))


1) pandas.eval

This is the "Missing Manual" that pandas doc should contain. Note: of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage is more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section will introduce functionality that is common across all the three functions - this includes, (but not limited to) allowed syntax, precedence rules, and keyword arguments.

pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do

x = 5
pd.eval("df1.A + (df1.B * x)")

Some things to note here:

  1. The entire expression is a string
  2. df1, df2, and x refer to variables in the global namespace, these are picked up by eval when parsing the expression
  3. Specific columns are accessed using the attribute accessor index. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.

I will be addressing the specific issue of reassignment in the section explaining the target=... attribute below. But for now, here are more simple examples of valid operations with pd.eval:

pd.eval("df1.A + df2.A")   # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5")  # Valid, returns a pd.DataFrame object

...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.

pd.eval("df1 > df2")
pd.eval("df1 > 5")
pd.eval("df1 < df2 and df3 < df4")
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")

A list detailing all the supported features and syntax can be found in the documentation. In summary,

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
  • Comparison operations, including chained comparisons, e.g., 2 < df < df2
  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool list and tuple literals, e.g., [1, 2] or (1, 2)
  • Attribute access, e.g., df.a
  • Subscript expressions, e.g., df[0]
  • Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

This section of the documentation also specifies syntax rules that are not supported, including set/dict literals, if-else statements, loops, and comprehensions, and generator expressions.

From the list, it is obvious you can also pass expressions involving the index, such as

pd.eval('df1.A * (df1.index > 1)')

1a) Parser Selection: The parser=... argument

pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.

Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and or. So,

pd.eval("(df1 > df2) & (df3 < df4)")

Will be the same as

pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')

And also the same as

pd.eval("df1 > df2 and df3 < df4")

Here, the parentheses are necessary. To do this conventionally, the parentheses would be required to override the higher precedence of bitwise operators:

(df1 > df2) & (df3 < df4)

Without that, we end up with

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')

The other difference between the two types of parsers are the semantics of the == and != operators with list and tuple nodes, which have the similar semantics as in and not in respectively, when using the 'pandas' parser. For example,

pd.eval("df1 == [1, 2, 3]")

Is valid, and will run with the same semantics as

pd.eval("df1 in [1, 2, 3]")

OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') will throw a NotImplementedError error.

1b) Backend Selection: The engine=... argument

There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend which is optimized for performance.

With Python backend, your expression is evaluated similar to just passing the expression to Python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')

0     True
1    False
2     True
Name: A, dtype: bool

Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so use at your own risk! It is generally not recommended to change this option to 'python' unless you know what you're doing.

1c) local_dict and global_dict arguments

Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. You can pass a dictionary to local_dict

For example:

pd.eval("df1 > thresh")

UndefinedVariableError: name 'thresh' is not defined

This fails because thresh is not defined. However, this works:

pd.eval("df1 > thresh", local_dict={'thresh': 10})

This is useful when you have variables to supply from a dictionary. Alternatively, with the Python engine, you could simply do this:

mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')

But this is going to possibly be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this should make a convincing argument for the use of these parameters.

1d) The target (+ inplace) argument, and Assignment Expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that implements __getitem__ such as dicts, and (you guessed it) DataFrames.

Consider the example in the question

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

To assign a column "D" to df2, we do

pd.eval('D = df1.A + (df1.B * x)', target=df2)

   A  B  C   D
0  5  9  8   5
1  4  3  0  52
2  5  0  2  22
3  8  1  3  48
4  3  7  0  42

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval('df1.A + df2.A')

0    10
1    11
2     7
3    16
4    10
dtype: int32

If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
     F    B    G    H
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to
# df = df.assign(B=pd.eval('df1.A + df2.A'))

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If you wanted to perform an in-place mutation on df, set inplace=True.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to
# df['B'] = pd.eval('df1.A + df2.A')

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If inplace is set without a target, a ValueError is raised.

While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:

df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df

     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

Note

One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:

pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)

It can also parse nested lists with the 'python' engine:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]

And lists of strings:

pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]

The problem, however, is for lists with length larger than 100:

pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python')

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'

More information can this error, causes, fixes, and workarounds can be found here.


2) DataFrame.eval:

As mentioned above, df.eval calls pd.eval under the hood, with a bit of juxtaposition of arguments. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

eval creates arguments, does a little validation, and passes the arguments on to pd.eval.

For more, you can read on: When to use DataFrame.eval() versus pandas.eval() or Python eval()


2a) Usage Differences

2a1) Expressions with DataFrames vs. Series Expressions

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.

2a2) Specifying Column Names

Another other major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:

pd.eval("df1.A + df1.B")

With df.eval, you need only supply the column names:

df1.eval("A + B")

Since, within the context of df1, it is clear that "A" and "B" refer to column names.

You can also refer to the index and columns using index (unless the index is named, in which case you would use the name).

df1.eval("A + index")

Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k" which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").

These rules also apply to df.query.

2a3) Accessing Variables in Local/Global Namespace

Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.

A = 5
df1.eval("A > @A")

The same goes for query.

It goes without saying that your column names must follow the rules for valid identifier naming in Python to be accessible inside eval. See here for a list of rules on naming identifiers.

2a4) Multiline Queries and Assignment

A little known fact is that eval supports multiline expressions that deal with assignment (whereas query doesn't). For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do

df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")

   A  B  C  D   E   F      G
0  5  0  3  3   5  14  False
1  7  9  3  5  16   7   True
2  2  4  7  6   6   5   True
3  8  8  1  6  16   9   True
4  7  7  8  1  14  10   True


3) eval vs query

It helps to think of df.query as a function that uses pd.eval as a subroutine.

Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,

The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.

In terms of similarity, query and df.eval are both alike in how they access column names and variables.

This key difference between the two, as mentioned above is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider

df1.A

0    5
1    7
2    2
3    8
4    7
Name: A, dtype: int32

df1.B

0    9
1    3
2    0
3    1
4    7
Name: B, dtype: int32

To get all rows where "A" >= "B" in df1, we would use eval like this:

m = df1.eval("A >= B")
m
0     True
1    False
2    False
3     True
4     True
dtype: bool

m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:

df1[m]
# df1.loc[m]

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do

df1.query("A >= B")

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

Performance wise, it is exactly the same.

df1_big = pd.concat([df1] * 100000, ignore_index=True)

%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")

14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But the latter is more concise, and expresses the same operation in a single step.

Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)

df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

But don't.

Bottom line: Please use query when querying or filtering rows based on a conditional expression.

这篇关于从 Pandas 中的公式动态计算表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆