Dynamic Expression Evaluation in pandas using pd.eval()


Question

Given two DataFrames

import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

df1
   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

df2
   A  B  C  D
0  5  9  8  9
1  4  3  0  3
2  5  0  2  3
3  8  1  3  3
4  3  7  0  1

I would like to perform arithmetic on one or more columns using pd.eval. Specifically, I would like to port the following code:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x) 

...to code using eval. The reason for using eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.

I am trying to better understand the engine and parser arguments to determine how best to solve my problem. I have gone through the documentation but the difference was not made clear to me.

  1. What arguments should be used to ensure my code is working at max performance?
  2. Is there a way to assign the result of the expression back to df2?
  3. Also, to make things more complicated, how do I pass x as an argument inside the string expression?

Recommended answer

This answer dives into the various features and functionality offered by pd.eval, df.query, and df.eval.

Setup
Examples will involve these DataFrames (unless otherwise specified).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))


pandas.eval - The "Missing Manual"

Note
Of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions, including (but not limited to) allowed syntax, precedence rules, and keyword arguments.

pd.eval can evaluate arithmetic expressions which can consist of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do

x = 5
pd.eval("df1.A + (df1.B * x)")  

Some things to note here:

  1. The entire expression is a string.
  2. df1, df2, and x refer to variables in the global namespace; these are picked up by eval when parsing the expression.
  3. Specific columns are accessed using attribute access. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.
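
To make the second point concrete, here is a minimal sketch of name lookup from inside a function. The helper name scaled_sum and its parameter names are made up for illustration; the only assumption about pd.eval is that it resolves names from the scope it is called in, as described above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

def scaled_sum(frame, x):
    # 'frame' and 'x' are local names of this (hypothetical) helper;
    # pd.eval looks them up in the calling scope when it parses the string.
    return pd.eval("frame.A + (frame.B * x)")

result = scaled_sum(df1, 5)

# Plain pandas version of the same computation, for comparison
expected = df1['A'] + (df1['B'] * 5)
print((result == expected).all())
```

Because the expression is just a string, the same helper could take the expression itself as a parameter, which is what makes this approach attractive for dynamically generated workflows.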

I will be addressing the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are some simpler examples of valid operations with pd.eval:

pd.eval("df1.A + df2.A")   # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5")  # Valid, returns a pd.DataFrame object

...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.

pd.eval("df1 > df2")        
pd.eval("df1 > 5")    
pd.eval("df1 < df2 and df3 < df4")      
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")

A list detailing all the supported features and syntax can be found in the documentation. In summary,

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
  • Comparison operations, including chained comparisons, e.g., 2 < df < df2
  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
  • list and tuple literals, e.g., [1, 2] or (1, 2)
  • Attribute access, e.g., df.a
  • Subscript expressions, e.g., df[0]
  • Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.

This section of the documentation also specifies syntax rules that are not supported, including set/dict literals, if-else statements, loops, comprehensions, and generator expressions.

From the list, it is obvious you can also pass expressions involving the index, such as

pd.eval('df1.A * (df1.index > 1)')

Parser Selection: The parser=... argument

pd.eval supports two different parser options when parsing the expression string to generate the syntax tree: pandas and python. The main difference between the two is highlighted by slightly differing precedence rules.

Using the default parser pandas, the overloaded bitwise operators & and | which implement vectorized AND and OR operations with pandas objects will have the same operator precedence as and and or. So,

pd.eval("(df1 > df2) & (df3 < df4)")

will be the same as

pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')

This will also be the same as

pd.eval("df1 > df2 and df3 < df4")

Here, the parentheses are necessary. To do this conventionally, the parens would be required to override the higher precedence of bitwise operators:

(df1 > df2) & (df3 < df4)

Otherwise, we end up with

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
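
The equivalences described in this section can be checked side by side. The sketch below rebuilds the four setup DataFrames and compares each string form against the conventional, fully parenthesised pandas expression; the variable names on the left are only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1, df2, df3, df4 = (
    pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
    for _ in range(4)
)

# Conventional pandas: parentheses are required around each comparison
reference = (df1 > df2) & (df3 < df4)

# Default 'pandas' parser: & is given the same precedence as 'and'
with_parens = pd.eval("(df1 > df2) & (df3 < df4)")
no_parens = pd.eval("df1 > df2 & df3 < df4")
with_and = pd.eval("df1 > df2 and df3 < df4")

# 'python' parser: normal Python precedence, so keep the parentheses
python_parser = pd.eval("(df1 > df2) & (df3 < df4)", parser='python')

for candidate in (with_parens, no_parens, with_and, python_parser):
    print((candidate == reference).all().all())
```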

The other difference between the two parsers is the semantics of the == and != operators with list and tuple nodes, which have semantics similar to in and not in respectively when using the 'pandas' parser. For example,

pd.eval("df1 == [1, 2, 3]")

is valid, and will return the same output as

pd.eval("df1 in [1, 2, 3]")

OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') will raise a NotImplementedError.

The engine=... argument selects the backend that evaluates the parsed expression. There are two options - numexpr (the default) and python. The numexpr option uses the numexpr backend, which is optimized for performance.

With 'python' backend, your expression is evaluated similar to just passing the expression to python's eval function. You have the flexibility of doing more inside expressions, such as string operations, for instance.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')

0     True
1    False
2     True
Name: A, dtype: bool

Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to 'python' unless you know what you're doing.

Sometimes, it is useful to supply values for variables used inside expressions, but not currently defined in your namespace. You can pass a dictionary to local_dict.

For example,

pd.eval("df1 > thresh")

UndefinedVariableError: name 'thresh' is not defined

This fails because thresh is not defined. However, this works:

pd.eval("df1 > thresh", local_dict={'thresh': 10})

This is useful when you have variables to supply from a dictionary. Alternatively, with the 'python' engine, you could simply do this:

mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without 
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')

But this may be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for the use of these parameters.
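
For dynamically generated workflows, local_dict is a natural place to put parameters. The following is a minimal sketch, assuming the parameters arrive as a plain dictionary; the names params and expr are illustrative, and df1 is passed in explicitly so the expression does not depend on what happens to be in the calling scope.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Hypothetical parameters, e.g. loaded from a config file
params = {'thresh': 5, 'scale': 2}
expr = "(df1.A * scale) > thresh"

# Everything the expression refers to is supplied through local_dict
mask = pd.eval(expr, local_dict={'df1': df1, **params})

expected = (df1['A'] * params['scale']) > params['thresh']
print((mask == expected).all())
```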

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that supports item assignment (__setitem__), such as dicts and (you guessed it) DataFrames.

Consider the example in the question

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

To assign a column "D" to df2, we do

pd.eval('D = df1.A + (df1.B * x)', target=df2)

   A  B  C   D
0  5  9  8   5
1  4  3  0  52
2  5  0  2  22
3  8  1  3  48
4  3  7  0  42

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval('df1.A + df2.A')

0    10
1    11
2     7
3    16
4    10
dtype: int32

If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
     F    B    G    H
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to 
# df = df.assign(B=pd.eval('df1.A + df2.A'))

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If you wanted to perform an in-place mutation on df, set inplace=True.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to 
# df['B'] = pd.eval('df1.A + df2.A')

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If inplace is set without a target, a ValueError is raised.
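
The three behaviours described above (copy by default, mutation with inplace=True, and the error without a target) can be seen together in one short sketch using the question's expression. The helper variables original_D, untouched and message are only there for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
x = 5
original_D = df2['D'].copy()

# Without inplace, a modified copy is returned and df2 is left alone
out = pd.eval("D = df1.A + (df1.B * x)", target=df2)
untouched = df2['D'].equals(original_D)

# With inplace=True, df2 itself is mutated and nothing is returned
ret = pd.eval("D = df1.A + (df1.B * x)", target=df2, inplace=True)

# inplace without a target is expected to be rejected with a ValueError
message = None
try:
    pd.eval("D = df1.A + (df1.B * x)", inplace=True)
except ValueError as err:
    message = str(err)

print(untouched, ret, message)
```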

While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:

df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df

     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

Note
One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:

pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)

It can also parse nested lists with the 'python' engine:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]

And lists of strings:

pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]

The problem, however, is for lists with length larger than 100:

pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python') 

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'

More information on this error, its causes, fixes, and workarounds can be found here.
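
One workaround that sidesteps the limit entirely (not necessarily the one discussed in the linked post) is to parse each string with ast.literal_eval from the standard library, since parsing literals is what it is designed for. A minimal sketch:

```python
import ast

# 101 literal strings, the size at which the pd.eval call above fails
strings = ["[1]"] * 101

# Parse each string on its own; there is no length limit to hit here
parsed = [ast.literal_eval(s) for s in strings]

print(len(parsed), parsed[0])
```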

As mentioned above, df.eval calls pd.eval under the hood. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

eval creates arguments, does a little validation, and passes the arguments on to pd.eval.
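
As a rough illustration of what those arguments achieve (a sketch of the idea, not the actual pandas implementation), the columns can be handed to pd.eval by hand through its resolvers argument. The name column_resolver is made up for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Approximation of the resolver df.eval builds: column name -> column,
# so that bare names in the expression refer to columns.
column_resolver = {name: df1[name] for name in df1.columns}

via_pd_eval = pd.eval("A + B", resolvers=[column_resolver])
via_df_eval = df1.eval("A + B")

print((via_pd_eval == via_df_eval).all())
```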

For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.

Another major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:

pd.eval("df1.A + df1.B")

With df.eval, you need only supply the column names:

df1.eval("A + B")

Since, within the context of df1, it is clear that "A" and "B" refer to column names.

You can also refer to the index using index (unless the index is named, in which case you would use the name).

df1.eval("A + index")

Or, more generally, for any DataFrame with an index having 1 or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k" which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").
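
A short sketch of the index names discussed above, using the setup's df1. The index name 'row' is an arbitrary choice for the named-index case.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Unnamed index: 'index' and 'ilevel_0' refer to the same thing
by_index = df1.eval("A + index")
by_level = df1.eval("A + ilevel_0")

# Named index: refer to it by its name ('row' is made up here)
named = df1.rename_axis('row')
by_name = named.eval("A + row")

expected = df1['A'] + df1.index
print(list(by_index) == list(expected))
```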

These rules also apply to query.

Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.

A = 5
df1.eval("A > @A") 

The same goes for query.
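
To tie this back to the question, here is one way the original assignment could be ported to df.eval, with every outside name (df1 and x) marked by "@". This is a sketch of one possible port rather than the only one; the result name ported is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
x = 5

# Equivalent of: df2['D'] = df1['A'] + (df1['B'] * x)
# Returns a copy of df2 with column D replaced; df2 itself is unchanged
ported = df2.eval("D = @df1.A + (@df1.B * @x)")

print(ported)
```

Passing inplace=True instead would write the column into df2 directly, matching the original statement more literally.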

It goes without saying that your column names must follow the rules for valid identifier naming in python to be accessible inside eval. See here for a list of rules on naming identifiers.

A little known fact is that eval supports multiline expressions that deal with assignment. For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do

df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")

   A  B  C  D   E   F      G
0  5  0  3  3   5  14  False
1  7  9  3  5  16   7   True
2  2  4  7  6   6   5   True
3  8  8  1  6  16   9   True
4  7  7  8  1  14  10   True

...Nifty! However, note that this is not supported by query.
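
Since the multiline expression is an ordinary string, it can also be assembled programmatically, which is handy for automated workflows. A minimal sketch, assuming the column definitions are kept in a dictionary (the assignments mapping is made up for this example):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Hypothetical column definitions; later entries may use earlier ones
assignments = {"E": "A + B", "F": "C * 2", "G": "E >= F"}

# One "name = expression" assignment per line
expr = "\n".join(f"{name} = {rhs}" for name, rhs in assignments.items())

result = df1.eval(expr)
print(result)
```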

It helps to think of df.query as a function that uses pd.eval as a subroutine.

Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True result. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,

The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.

In terms of similarity, query and df.eval are both alike in how they access column names and variables.

The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider

df1.A

0    5
1    7
2    2
3    8
4    7
Name: A, dtype: int32

df1.B

0    0
1    9
2    4
3    8
4    7
Name: B, dtype: int32

To get all rows where "A" >= "B" in df1, we would use eval like this:

m = df1.eval("A >= B")
m
0     True
1    False
2    False
3     True
4     True
dtype: bool

m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:

df1[m]
# df1.loc[m]

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

However, with query, the intermediate result "m" is directly passed to loc, so with query, you would simply need to do

df1.query("A >= B")

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

Performance-wise, the two are essentially the same.

df1_big = pd.concat([df1] * 100000, ignore_index=True)

%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")

14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But the latter is more concise, and expresses the same operation in a single step.
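
The two routes can be compared directly, and query accepts "@" variables in the same way df.eval does. In the sketch below, min_c is an arbitrary threshold chosen for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

# Two steps with eval: build the mask, then filter
mask = df1.eval("A >= B")
via_mask = df1[mask]

# One step with query
via_query = df1.query("A >= B")

# An outside variable in the condition, marked with "@"
min_c = 3
filtered = df1.query("A >= B and C >= @min_c")

print(via_mask.equals(via_query))
print(filtered)
```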

Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index)

df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

But don't.

Bottom line: Please use query when querying or filtering rows based on a conditional expression.
