Dynamically evaluate expression from formula in pandas?

Problem description

I would like to perform arithmetic on one or more DataFrame columns using pd.eval. Specifically, I would like to port the following code that evaluates a formula:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x) 

...to code that uses pd.eval. The reason for using pd.eval is that I would like to automate many workflows, so creating them dynamically will be useful to me.

My two input DataFrames are:

import pandas as pd
import numpy as np

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))

df1
   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

df2
   A  B  C  D
0  5  9  8  9
1  4  3  0  3
2  5  0  2  3
3  8  1  3  3
4  3  7  0  1

I am trying to better understand pd.eval's engine and parser arguments to determine how best to solve my problem. I have gone through the documentation, but the difference between them is still not clear to me.

  1. What arguments should be used to ensure my code runs at maximum performance?
  2. Is there a way to assign the result of the expression back to df2?
  3. Also, to make things more complicated, how do I pass x as an argument inside the string expression?

Recommended answer

You can use 1) pd.eval(), 2) df.query(), or 3) df.eval(). Their various features and functionality are discussed below.

Examples will involve these dataframes (unless otherwise specified).

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df2 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df3 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))
df4 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list('ABCD'))


1) pandas.eval

This is the "Missing Manual" that the pandas docs should contain. Note: of the three functions being discussed, pd.eval is the most important. df.eval and df.query call pd.eval under the hood. Behaviour and usage are more or less consistent across the three functions, with some minor semantic variations which will be highlighted later. This section introduces functionality that is common to all three functions, including (but not limited to) allowed syntax, precedence rules, and keyword arguments.

pd.eval can evaluate arithmetic expressions consisting of variables and/or literals. These expressions must be passed as strings. So, to answer the question as stated, you can do

x = 5
pd.eval("df1.A + (df1.B * x)")  

Some things to note:

  1. The entire expression is a string.
  2. df1 and x refer to variables in the global namespace; these are picked up by eval when parsing the expression.
  3. Specific columns are accessed using attribute access. You can also use "df1['A'] + (df1['B'] * x)" to the same effect.
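
Because the expression is only a string, it can be assembled at run time, which is what the question is after. The following is a minimal sketch, assuming the formula pieces (the hypothetical names left, right and factor) come from your own workflow configuration; it compares the pd.eval result with the same computation written as ordinary pandas code.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame(np.random.choice(10, (5, 4)), columns=list("ABCD"))
x = 5

# Hypothetical pieces of a formula, e.g. read from a config file.
left, right, factor = "df1.A", "df1.B", "x"
expr = f"{left} + ({right} * {factor})"  # builds "df1.A + (df1.B * x)"

# pd.eval looks up df1 and x in the calling namespace.
result = pd.eval(expr)

# The same computation written as ordinary pandas code.
expected = df1["A"] + (df1["B"] * x)
print((result == expected).all())
```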

I will address the specific issue of reassignment in the section explaining the target=... argument below. But for now, here are some simpler examples of valid operations with pd.eval:

pd.eval("df1.A + df2.A")   # Valid, returns a pd.Series object
pd.eval("abs(df1) ** .5")  # Valid, returns a pd.DataFrame object

...and so on. Conditional expressions are also supported in the same way. The statements below are all valid expressions and will be evaluated by the engine.

pd.eval("df1 > df2")        
pd.eval("df1 > 5")    
pd.eval("df1 < df2 and df3 < df4")      
pd.eval("df1 in [1, 2, 3]")
pd.eval("1 < 2 < 3")

A list of the supported features and syntax can be found in the documentation. In summary, the following are supported:

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
  • Comparison operations, including chained comparisons, e.g., 2 < df < df2
  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
  • list and tuple literals, e.g., [1, 2] or (1, 2)
  • Attribute access, e.g., df.a
  • Subscript expressions, e.g., df[0]
  • Simple variable evaluation, e.g., pd.eval('df') (this is not very useful)
  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2.
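
A few of the items in this list can be checked against their plain pandas equivalents. This is only an illustrative sketch: the two small frames (df and df2) and their values are made up for the example and are not part of the original answer.

```python
import pandas as pd

# Made-up data, used only to exercise the supported syntax.
df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 3]})
df2 = pd.DataFrame({"a": [3, 3, 9], "b": [9, 5, 4]})

# Arithmetic follows the usual precedence rules.
arith = pd.eval("df + 2 * df2 % 5")

# A chained comparison is evaluated as (1 < df) & (df < df2).
chained = pd.eval("1 < df < df2")

# Boolean operation combining two comparisons.
both = pd.eval("df < df2 and df2 > 3")

# Attribute access selects a single column from each frame.
col = pd.eval("df.a + df2.a")
```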

This section of the documentation also specifies syntax that is not supported, including set/dict literals, if-else statements, loops, comprehensions, and generator expressions.

From the list, it is obvious you can also pass expressions involving the index, such as

pd.eval('df1.A * (df1.index > 1)')

1a) Parser Selection: The parser=... argument

pd.eval supports two parser options for parsing the expression string into a syntax tree: pandas and python. The main difference between the two lies in their slightly different precedence rules.

With the default parser, pandas, the overloaded bitwise operators & and |, which implement vectorized AND and OR operations on pandas objects, have the same operator precedence as and and or. So,

pd.eval("(df1 > df2) & (df3 < df4)")

will be the same as

pd.eval("df1 > df2 & df3 < df4")
# pd.eval("df1 > df2 & df3 < df4", parser='pandas')

and will also be the same as

pd.eval("df1 > df2 and df3 < df4")

In ordinary pandas code, outside of pd.eval, the parentheses are necessary: they are required to override the higher precedence of the bitwise operators:

(df1 > df2) & (df3 < df4)

Otherwise, we end up with

df1 > df2 & df3 < df4

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Use parser='python' if you want to maintain consistency with python's actual operator precedence rules while evaluating the string.

pd.eval("(df1 > df2) & (df3 < df4)", parser='python')
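
The precedence behaviour described above can be checked directly. This is a small sketch that rebuilds the four example frames with the same seed and compares each spelling against the explicitly parenthesised pandas expression; the variable names on the left-hand side are only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Created in order, so these match df1..df4 from the setup above.
df1, df2, df3, df4 = (
    pd.DataFrame(np.random.choice(10, (5, 4)), columns=list("ABCD"))
    for _ in range(4)
)

reference = (df1 > df2) & (df3 < df4)

explicit = pd.eval("(df1 > df2) & (df3 < df4)")
implicit = pd.eval("df1 > df2 & df3 < df4")    # pandas parser: & binds like `and`
keyword = pd.eval("df1 > df2 and df3 < df4")
python_parser = pd.eval("(df1 > df2) & (df3 < df4)", parser="python")

for name, value in [("explicit", explicit), ("implicit", implicit),
                    ("keyword", keyword), ("python_parser", python_parser)]:
    print(name, bool((value == reference).all().all()))
```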

The other difference between the two parsers is the semantics of the == and != operators with list and tuple nodes: with the 'pandas' parser, they have semantics similar to in and not in, respectively. For example,

pd.eval("df1 == [1, 2, 3]")

is valid, and will run with the same semantics as

pd.eval("df1 in [1, 2, 3]")

OTOH, pd.eval("df1 == [1, 2, 3]", parser='python') will throw a NotImplementedError.
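
To make the list semantics of == concrete, here is a short sketch with the default 'pandas' parser. The Series s and its values are made up for the example; Series.isin is used as the plain pandas reference for membership testing.

```python
import pandas as pd

# Made-up data for illustration.
s = pd.Series([1, 5, 2, 7, 3], name="A")

as_equals = pd.eval("s == [1, 2, 3]")  # 'pandas' parser treats this like `in`
as_in = pd.eval("s in [1, 2, 3]")
reference = s.isin([1, 2, 3])          # plain pandas membership test

print(as_equals.tolist())
print(as_in.tolist())
```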

1b) Backend Selection: The engine=... argument

There are two options for the engine argument: numexpr (the default) and python. The numexpr option uses the numexpr backend, which is optimized for performance.

With the 'python' backend, your expression is evaluated much as if it had been passed to Python's eval function. This gives you the flexibility to do more inside expressions, such as string operations.

df = pd.DataFrame({'A': ['abc', 'def', 'abacus']})
pd.eval('df.A.str.contains("ab")', engine='python')

0     True
1    False
2     True
Name: A, dtype: bool

Unfortunately, this method offers no performance benefits over the numexpr engine, and there are very few security measures to ensure that dangerous expressions are not evaluated, so USE AT YOUR OWN RISK! It is generally not recommended to change this option to 'python' unless you know what you're doing.
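
If you want to see how the two engines compare on your own machine, a rough timing sketch such as the one below may help. It assumes a made-up 100,000-row frame called big, passes names explicitly through local_dict, and skips numexpr when that optional dependency is not installed. The timings it prints depend on your hardware and are not a benchmark.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up data, large enough for timing differences to show up.
big = pd.DataFrame(rng.integers(0, 10, size=(100_000, 2)), columns=["A", "B"])
namespace = {"big": big, "x": 5}
expr = "big.A + (big.B * x)"

results = {}
for engine in ("numexpr", "python"):
    try:
        results[engine] = pd.eval(expr, engine=engine, local_dict=namespace)
    except ImportError:
        # numexpr is an optional dependency; skip it when unavailable.
        continue
    seconds = timeit.timeit(
        lambda: pd.eval(expr, engine=engine, local_dict=namespace), number=10
    )
    print(f"{engine}: {seconds / 10 * 1e3:.2f} ms per call")
```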

1c) local_dict and global_dict arguments

Sometimes, it is useful to supply values for variables that are used inside expressions but are not currently defined in your namespace. You can pass a dictionary to local_dict.

For example:

pd.eval("df1 > thresh")

UndefinedVariableError: name 'thresh' is not defined

This fails because thresh is not defined. However, this works:

pd.eval("df1 > thresh", local_dict={'thresh': 10})
    

This is useful when you have variables to supply from a dictionary. Alternatively, with the 'python' engine, you could simply do this:

mydict = {'thresh': 5}
# Dictionary values with *string* keys cannot be accessed without 
# using the 'python' engine.
pd.eval('df1 > mydict["thresh"]', engine='python')

But this may be much slower than using the 'numexpr' engine and passing a dictionary to local_dict or global_dict. Hopefully, this makes a convincing argument for using these parameters.
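
For automating workflows, local_dict makes it possible to wrap pd.eval in a small helper that receives both the formula and the names it needs. The helper below, evaluate_formula, is a hypothetical name for this sketch and not part of pandas.

```python
import pandas as pd


def evaluate_formula(formula, **names):
    """Evaluate `formula` with pd.eval, supplying `names` via local_dict.

    Hypothetical helper for illustration; not a pandas function.
    """
    return pd.eval(formula, local_dict=names)


# First two columns of df1 from the setup above, written out literally.
df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7], "B": [0, 9, 4, 8, 7]})

result = evaluate_formula("df1.A + (df1.B * x)", df1=df1, x=5)
print(result.tolist())  # [5, 52, 22, 48, 42]
```

Passing the names explicitly keeps the helper independent of whatever happens to be defined in the caller's scope.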

1d) target (+ inplace) argument and assignment expressions

This is not often a requirement because there are usually simpler ways of doing this, but you can assign the result of pd.eval to an object that supports item assignment with string keys (__setitem__), such as a dict or (you guessed it) a DataFrame.

Consider the example in the question:

x = 5
df2['D'] = df1['A'] + (df1['B'] * x)

To assign a column "D" to df2, we do

pd.eval('D = df1.A + (df1.B * x)', target=df2)

   A  B  C   D
0  5  9  8   5
1  4  3  0  52
2  5  0  2  22
3  8  1  3  48
4  3  7  0  42

This is not an in-place modification of df2 (but it can be... read on). Consider another example:

pd.eval('df1.A + df2.A')

0    10
1    11
2     7
3    16
4    10
dtype: int32

If you wanted to (for example) assign this back to a DataFrame, you could use the target argument as follows:

df = pd.DataFrame(columns=list('FBGH'), index=df1.index)
df
     F    B    G    H
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN

df = pd.eval('B = df1.A + df2.A', target=df)
# Similar to 
# df = df.assign(B=pd.eval('df1.A + df2.A'))

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If you wanted to perform an in-place mutation on df, set inplace=True.

pd.eval('B = df1.A + df2.A', target=df, inplace=True)
# Similar to 
# df['B'] = pd.eval('df1.A + df2.A')

df
     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

If inplace is set without a target, a ValueError is raised.
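
The difference between the copying and in-place forms can be seen side by side. This sketch uses small literal frames (the A and B columns of df1 and the A and D columns of df2 from the setup above) so that the results are easy to follow; new_df2, before and returned are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7], "B": [0, 9, 4, 8, 7]})
df2 = pd.DataFrame({"A": [5, 4, 5, 8, 3], "D": [9, 3, 3, 3, 1]})
x = 5

# Without inplace, pd.eval returns a modified copy and leaves df2 untouched.
new_df2 = pd.eval("D = df1.A + (df1.B * x)", target=df2)
before = df2["D"].tolist()

# With inplace=True, df2 itself is modified and nothing is returned.
returned = pd.eval("D = df1.A + (df1.B * x)", target=df2, inplace=True)

print(new_df2["D"].tolist())  # [5, 52, 22, 48, 42]
print(before)                 # [9, 3, 3, 3, 1]
print(returned)               # None
```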

While the target argument is fun to play around with, you will seldom need to use it.

If you wanted to do this with df.eval, you would use an expression involving an assignment:

df = df.eval("B = @df1.A + @df2.A")
# df.eval("B = @df1.A + @df2.A", inplace=True)
df

     F   B    G    H
0  NaN  10  NaN  NaN
1  NaN  11  NaN  NaN
2  NaN   7  NaN  NaN
3  NaN  16  NaN  NaN
4  NaN  10  NaN  NaN

Note
One of pd.eval's unintended uses is parsing literal strings in a manner very similar to ast.literal_eval:

pd.eval("[1, 2, 3]")
array([1, 2, 3], dtype=object)

It can also parse nested lists with the 'python' engine:

pd.eval("[[1, 2, 3], [4, 5], [10]]", engine='python')
[[1, 2, 3], [4, 5], [10]]

And lists of strings:

pd.eval(["[1, 2, 3]", "[4, 5]", "[10]"], engine='python')
[[1, 2, 3], [4, 5], [10]]

The problem, however, is for lists with length larger than 100:

pd.eval(["[1]"] * 100, engine='python') # Works
pd.eval(["[1]"] * 101, engine='python') 

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'

More information on this error, its causes, fixes, and workarounds can be found here.

2) DataFrame.eval

As mentioned above, df.eval calls pd.eval under the hood, with some rearrangement of the arguments. The v0.23 source code shows this:

def eval(self, expr, inplace=False, **kwargs):

    from pandas.core.computation.eval import eval as _eval

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

eval creates the arguments, does a little validation, and passes them on to pd.eval.

For more, you can read on: when to use DataFrame.eval() versus pandas.eval() or python eval()

For dynamic queries associated with entire DataFrames, you should prefer pd.eval. For example, there is no simple way to specify the equivalent of pd.eval("df1 + df2") when you call df1.eval or df2.eval.

Another major difference is how columns are accessed. For example, to add two columns "A" and "B" in df1, you would call pd.eval with the following expression:

pd.eval("df1.A + df1.B")

With df.eval, you need only supply the column names:

df1.eval("A + B")

This works because, within the context of df1, it is clear that "A" and "B" refer to column names.

You can also refer to the index using index (unless the index is named, in which case you would use its name).

df1.eval("A + index")

Or, more generally, for any DataFrame whose index has one or more levels, you can refer to the kth level of the index in an expression using the variable "ilevel_k", which stands for "index at level k". IOW, the expression above can be written as df1.eval("A + ilevel_0").

These rules also apply to df.query.

Variables supplied inside expressions must be preceded by the "@" symbol, to avoid confusion with column names.

A = 5
df1.eval("A > @A") 

The same goes for query.
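
The column, index and "@" rules above combine naturally when expressions are built dynamically. The sketch below uses a hypothetical helper, rows_above, together with literal copies of columns A and B of df1; it is meant as an illustration rather than a recommended API.

```python
import pandas as pd


def rows_above(frame, column, threshold):
    """Boolean mask of the rows where `column` exceeds `threshold`.

    Hypothetical helper: the column name is spliced into the string,
    while the local variable `threshold` is read through the @ prefix.
    """
    return frame.eval(f"{column} > @threshold")


df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7], "B": [0, 9, 4, 8, 7]})

mask = rows_above(df1, "A", 5)
plus_index = df1.eval("A + index")     # unnamed index is available as `index`
same_thing = df1.eval("A + ilevel_0")  # level 0 of the index

print(mask.tolist())        # [False, True, False, True, True]
print(plus_index.tolist())  # [5, 8, 4, 11, 11]
```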

It goes without saying that your column names must follow the rules for valid identifier naming in Python to be accessible inside eval. See here for a list of rules on naming identifiers.

A little-known fact is that eval supports multiline expressions that deal with assignment (whereas query doesn't). For example, to create two new columns "E" and "F" in df1 based on some arithmetic operations on some columns, and a third column "G" based on the previously created "E" and "F", we can do

df1.eval("""
E = A + B
F = @df2.A + @df2.B
G = E >= F
""")

   A  B  C  D   E   F      G
0  5  0  3  3   5  14  False
1  7  9  3  5  16   7   True
2  2  4  7  6   6   5   True
3  8  8  1  6  16   9   True
4  7  7  8  1  14  10   True
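
Since the multiline expression is itself a string, it can be stored and reused across workflows. A minimal sketch, assuming a hypothetical FORMULAS constant and helper add_derived_columns, with the A and B columns of df1 and df2 written out literally:

```python
import pandas as pd

# Hypothetical reusable formula block; `right` is a local variable of the
# helper below, so it is referenced with the @ prefix.
FORMULAS = """
E = A + B
F = @right.A + @right.B
G = E >= F
"""


def add_derived_columns(left, right):
    """Return a copy of `left` with the columns E, F and G added."""
    return left.eval(FORMULAS)


df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7], "B": [0, 9, 4, 8, 7]})
df2 = pd.DataFrame({"A": [5, 4, 5, 8, 3], "B": [9, 3, 0, 1, 7]})

out = add_derived_columns(df1, df2)
print(out["G"].tolist())  # [False, True, True, True, True]
```

Because inplace is left at its default, df1 itself is not modified.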


3) eval vs query

It helps to think of df.query as a function that uses pd.eval as a subroutine.

Typically, query (as the name suggests) is used to evaluate conditional expressions (i.e., expressions that result in True/False values) and return the rows corresponding to the True results. The result of the expression is then passed to loc (in most cases) to return the rows that satisfy the expression. According to the documentation,

The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level pandas.eval() function to evaluate the passed query.

In terms of similarity, query and df.eval are both alike in how they access column names and variables.

The key difference between the two, as mentioned above, is how they handle the expression result. This becomes obvious when you actually run an expression through these two functions. For example, consider

df1.A

0    5
1    7
2    2
3    8
4    7
Name: A, dtype: int32

df1.B

0    0
1    9
2    4
3    8
4    7
Name: B, dtype: int32

To get all rows where "A" >= "B" in df1, we would use eval like this:

m = df1.eval("A >= B")
m
0     True
1    False
2    False
3     True
4     True
dtype: bool

m represents the intermediate result generated by evaluating the expression "A >= B". We then use the mask to filter df1:

df1[m]
# df1.loc[m]

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

However, with query, the intermediate result "m" is passed directly to loc, so you would simply need to do

df1.query("A >= B")

   A  B  C  D
0  5  0  3  3
3  8  8  1  6
4  7  7  8  1

Performance-wise, it is exactly the same.

df1_big = pd.concat([df1] * 100000, ignore_index=True)

%timeit df1_big[df1_big.eval("A >= B")]
%timeit df1_big.query("A >= B")

14.7 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.7 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But the latter is more concise, and expresses the same operation in a single step.
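
The equivalence of the two approaches is easy to confirm on the small example. This sketch uses literal copies of columns A and B of df1, and the names condition, via_mask and via_query are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [5, 7, 2, 8, 7], "B": [0, 9, 4, 8, 7]})
condition = "A >= B"

# Two steps: build a boolean mask with eval, then filter with it.
via_mask = df1[df1.eval(condition)]

# One step: query evaluates the condition and filters the rows.
via_query = df1.query(condition)

print(via_query.equals(via_mask))  # True
print(via_query.index.tolist())    # [0, 3, 4]
```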

Note that you can also do weird stuff with query like this (to, say, return all rows indexed by df1.index):

df1.query("index")
# Same as df1.loc[df1.index] # Pointless,... I know

   A  B  C  D
0  5  0  3  3
1  7  9  3  5
2  2  4  7  6
3  8  8  1  6
4  7  7  8  1

But don't.

Bottom line: Please use query when querying or filtering rows based on a conditional expression.
