When to use DataFrame.eval() versus pandas.eval() or python eval()


Problem description

I have a few dozen conditions (e.g., foo > bar) that I need to evaluate on ~1MM rows of a DataFrame, and the most concise way of writing this is to store these conditions as a list of strings and create a DataFrame of boolean results (one row per record x one column per condition). (User input is not being evaluated.)
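A minimal sketch of the setup described above, assuming hypothetical column names `foo` and `bar` and a made-up list of condition strings (none of these come from the real data). It builds one boolean column per condition with `DataFrame.eval()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data and conditions, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"foo": rng.uniform(size=1000), "bar": rng.uniform(size=1000)})
conditions = ["foo > bar", "foo + bar < 1", "foo * 2 > bar"]

# One row per record, one boolean column per condition string.
results = pd.DataFrame({cond: df.eval(cond) for cond in conditions}, index=df.index)
print(results.head())
```

Keying the result columns by the condition string keeps the output self-describing, which is the readability benefit the question is after.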

In the quest for premature optimization, I am trying to determine whether I should write these conditions for evaluation within DataFrame (e.g., df.eval("foo > bar")) or just leave it to python, as in eval("df.foo > df.bar").

According to the documentation on enhancing eval performance:

You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions/objects than plain ol’ Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.

It would be nice to be able to use the df.eval("foo > bar") syntax, because my list would be a little more readable, but I can't ever find a case where it's not slower to evaluate. The documentation shows examples of where pandas.eval() is faster than python eval() (which matches my experience) but none for DataFrame.eval() (which is listed as 'Experimental').

For example, DataFrame.eval() is still a clear loser in a not-simple expression on a large-ish DataFrame:

import pandas as pd
import numpy as np
import numexpr
import timeit

someDf = pd.DataFrame({'a':np.random.uniform(size=int(1e6)), 'b':np.random.uniform(size=int(1e6))})

%timeit -n100 someDf.eval("a**b - a*b > b**a - b/a") # DataFrame.eval() on notional expression
%timeit -n100 eval("someDf['a']**someDf['b'] - someDf['a']*someDf['b'] > someDf['b']**someDf['a'] - someDf['b']/someDf['a']")
%timeit -n100 pd.eval("someDf.a**someDf.b - someDf.a*someDf.b > someDf.b**someDf.a - someDf.b/someDf.a")

100 loops, best of 3: 29.9 ms per loop
100 loops, best of 3: 18.7 ms per loop
100 loops, best of 3: 15.4 ms per loop
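
For readers outside IPython, the same comparison can be approximated with the standard `timeit` module. This is only a sketch: the row count is reduced to 100,000 (an assumption, so that it runs quickly), the namespaces are passed explicitly instead of relying on the calling frame, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000  # assumption: smaller than the 1e6 rows used above
some_df = pd.DataFrame({"a": np.random.uniform(size=n_rows),
                        "b": np.random.uniform(size=n_rows)})

namespace = {"some_df": some_df}
long_expr = ("some_df.a**some_df.b - some_df.a*some_df.b"
             " > some_df.b**some_df.a - some_df.b/some_df.a")

candidates = {
    "DataFrame.eval": lambda: some_df.eval("a**b - a*b > b**a - b/a"),
    "python eval": lambda: eval(long_expr, namespace),
    "pd.eval": lambda: pd.eval(long_expr, local_dict=namespace),
}

# Best of 3 repeats, 5 calls each, reported as seconds per call.
timings = {name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
           for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name:>15}: {seconds * 1e3:.2f} ms per call")
```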

So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster?

Are there any other guidelines for when to use which eval()? (I'm aware that pandas.eval() does not support the complete set of operations.)

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

Solution

So is the benefit of DataFrame.eval() merely in simplifying the input, or can we identify circumstances where using this method is actually faster?

The source code for DataFrame.eval() shows that it actually just creates arguments to pass to pd.eval():

def eval(self, expr, inplace=None, **kwargs):

    inplace = validate_bool_kwarg(inplace, 'inplace')
    resolvers = kwargs.pop('resolvers', None)
    kwargs['level'] = kwargs.pop('level', 0) + 1
    if resolvers is None:
        index_resolvers = self._get_index_resolvers()
        resolvers = dict(self.iteritems()), index_resolvers
    if 'target' not in kwargs:
        kwargs['target'] = self
    kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
    return _eval(expr, inplace=inplace, **kwargs)

Where _eval() is just an alias for pd.eval() which is imported at the beginning of the module:

from pandas.core.computation.eval import eval as _eval

So anything that you can do with df.eval(), you could do with pd.eval() + a few extra lines to set things up. As things currently stand, df.eval() is never strictly faster than pd.eval(). But that doesn't mean there can't be cases where df.eval() is just as good as pd.eval(), yet more convenient to write.
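
As a rough illustration of that equivalence, the sketch below passes the columns to `pd.eval()` through its `resolvers` argument, loosely mirroring the `dict(self.iteritems())` line in the source above. It leaves out the index resolvers and the `target` that `df.eval()` also sets up, so it is an approximation rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.uniform(size=1000),
                   "b": np.random.uniform(size=1000)})
expr = "a**b - a*b > b**a - b/a"

# The convenient form: column names are resolved automatically.
via_method = df.eval(expr)

# The manual form: hand pd.eval() a mapping of column name -> Series.
column_resolver = {name: df[name] for name in df.columns}
via_function = pd.eval(expr, resolvers=[column_resolver])

print(via_method.equals(via_function))
```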

However, after playing around with the %prun magic it appears that the call by df.eval() to df._get_index_resolvers() adds on a fair bit of time to the df.eval() method. Ultimately, _get_index_resolvers() ends up calling the .copy() method of numpy.ndarray, which is what ends up slowing things down. Meanwhile, pd.eval() does call numpy.ndarray.copy() at some point, but it takes a negligible amount of time (on my machine at least).
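
Outside IPython, a comparable profile can be collected with the standard `cProfile` and `pstats` modules. This is a sketch only: which internal functions dominate the report depends on the pandas version, so `_get_index_resolvers()` may or may not stand out on a given install.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.uniform(size=100_000),
                   "b": np.random.uniform(size=100_000)})

# Profile repeated df.eval() calls, roughly what %prun does for one statement.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    df.eval("a**b - a*b > b**a - b/a")
profiler.disable()

# Collect the 15 most expensive entries by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```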

Long story short, it appears that df.eval() tends to be slower than pd.eval() because under the hood it's just pd.eval() with extra steps, and these steps are non-trivial.
