How to create lazy_evaluated dataframe columns in Pandas

Question

I often have a big dataframe df that holds the basic data, and I need to create many more columns to hold derived data calculated from the basic columns.

I can do that in Pandas like this:

df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
....
df['derivative_coln'] = func(list_of_basic_cols)

and so on. Pandas calculates and allocates memory for all of the derived columns right away.

What I want now is a lazy evaluation mechanism that postpones the calculation and memory allocation of derived columns until the moment they are actually needed, with the lazy columns defined somewhat like this:

df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])

That would save time and memory, like a Python 'yield' generator: issuing df['derivative_col2'] would trigger only that specific calculation and memory allocation.

So how can I do lazy_eval() in Pandas? Any tips, thoughts, or references are welcome.

Recommended answer

Starting in 0.13 (releasing very soon), you can do something like this. It uses a generator to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13.

In [18]: from numpy.random import randn; from pandas import DataFrame

In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])

In [20]: df
Out[20]: 
          a         b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2  0.202116  0.094344
3 -1.225579 -0.447545
4  1.739508 -0.400829

In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]

Create a generator that evaluates each formula using eval, assigns the result to a new column, and then yields the updated frame.

In [22]: def lazy(x, formulas):
   ....:     for col, f in formulas:
   ....:         x[col] = x.eval(f)
   ....:         yield x
   ....:         

In action:

In [23]: gen = lazy(df,formulas)

In [24]: next(gen)
Out[24]: 
          a         b         c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2  0.202116  0.094344  0.296459
3 -1.225579 -0.447545 -1.673123
4  1.739508 -0.400829  1.338679

In [25]: next(gen)
Out[25]: 
          a         b         c         d
0 -1.949107 -0.763762 -2.712869  5.287670
1 -0.382173 -0.970349 -1.352522  0.516897
2  0.202116  0.094344  0.296459  0.059919
3 -1.225579 -0.447545 -1.673123  2.050545
4  1.739508 -0.400829  1.338679  2.328644
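
For reference, the session above can be written as a standalone script. This is only a sketch under a few assumptions: it targets Python 3 (so next(gen) replaces gen.next()), and the seeded random generator is my own choice for reproducibility, so the values will differ from the output shown above.

```python
import numpy as np
import pandas as pd


def lazy(frame, formulas):
    """Evaluate one formula per step, assign it as a column, then yield the frame."""
    for col, expr in formulas:
        frame[col] = frame.eval(expr)
        yield frame


# Seeded data is an assumption made here for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), columns=["a", "b"])
formulas = [("c", "a+b"), ("d", "a*c")]

gen = lazy(df, formulas)

# Each next() call runs exactly one formula; nothing is computed before that.
step1 = next(gen)
cols_after_first = list(step1.columns)

step2 = next(gen)
cols_after_second = list(step2.columns)
```

Note that the generator yields the same frame object each time, mutated in place, so step1 and step2 both refer to df.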

So the order of evaluation is determined by the user (it is not on-demand). In theory numba is going to support this, so pandas could possibly support it as a backend for eval (which currently uses numexpr for immediate evaluation).

My 2c:

Lazy evaluation is nice, but it can easily be achieved using Python's own continuation/generator features. So building it into pandas, while possible, is quite tricky, and it would need a really good use case to be generally useful.
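
As one illustration of doing this in plain Python, here is a minimal sketch of on-demand columns. The LazyColumns class and its attributes are hypothetical names invented for this example and are not part of pandas. It assumes each derived column is described by a function, and it computes and caches a column the first time it is requested.

```python
import pandas as pd


class LazyColumns:
    """Hypothetical wrapper (not a pandas API): derived columns are
    computed on first access and then stored in the underlying frame."""

    def __init__(self, frame, formulas):
        self.frame = frame
        self.formulas = dict(formulas)  # column name -> function(wrapper) -> Series
        self.computed = []  # order in which derived columns were materialized

    def __getitem__(self, col):
        if col not in self.frame.columns:
            # Compute now; a formula may itself pull in other lazy columns.
            self.frame[col] = self.formulas[col](self)
            self.computed.append(col)
        return self.frame[col]


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
lazy_df = LazyColumns(df, {
    "c": lambda x: x["a"] + x["b"],
    "d": lambda x: x["a"] * x["c"],
    "e": lambda x: x["b"] - x["a"],
})

# Asking for d first materializes c (its dependency), then d; e is never computed.
d = lazy_df["d"]
```

The trade-off in this sketch is that laziness lives in the wrapper, so code that accesses df directly will not trigger any computation.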
