How can I vectorize a function that uses lagged values of its own output?


Question



I'm sorry for the poor phrasing of the question, but it was the best I could do. I know exactly what I want, but not exactly how to ask for it.

Here is the logic demonstrated by an example:

Two conditions that take on the values 1 or 0 trigger a signal that also takes on the values 1 or 0. Condition A triggers the signal (If A = 1 then signal = 1, else signal = 0) no matter what. Condition B does NOT trigger the signal, but the signal stays triggered if condition B stays equal to 1 after the signal previously has been triggered by condition A. The signal goes back to 0 only after both A and B have gone back to 0.

1. Input:

2. Desired output (signal_d) and confirmation that a for loop can solve it (signal_l):

3. My attempt using numpy.where():

4. Reproducible snippet:

    # Settings
    import datetime

    import numpy as np
    import pandas as pd

    # Data frame with the inputs and the desired output in column signal_d
    df = pd.DataFrame({'condition_A': list('00001100000110'),
                       'condition_B': list('01110011111000'),
                       'signal_d': list('00001111111110')})

    colnames = list(df)
    df[colnames] = df[colnames].apply(pd.to_numeric)
    datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
    df['dates'] = datelist
    df = df.set_index(['dates'])

    # Solution using a for loop with nested ifs in column signal_l
    df['signal_l'] = df['condition_A'].copy(deep=True)
    # Integer positions of the columns, for positional access with df.iloc
    col_a = df.columns.get_loc('condition_A')
    col_b = df.columns.get_loc('condition_B')
    col_l = df.columns.get_loc('signal_l')
    for i in range(len(df)):
        if df.iloc[i, col_a] == 1:
            df.iloc[i, col_l] = 1
        else:
            # Signal previously triggered by condition_A
            # AND kept "alive" by condition_B.
            # The first row has no previous value to keep alive.
            if i > 0 and (df.iloc[i - 1, col_l] & df.iloc[i, col_b]) == 1:
                df.iloc[i, col_l] = 1
            else:
                df.iloc[i, col_l] = 0

    # My attempt with np.where in column Signal_v1
    df['Signal_v1'] = df['condition_A'].copy()
    df['Signal_v1'] = np.where(df.condition_A == 1, 1, np.where((df.shift(1).Signal_v1 == 1) & (df.condition_B == 1), 1, 0))

    print(df)

This is pretty straightforward using a for loop with lagged values and nested if statements, but I can't figure it out using vectorized functions like numpy.where(). And I know this would be much faster for bigger data frames.

Thank you for any suggestions!

Solution

I don't think there is a way to vectorize this operation that will be significantly faster than a Python loop. (At least, not if you want to stick with just Python, pandas and numpy.)
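To see why a single `np.where` pass falls short here, note that each output value depends on the previous output value, whereas a shifted column only exposes the values that existed before the call. The following minimal sketch uses plain numpy and writes the shift out by hand, which is an assumption about what the attempt in the question effectively computes. It suggests that the one-pass version keeps the signal alive for only one extra row:

```python
import numpy as np

# Inputs and desired output copied from the question
a = np.array([int(c) for c in '00001100000110'])        # condition_A
b = np.array([int(c) for c in '01110011111000'])        # condition_B
desired = np.array([int(c) for c in '00001111111110'])  # signal_d

# One-pass attempt: the "previous signal" is just condition_A shifted down
# by one row, because that is all the column holds when np.where runs.
prev = np.concatenate(([0], a[:-1]))
one_pass = np.where(a == 1, 1, np.where((prev == 1) & (b == 1), 1, 0))

print(one_pass)
print(desired)
print((one_pass == desired).all())  # False: the rows after the first carried row differ
```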

However, you can improve the performance of this operation by simplifying your code. Your implementation uses if statements and a lot of DataFrame indexing. These are relatively costly operations.

Here's a modification of your script that includes two functions: add_signal_l(df) and add_lagged(df). The first is your code, just wrapped up in a function. The second uses a simpler function to achieve the same result--still a Python loop, but it uses numpy arrays and bitwise operators.

    import datetime

    import numpy as np
    import pandas as pd

    #-----------------------------------------------------------------------
    # Create the test DataFrame

    # Data frame with the inputs and the desired output in column signal_d
    df = pd.DataFrame({'condition_A': list('00001100000110'),
                       'condition_B': list('01110011111000'),
                       'signal_d': list('00001111111110')})

    colnames = list(df)
    df[colnames] = df[colnames].apply(pd.to_numeric)
    datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
    df['dates'] = datelist
    df = df.set_index(['dates'])
    #-----------------------------------------------------------------------

    def add_signal_l(df):
        # Solution using a for loop with nested ifs in column signal_l
        df['signal_l'] = df['condition_A'].copy(deep=True)
        # Integer positions of the columns, for positional access with df.iloc
        col_a = df.columns.get_loc('condition_A')
        col_b = df.columns.get_loc('condition_B')
        col_l = df.columns.get_loc('signal_l')
        for i in range(len(df)):
            if df.iloc[i, col_a] == 1:
                df.iloc[i, col_l] = 1
            else:
                # Signal previously triggered by condition_A
                # AND kept "alive" by condition_B.
                # The first row has no previous value to keep alive.
                if i > 0 and (df.iloc[i - 1, col_l] & df.iloc[i, col_b]) == 1:
                    df.iloc[i, col_l] = 1
                else:
                    df.iloc[i, col_l] = 0

    def compute_lagged_signal(a, b):
        # x[i] is 1 if a[i] is 1, or if the previous output was 1 and b[i] is 1
        x = np.empty_like(a)
        x[0] = a[0]
        for i in range(1, len(a)):
            x[i] = a[i] | (x[i-1] & b[i])
        return x

    def add_lagged(df):
        df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
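As a quick sanity check, the recurrence can be run directly on the arrays from the question and compared against `signal_d`. The sketch below repeats `compute_lagged_signal` so that it is self-contained. The variable names `a`, `b`, `expected` and `result` are only for illustration:

```python
import numpy as np

def compute_lagged_signal(a, b):
    # x[i] is 1 if a[i] is 1, or if the previous output was 1 and b[i] is 1
    x = np.empty_like(a)
    x[0] = a[0]
    for i in range(1, len(a)):
        x[i] = a[i] | (x[i - 1] & b[i])
    return x

# Integer arrays are needed so that the bitwise operators | and & apply
a = np.array([int(c) for c in '00001100000110'])         # condition_A
b = np.array([int(c) for c in '01110011111000'])         # condition_B
expected = np.array([int(c) for c in '00001111111110'])  # signal_d

result = compute_lagged_signal(a, b)
print(result)
print(np.array_equal(result, expected))
```

Working on the arrays returned by `.values` means that the loop body touches only numpy scalars, which is where the saving over per-element DataFrame indexing comes from.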

Here's a comparison of the timing of the two functions, run in an IPython session:

    In [85]: df
    Out[85]:
                condition_A  condition_B  signal_d
    dates
    2017-06-09            0            0         0
    2017-06-10            0            1         0
    2017-06-11            0            1         0
    2017-06-12            0            1         0
    2017-06-13            1            0         1
    2017-06-14            1            0         1
    2017-06-15            0            1         1
    2017-06-16            0            1         1
    2017-06-17            0            1         1
    2017-06-18            0            1         1
    2017-06-19            0            1         1
    2017-06-20            1            0         1
    2017-06-21            1            0         1
    2017-06-22            0            0         0

    In [86]: %timeit add_signal_l(df)
    8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [87]: %timeit add_lagged(df)
    137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, add_lagged(df) is much faster.
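For this particular rule there may also be a way to avoid the explicit loop. It is offered here as an untimed sketch, not as part of the measurements above. When A is 1 the output is 1, when A and B are both 0 the output is 0, and otherwise the previous output is carried forward. The result therefore equals condition_A read at the most recent row where the output is fully determined. The helper name `lagged_signal_ffill` is hypothetical, the inputs are assumed to be non-empty arrays of 0s and 1s, and whether this beats `add_lagged` would depend on the data size and would need to be measured:

```python
import numpy as np

def lagged_signal_ffill(a, b):
    """Loop-free variant of x[i] = a[i] | (x[i-1] & b[i]) with x[0] = a[0].

    Assumes non-empty arrays containing only 0 and 1.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    # Rows whose output does not depend on the previous output:
    # a == 1 forces 1, and a == 0 with b == 0 forces 0, so x[i] == a[i] there.
    known = (a == 1) | (b == 0)
    known[0] = True  # the first row has no predecessor, so x[0] = a[0]
    # For every row, find the index of the latest "known" row at or before it.
    idx = np.where(known, np.arange(len(a)), 0)
    idx = np.maximum.accumulate(idx)
    # All other rows simply carry the value from that row forward.
    return a[idx]

a = np.array([int(c) for c in '00001100000110'])  # condition_A
b = np.array([int(c) for c in '01110011111000'])  # condition_B
print(lagged_signal_ffill(a, b))
```

The same idea could be expressed in pandas by marking the undetermined rows as missing and forward-filling them. The index-based form is used here only to keep the sketch dependent on numpy alone.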
