How to vectorize a function that uses both row and column elements of a dataframe

Problem description

I have two inputs in a dataframe, and I need to create an output that depends on both inputs (same row, different columns), but also on its previous value (same column, previous row).

This dataframe command will create an example of what I need:

import pandas as pd

df = pd.DataFrame([[0,0,0], [0,1,0], [0,0,0], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0], [0,1,0], [1,1,1], [1,1,1], [0,1,1], [0,1,1], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0]], columns=['input_1', 'input_2', 'output'])

The rules are simple:

  • If input_1 is 1, output is 1 (input_1 works as a trigger)
  • Once triggered, output stays at 1 for as long as input_2 is also 1 (input_2 works somewhat like a memory function)
  • In all other cases, output is 0

The rows go in sequence as they happen in time, I mean, row 0 output influences row 1 output, row 1 output influences row 2 output, and so on. So output depends on input_1, input_2, but also on its own previous value.

I could code it by looping through the dataframe, computing and assigning values with iloc, but that is painfully slow. I need to run this over many thousands of rows for tens of thousands of dataframes, so I am looking for the most efficient way to do it (preferably vectorized). It can use numpy or any other library or method that you know.

I searched and found some questions about vectorization and row-looping, but I still don't see how to apply those techniques here. Example questions: "How to iterate over rows in a DataFrame in Pandas?" and "What is the most efficient way to loop through dataframes with pandas?"

I appreciate your help

Solution

As you explained in the discussion above, we have just two inputs, loaded into a pandas dataframe:

import pandas as pd

df = pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

We have to create the outputs using the following rules:

#1 if input_1 is one, the output is one
#2 if both inputs are zero, the output is zero
#3 if input_1 is zero and input_2 is one, the output holds the previous value
#4 the initial output value is zero
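
As a reference point, the four rules can be written as a small state machine over plain NumPy arrays. This is only a sketch: the function name `latch_output` is made up for illustration, and it is still a Python loop, so it is meant for checking faster versions rather than as the final answer.

```python
import numpy as np

def latch_output(input_1, input_2):
    """Apply rules #1-#4 row by row (hypothetical helper, not part of pandas)."""
    a = np.asarray(input_1)
    b = np.asarray(input_2)
    out = np.zeros(len(a), dtype=int)
    prev = 0                      # rule #4: the initial output value is zero
    for i in range(len(a)):
        if a[i] == 1:
            prev = 1              # rule #1: input_1 == 1 sets the output
        elif b[i] == 0:
            prev = 0              # rule #2: both inputs zero clears the output
        # rule #3: input_1 == 0 and input_2 == 1 keeps prev unchanged
        out[i] = prev
    return out

# The 18 example rows from the question
rows = [[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1],
        [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]]
in1 = [r[0] for r in rows]
in2 = [r[1] for r in rows]
result = latch_output(in1, in2)
print(result.tolist())
```

Looping over NumPy arrays instead of `df.iterrows()` avoids building a Series object for every row, which is usually where most of the time goes.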

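To generate the outputs we can:

  1. duplicate input_1 to the output
  2. update the output with the previous value if input_1 is zero and input_2 is one

Because of rule #4, we don't need to update the first output: if the first row is a "hold" row, the copied input_1 is already zero, which equals the initial value.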

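# Step 1: start from input_1 (rules #1 and #2 both give output == input_1)
df['output'] = df['input_1']

# Step 2: on a hold row (input_1 == 0, input_2 == 1) copy the previous output.
# Assumes the default 0..n-1 index; .loc avoids chained assignment.
for idx, row in df.iterrows():
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']

print(df)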

The output is:

>>> print(df)
    input_1  input_2  output
0         0        0       0
1         0        1       0
2         0        0       0
3         1        1       1
4         0        1       1
5         0        1       1
6         0        0       0
7         0        1       0
8         0        1       0
9         1        1       1
10        1        1       1
11        0        1       1
12        0        1       1
13        1        1       1
14        0        1       1
15        0        1       1
16        0        0       0
17        0        1       0

UPDATE1

A faster way is to use a modified version of the formula proposed by @Andrej:

import numpy as np

df['output_2'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)

Without the modification, his formula produces the wrong output for the input combination [1, 0]: it holds the previous output instead of setting it to 1.
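
To see why the formula works: `input_1 + input_2 * 2` gives each input pair its own code (0 for [0, 0], 1 for [1, 0], 2 for [0, 1], 3 for [1, 1]). Only code 2 is a "hold" row, so it is blanked out and forward-filled from the last decided row. The sketch below spells the same idea out with a mask. It also adds `fillna(0)`, which is my own addition and not part of the original formula: if the very first row is a hold row, `ffill()` leaves a NaN there and the cast to int would fail, so the initial value from rule #4 is filled in instead. The small dataframe is made up to exercise that case.

```python
import pandas as pd

# Made-up example that starts with a hold row ([0, 1])
df = pd.DataFrame([[0, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1]],
                  columns=['input_1', 'input_2'])

# Hold rows: input_1 == 0 and input_2 == 1 (code 2 in the formula above)
hold = (df['input_1'] == 0) & (df['input_2'] == 1)

# On every other row the output equals input_1; hold rows become NaN
decided = df['input_1'].astype(float).where(~hold)

# Forward-fill the holds, then apply rule #4 to any leading hold rows
df['output'] = decided.ffill().fillna(0).astype(int)

print(df['output'].tolist())  # [0, 1, 1, 0, 1, 1]
```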

UPDATE2

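This is just to compare the results. Here output is the loop version, output_1 is the modified formula, and output_2 is the original, unmodified formula. The second row is now [1, 0] to show the difference.

import numpy as np
import pandas as pd

df = pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

# Loop version (assumes the default 0..n-1 index)
df['output'] = df['input_1']
for idx, row in df.iterrows():
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']

# Modified formula: only [0, 1] (code 2) holds the previous value
df['output_1'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
# Original formula: [1, 0] and [0, 1] both sum to 1, so [1, 0] wrongly holds
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)

The result is: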

>>> print(df)
    input_1  input_2  output  output_1  output_2
0         0        0       0         0         0
1         1        0       1         1         0
2         0        1       1         1         0
3         1        1       1         1         1
4         0        1       1         1         1
5         0        1       1         1         1
6         0        0       0         0         0
7         0        1       0         0         0
8         0        1       0         0         0
9         1        1       1         1         1
10        1        1       1         1         1
11        0        1       1         1         1
12        0        1       1         1         1
13        1        1       1         1         1
14        0        1       1         1         1
15        0        1       1         1         1
16        0        0       0         0         0
17        0        1       0         0         0
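
If pandas overhead matters when processing many dataframes, the same forward-fill idea can be sketched in NumPy alone. This is an alternative technique, not the formula from the answer: for each row it finds the index of the most recent "decided" row (any row that is not [0, 1]) with `np.maximum.accumulate` and reads input_1 from there. The function name `latch_numpy` is hypothetical, and I have not benchmarked it against the pandas one-liner.

```python
import numpy as np

def latch_numpy(input_1, input_2):
    """Vectorized rules #1-#4 using a running 'last decided row' index (sketch)."""
    a = np.asarray(input_1)
    b = np.asarray(input_2)
    # A row decides the output unless it is a hold row (input_1 == 0, input_2 == 1)
    decided = (a == 1) | (b == 0)
    # Index of each decided row, -1 for hold rows
    idx = np.where(decided, np.arange(len(a)), -1)
    # Running maximum = index of the most recent decided row (or -1 if none yet)
    last = np.maximum.accumulate(idx)
    # Append the initial value 0 so that index -1 reads rule #4's starting state
    padded = np.append(a, 0)
    return padded[last]

# Inputs from the comparison table above
rows = [[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1],
        [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]]
in1 = [r[0] for r in rows]
in2 = [r[1] for r in rows]
fast = latch_numpy(in1, in2)
print(fast.tolist())
```

With a dataframe this would be called as `latch_numpy(df['input_1'].to_numpy(), df['input_2'].to_numpy())`; the result should match the output column in the table above.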
