How to vectorize (make use of pandas/numpy) instead of using a nested for loop


Question

I want to use pandas (or numpy) efficiently in place of a nested for loop with an if statement to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames:

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I want to loop through each row of each DataFrame and multiply the vals if a particular condition is met. This code does what I want:

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

However, this is clearly very inefficient, and in reality my DataFrames can have millions of entries, which makes it unusable. I am also not making use of the efficient vectorized implementations in pandas or numpy. Does anyone have any ideas on how to vectorize this nested loop efficiently?

I feel like this code is something akin to matrix multiplication, so could progress be made using outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.

Recommended answer

You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and does not need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

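@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # 'total' avoids shadowing the built-in sum(); Numba handles it as a
    # reduction across the parallel prange iterations
    total = 0.
    for i in nb.prange(len(df1_in)):  # outer loop runs in parallel
        for j in range(len(df2_in)):
            if df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]:
                total += df1_vals[i] * df2_vals[j]
    return total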

Test

dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

#first call has some compilation overhead
res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)

t1=time.time()
for i in range(1000):
  res=your_function(df1['in'].values,df1['out'].values,df1['vals'].values,df2['in'].values,df2['out'].values,df2['vals'].values)
  # res_2 = g(df1, df2)  # presumably the vectorized solution used for comparison (not shown here)

print(time.time()-t1)

Timings

Vectorized solution (@AGN Gazer): 9.15 ms
Parallelized Numba version: 0.7 ms
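
The vectorized solution used in the comparison is not reproduced on this page. For context, here is a minimal sketch of what a NumPy broadcasting approach could look like; it is an assumption for illustration and not necessarily @AGN Gazer's code. The function name `overlap_sum` and the `chunk_size` parameter are hypothetical. The boolean mask is the temporary array that the Numba version avoids, and processing df1 in chunks is one way to keep that mask from growing to len(df1) x len(df2) elements.

```python
import numpy as np
import pandas as pd


def overlap_sum(df1, df2, chunk_size=1024):
    """Sum df1.vals[i] * df2.vals[j] over all pairs (i, j) that satisfy
    df1.in[i] <= df2.out[j] and df1.out[i] >= df2.in[j]."""
    in2 = df2['in'].to_numpy()
    out2 = df2['out'].to_numpy()
    vals2 = df2['vals'].to_numpy(dtype=float)

    total = 0.0
    # Process df1 in chunks so the temporary mask has at most
    # chunk_size * len(df2) elements (chunk_size is a hypothetical knob).
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        in1 = chunk['in'].to_numpy()[:, None]    # column vector for broadcasting
        out1 = chunk['out'].to_numpy()[:, None]
        vals1 = chunk['vals'].to_numpy(dtype=float)

        # mask[i, j] is True where the condition holds for the pair (i, j)
        mask = (in1 <= out2) & (out1 >= in2)
        # vals1 @ mask @ vals2 adds up vals1[i] * vals2[j] over the True entries
        total += vals1 @ (mask @ vals2)
    return total


# Toy data from the question
df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4],
                    'out': [0.5, 2, 4, 5]})

result = overlap_sum(df1, df2)
print(result)  # 350000.0
```

On the toy data this matches the nested loop from the question: four pairs satisfy the condition and their products add up to 350000. The sketch still does work proportional to len(df1) * len(df2), so it bounds memory use only, not the amount of computation.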
