How to vectorize (make use of pandas/numpy) instead of using a nested for loop
Question
I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames:
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row of each DataFrame and multiply the vals if a particular condition is met. This code does what I want:
ans = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, this is clearly very inefficient, and in reality my DataFrames can have millions of entries, which makes it unusable. I am also not making use of the efficient vectorized implementations in pandas or numpy. Does anyone have any ideas on how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication, so could progress be made using outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
Recommended answer
You can also use a compiler like Numba for this job. In the timings below it outperforms the vectorized solution, and it does not need a temporary array.
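For context, the temporary array mentioned here is the kind a broadcast-based vectorized solution has to build. The following is only a minimal sketch of such a solution, not the code from the other answer; the function name `overlap_sum_broadcast` is hypothetical. It materialises a boolean mask and a product matrix, each of shape `(len(df1), len(df2))`, which is what becomes expensive for millions of rows.

```python
import numpy as np
import pandas as pd

def overlap_sum_broadcast(df1, df2):
    """Sum df1.vals[i] * df2.vals[j] over all pairs (i, j) meeting the condition.

    Hypothetical helper for illustration; builds two (len(df1), len(df2)) temporaries.
    """
    # Column vectors for df1, row vectors for df2, so comparisons broadcast to 2-D
    in1 = df1['in'].to_numpy()[:, None]
    out1 = df1['out'].to_numpy()[:, None]
    in2 = df2['in'].to_numpy()[None, :]
    out2 = df2['out'].to_numpy()[None, :]

    # Same condition as the nested loop in the question, evaluated for every pair
    mask = (in1 <= out2) & (out1 >= in2)

    # All pairwise products, then keep only the pairs where the condition holds
    products = np.outer(df1['vals'].to_numpy(), df2['vals'].to_numpy())
    return products[mask].sum()

# Toy data from the question
df1 = pd.DataFrame({'vals': [100, 200], 'in': [0, 1], 'out': [1, 3]})
df2 = pd.DataFrame({'vals': [500, 800, 300, 200],
                    'in': [0.1, 0.5, 2, 4], 'out': [0.5, 2, 4, 5]})
toy_result = overlap_sum_broadcast(df1, df2)
```

On the toy data this gives the same total as the nested loop in the question. The use of `np.outer` follows the asker's intuition; the if condition becomes the boolean mask.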
Example
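import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def your_function(df1_in, df1_out, df1_vals, df2_in, df2_out, df2_vals):
    # Scalar accumulator; named total to avoid shadowing the built-in sum
    total = 0.
    # prange parallelizes the outer loop; += on the scalar is handled as a reduction
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            # Same condition as the nested loop in the question
            if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
                total += df1_vals[i]*df2_vals[j]
    return total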
Test
dict1 = {'vals': np.random.randint(1,100,1000), 'in': np.random.randint(1,10,1000), 'out': np.random.randint(1,10,1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1,100,1500), 'in': 5*np.random.random(1500), 'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
# first call has some compilation overhead, so it is excluded from the timing
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values, df2['in'].values, df2['out'].values, df2['vals'].values)
t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values, df2['in'].values, df2['out'].values, df2['vals'].values)
    #res_2=g(df1, df2)
# total seconds for 1000 calls, which equals milliseconds per call
print(time.time()-t1)
Timings
vectorized solution @AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms
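If Numba is not an option, the memory cost of a pure NumPy approach can be bounded by processing df1 in blocks, so that only a `(block_size, len(df2))` intermediate exists at any time. This is a sketch under that assumption, not something from the original answers; the name `blocked_overlap_sum` and the `block_size` parameter are hypothetical, and no timing is claimed for it.

```python
import numpy as np

def blocked_overlap_sum(in1, out1, vals1, in2, out2, vals2, block_size=1024):
    """Conditional sum of pairwise products, computed block by block over df1.

    Hypothetical helper for illustration; inputs are 1-D arrays (e.g. df['in'].to_numpy()).
    """
    in1, out1, vals1, in2, out2, vals2 = (
        np.asarray(a, dtype=np.float64)
        for a in (in1, out1, vals1, in2, out2, vals2)
    )
    total = 0.0
    for start in range(0, len(in1), block_size):
        stop = start + block_size
        # Condition from the question for this block of df1 rows against all df2 rows
        mask = (in1[start:stop, None] <= out2[None, :]) & \
               (out1[start:stop, None] >= in2[None, :])
        # For each df1 row in the block: sum of the df2 vals that satisfy the condition
        row_sums = (mask * vals2).sum(axis=1)
        # Weight each row sum by the corresponding df1 val and accumulate
        total += np.dot(vals1[start:stop], row_sums)
    return total

# Toy data from the question; block_size=1 forces more than one block
toy_total = blocked_overlap_sum(
    [0, 1], [1, 3], [100, 200],
    [0.1, 0.5, 2, 4], [0.5, 2, 4, 5], [500, 800, 300, 200],
    block_size=1,
)
```

The work is still proportional to `len(df1) * len(df2)`; only the peak memory changes, so the block size is a trade-off between memory use and Python loop overhead.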