Vectorize complex slicing with a pandas DataFrame
Problem description

I'd like to vectorize this piece of code for speed. The goal is to compute a function, in this case a standard deviation, over the slice of a series delimited by each pair of dates, where the start dates and end dates are held in two separate arrays.

    import pandas as pd
    import numpy as np

    asd_1 = pd.Series(0.01 * np.random.randn(252), index=pd.date_range('2011-1-1', periods=252))
    index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
    index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
    index_tot = list(zip(index_1,index_2))
    aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in index_tot], index=index_1)

The solution above works, but it runs through a loop. I'd rather vectorize it with numpy/pandas, which should be much faster. Initially I thought about using something like:

    df_aux = pd.concat([asd_1 for _ in range(len(index_1))], axis=1)
    results = df_aux.apply(lambda x: np.nanstd(x.loc[i,j]), axis = 0)

but here I fail to combine the vectors into a single operation.
Any and all advice is welcome.
P.S.: Below is an image for explanatory purposes.
Solution

Vectorized standard deviation across ranges in an array

    def get_ranges_arr(starts,ends):
        # Taken from http://stackoverflow.com/a/37626057/3293881
        counts = ends - starts
        counts_csum = counts.cumsum()
        id_arr = np.ones(counts_csum[-1],dtype=int)
        id_arr[0] = starts[0]
        id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
        return id_arr.cumsum()

    def ranged_std(arr,starts,ends):
        # Get all indices and the IDs corresponding to same groups
        idx = get_ranges_arr(starts,ends)
        id_arr = np.repeat(np.arange(starts.size),ends-starts)

        # Extract relevant data
        slice_arr = arr[idx]

        # Simulate standard deviation implementation for a number of groups
        # using id_arr as the basis to perform various mathematical operations
        # within each group. Since, std. deviation performs sum/mean reduction,
        # we can simply use np.bincount for an efficient implementation.
        # Std. deviation formula used :
        # https://github.com/numpy/numpy/blob/v1.11.0/numpy/core/fromnumeric.py#L2939
        grp_counts = np.bincount(id_arr)
        mean_vals = np.bincount(id_arr,slice_arr)/grp_counts
        abs_vals = np.abs(slice_arr - mean_vals[id_arr])**2
        return np.sqrt(np.bincount(id_arr,abs_vals)/grp_counts)

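To see why np.bincount can serve as a grouped reduction here, the following minimal sketch may help. The toy values and the names values and group_ids are made up for illustration and are not part of the answer. It computes per-group counts, means and population standard deviations from a flat array plus a parallel array of group IDs, then compares the result with np.std applied to each group separately.

```python
import numpy as np

# Toy data (illustrative values only): three groups laid out back to back.
values = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 5.0, 5.0, 5.0, 9.0])
group_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])

# Without weights, bincount counts how many members each group has.
counts = np.bincount(group_ids)

# With weights, bincount sums the weights per group, so sum / count is the mean.
means = np.bincount(group_ids, weights=values) / counts

# Broadcast each group's mean back to its members, then reduce the squared
# deviations per group the same way (population std, ddof=0, like np.std).
sq_dev = (values - means[group_ids]) ** 2
stds = np.sqrt(np.bincount(group_ids, weights=sq_dev) / counts)

# Reference result: select each group explicitly and call np.std on it.
expected = np.array([values[group_ids == g].std() for g in range(counts.size)])
print(np.allclose(stds, expected))
```

In ranged_std the role of group_ids is played by id_arr, and the flat values come from indexing arr with the concatenated range indices.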
Sample run (verify against a loopy version)

    In [173]: arr = np.random.randint(0,9,(20))
    In [174]: starts = np.array([2,6,11])
    In [175]: ends = np.array([8,9,15])

    In [176]: [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    Out[176]: [1.9720265943665387, 0.81649658092772603, 0.82915619758884995]

    In [177]: ranged_std(arr,starts,ends)
    Out[177]: array([ 1.97202659,  0.81649658,  0.8291562 ])

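The same check can be repeated on random input outside an interactive session. The sketch below restates the two functions so that it runs on its own; the seed, array sizes and range lengths are arbitrary choices for illustration. Range lengths are kept at 1 or more on purpose: as far as I can tell, the cumulative-sum trick in get_ranges_arr presupposes non-empty ranges, because a zero-length range makes two of the "jump" positions coincide.

```python
import numpy as np

def get_ranges_arr(starts, ends):
    # Concatenate the index ranges [starts[k], ends[k]) into one flat array.
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_std(arr, starts, ends):
    # Population standard deviation of arr[starts[k]:ends[k]] for every k.
    idx = get_ranges_arr(starts, ends)
    id_arr = np.repeat(np.arange(starts.size), ends - starts)
    slice_arr = arr[idx]
    grp_counts = np.bincount(id_arr)
    mean_vals = np.bincount(id_arr, slice_arr) / grp_counts
    sq_dev = (slice_arr - mean_vals[id_arr]) ** 2
    return np.sqrt(np.bincount(id_arr, sq_dev) / grp_counts)

# Arbitrary test setup: 50 possibly overlapping, unsorted ranges.
rng = np.random.default_rng(0)
arr = rng.integers(0, 9, size=200).astype(float)
starts = rng.integers(0, 150, size=50)
ends = starts + rng.integers(1, 40, size=50)  # every range has length >= 1

loop_result = np.array([np.std(arr[i:j]) for i, j in zip(starts, ends)])
vec_result = ranged_std(arr, starts, ends)
print(np.allclose(loop_result, vec_result))
```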
Runtime test
Case #1: Very small number of ranges (3)

    In [21]: arr = np.random.randint(0,9,(20))
    In [22]: starts = np.array([2,6,11])
    In [23]: ends = np.array([8,9,15])

    In [24]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    10000 loops, best of 3: 146 µs per loop

    In [25]: %timeit ranged_std(arr,starts,ends)
    10000 loops, best of 3: 45 µs per loop

Case #2: Decent number of ranges (1000)

    In [32]: arr = np.random.randint(0,9,(1010))
    In [33]: starts = np.random.randint(0,9,(1000))
    In [34]: ends = starts + np.random.randint(0,9,(1000))

    In [35]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    10 loops, best of 3: 47.5 ms per loop

    In [36]: %timeit ranged_std(arr,starts,ends)
    1000 loops, best of 3: 217 µs per loop

Case #3: Large number of ranges (10000)

    In [60]: arr = np.random.randint(0,9,(1010))
    In [61]: arr = np.random.randint(0,9,(10010))
    In [62]: starts = np.random.randint(0,9,(10000))
    In [63]: ends = starts + np.random.randint(0,9,(10000))

    In [64]: %timeit [np.std(arr[i:j]) for i,j in zip(starts,ends)]
    1 loops, best of 3: 474 ms per loop

    In [65]: %timeit ranged_std(arr,starts,ends)
    100 loops, best of 3: 2.17 ms per loop

Really amazing speedups of 200x+ for the larger cases (#2 and #3)!
Using ranged_std to solve our case

    # Get start, stop numeric indices as needed for getting ranges array later on
    starts = asd_1.index.searchsorted(index_1)
    ends = asd_1.index.searchsorted(index_2)

    # Create final dataframe output using ranged_std func
    df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)

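The ends+1 is there because label slicing with .loc[i:j] includes the end label, whereas the positional ranges consumed by ranged_std are half-open. The small sketch below illustrates this on a made-up ten-day series (the names s, start_dates and end_dates are illustrative). It assumes every end date is actually present in the index; when that cannot be guaranteed, searchsorted with side='right' appears to be the safer way to get the half-open end, since it returns the position just past the last label that is less than or equal to the date.

```python
import numpy as np
import pandas as pd

# Illustrative series: values 0..9 on ten consecutive days.
s = pd.Series(np.arange(10.0), index=pd.date_range('2011-1-1', periods=10))
start_dates = pd.to_datetime(['2011-1-2', '2011-1-5'])
end_dates = pd.to_datetime(['2011-1-4', '2011-1-8'])

# Positions of the labels in the sorted index (side='left' is the default).
starts = s.index.searchsorted(start_dates)
ends = s.index.searchsorted(end_dates)

# Label-based slicing includes the end label ...
label_slices = [s.loc[i:j].to_numpy() for i, j in zip(start_dates, end_dates)]

# ... so the equivalent half-open positional slice needs ends + 1.
values = s.to_numpy()
pos_slices = [values[a:b] for a, b in zip(starts, ends + 1)]

# side='right' yields the same half-open end without the manual + 1.
ends_right = s.index.searchsorted(end_dates, side='right')

print(all(np.array_equal(a, b) for a, b in zip(label_slices, pos_slices)))
```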
Sample run for verification:

    In [17]: asd_1 = pd.Series(0.01 * np.random.randn(252), index=\
        ...:         pd.date_range('2011-1-1', periods=252))
        ...:
        ...: index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1',])
        ...: index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17',])
        ...:
        ...: index_tot = list(zip(index_1,index_2))
        ...: aux_learning_std = pd.DataFrame([np.nanstd(asd_1.loc[i:j]) for i, j in \
        ...:         index_tot], index=index_1)
        ...:

    In [18]: starts = asd_1.index.searchsorted(index_1)
        ...: ends = asd_1.index.searchsorted(index_2)
        ...: df = pd.DataFrame(ranged_std(asd_1.values,starts,ends+1),index=index_1)
        ...:

    In [19]: aux_learning_std
    Out[19]:
                       0
    2011-02-02  0.007244
    2011-04-03  0.012862
    2011-05-01  0.010155

    In [20]: df
    Out[20]:
                       0
    2011-02-02  0.007244
    2011-04-03  0.012862
    2011-05-01  0.010155

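One difference from the question's loop is worth noting: the loop calls np.nanstd, while ranged_std uses plain sums, so the two only agree when the data contains no NaN (as in the sample above). If NaN values can occur, one possible adaptation is to drop them before the bincount reductions. The function ranged_nanstd below is a hypothetical variant written for this note, not part of the original answer, and it keeps the assumption that every range is non-empty.

```python
import numpy as np

def get_ranges_arr(starts, ends):
    # Concatenate the index ranges [starts[k], ends[k]) into one flat array.
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()

def ranged_nanstd(arr, starts, ends):
    """Hypothetical NaN-skipping variant of ranged_std (illustration only)."""
    idx = get_ranges_arr(starts, ends)
    id_arr = np.repeat(np.arange(starts.size), ends - starts)
    slice_arr = arr[idx]

    # Keep only the non-NaN entries together with their group IDs.
    valid = ~np.isnan(slice_arr)
    ids, vals = id_arr[valid], slice_arr[valid]

    # minlength keeps one slot per range even if a range is entirely NaN;
    # such a range yields NaN (0 / 0), which matches np.nanstd's result.
    n = starts.size
    grp_counts = np.bincount(ids, minlength=n)
    with np.errstate(invalid='ignore', divide='ignore'):
        mean_vals = np.bincount(ids, vals, minlength=n) / grp_counts
        sq_dev = (vals - mean_vals[ids]) ** 2
        return np.sqrt(np.bincount(ids, sq_dev, minlength=n) / grp_counts)

# Small made-up example with one NaN in each range.
arr = np.array([1.0, np.nan, 3.0, 4.0, 5.0, np.nan, 7.0, 8.0])
starts = np.array([0, 3, 5])
ends = np.array([3, 6, 8])

expected = np.array([np.nanstd(arr[i:j]) for i, j in zip(starts, ends)])
result = ranged_nanstd(arr, starts, ends)
print(np.allclose(result, expected))
```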