Vectorising a loop based on the order of values in a series
Question
This question is based on a previous question I answered.
The input is as follows:
Index  Results  Price
0      Buy      10
1      Sell     11
2      Buy      12
3      Neutral  13
4      Buy      14
5      Sell     15
I need to find every Buy-Sell sequence (ignoring extra Buy / Sell values out of sequence) and calculate the difference in Price.
Desired output:
Index  Results  Price  Difference
0      Buy      10
1      Sell     11     1
2      Buy      12
3      Neutral  13
4      Buy      14
5      Sell     15     3
My solution is verbose but seems to work:
import numpy as np
from numba import njit

@njit
def get_diffs(results, prices):
    # NaN everywhere except rows where a Buy-Sell pair is closed
    res = np.full(prices.shape, np.nan)
    # prev_one: waiting for a Buy (0); prev_zero: waiting for a Sell (1)
    prev_one, prev_zero = True, False
    for i in range(len(results)):
        if prev_one and (results[i] == 0):
            price_start = prices[i]
            prev_zero, prev_one = True, False
        elif prev_zero and (results[i] == 1):
            res[i] = prices[i] - price_start
            prev_zero, prev_one = False, True
    return res

# Encode Buy as 0 and Sell as 1; any other label (e.g. Neutral) becomes NaN
results = df['Results'].map({'Buy': 0, 'Sell': 1})
df['Difference'] = get_diffs(results.values, df['Price'].values)
Is there a vectorised method? I'm concerned about code maintainability and performance over a large number of rows.
Benchmarking code:
import pandas as pd

df = pd.DataFrame.from_dict({'Index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5},
                             'Results': {0: 'Buy', 1: 'Sell', 2: 'Buy', 3: 'Neutral', 4: 'Buy', 5: 'Sell'},
                             'Price': {0: 10, 1: 11, 2: 12, 3: 13, 4: 14, 5: 15}})

# Repeat the six sample rows 10,000 times to get a 60,000-row frame
df = pd.concat([df]*10**4, ignore_index=True)

def jpp(df):
    results = df['Results'].map({'Buy': 0, 'Sell': 1})
    return get_diffs(results.values, df['Price'].values)

%timeit jpp(df)  # 7.99 ms ± 142 µs per loop
Recommended answer
I'll write up some alternatives using scipy and numpy later, but here's a clear, straightforward answer that proposes a vectorized alternative, although it will still fall behind numba in terms of performance.
If I'm understanding the problem correctly, a "Buy" will appear, followed by any number of possible alternatives, then finally a "Sell" will appear, and you want to find the difference between the first "Buy" and the "Sell". Then another "Buy" will start, etc.
You can create a Series to group by using cumsum and shift:
a = df.Results.shift().eq('Sell').cumsum()
0 0
1 0
2 1
3 1
4 1
5 1
Name: Results, dtype: int32
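To see why this key works, it can help to look at each intermediate step. The sketch below rebuilds the sample frame and shows that shifting the Sell mask down by one row makes the counter increase on the row after each Sell, so every Sell stays in the same group as the Buy rows that precede it. The names is_sell, prev_is_sell, key, and steps are only illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})

is_sell = df['Results'].eq('Sell')               # True on Sell rows
prev_is_sell = df['Results'].shift().eq('Sell')  # True on the row after a Sell
key = prev_is_sell.cumsum()                      # group id grows after each Sell

# Put the steps side by side for inspection
steps = pd.DataFrame({
    'Results': df['Results'],
    'is_sell': is_sell,
    'prev_is_sell': prev_is_sell,
    'key': key,
})
print(steps)
```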
Next, you can find the first and last values per group using agg:
agr = df.groupby(a).Price.agg(['first', 'last'])
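As a small illustration on the six sample rows, the aggregated frame has one row per group. Note that 'first' is simply the first row of each group, which is the first Buy only when the group starts with a Buy, as it does in the sample data.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})

# Grouping key: increases on the row after each Sell
a = df['Results'].shift().eq('Sell').cumsum()

# First and last Price within each group
agr = df.groupby(a)['Price'].agg(['first', 'last'])
print(agr)
```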
Finally, we can assign to a new column using loc:
df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values
   Index  Results  Price  Diff
0      0      Buy     10   NaN
1      1     Sell     11   1.0
2      2      Buy     12   NaN
3      3  Neutral     13   NaN
4      4      Buy     14   NaN
5      5     Sell     15   3.0
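The assignment above relies on two properties of the sample data: every group begins with a Buy, and the number of groups equals the number of Sell rows (so the last row is a Sell). For data where that may not hold, one possible variant is to take the first Buy price per group with transform and subtract it on Sell rows only. This is a sketch under those stated assumptions, not part of the original answer, and the function name buy_sell_diff is hypothetical.

```python
import pandas as pd

def buy_sell_diff(df):
    """Sell price minus the first Buy price since the previous Sell (NaN elsewhere)."""
    # Group id increases on the row after each Sell
    key = df['Results'].shift().eq('Sell').cumsum()
    # Keep prices on Buy rows only; everything else becomes NaN
    buy_price = df['Price'].where(df['Results'].eq('Buy'))
    # 'first' skips NaN, so this is the first Buy price in each group
    first_buy = buy_price.groupby(key).transform('first')
    # Only Sell rows receive a difference
    return (df['Price'] - first_buy).where(df['Results'].eq('Sell'))

# Made-up data with a leading Sell, a repeated Sell, and a trailing Buy
df = pd.DataFrame({
    'Results': ['Sell', 'Buy', 'Neutral', 'Buy', 'Sell', 'Sell', 'Buy'],
    'Price': [9, 10, 11, 12, 13, 14, 15],
})
df['Diff'] = buy_sell_diff(df)
print(df)
```

In this made-up frame only the Sell at index 4 closes a Buy-Sell pair, so it is the only row that receives a value; the leading Sell and the repeated Sell stay NaN, which matches what the loop in the question does with out-of-sequence rows.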
Performance
In [27]: df = pd.concat([df]*10**4, ignore_index=True)

In [28]: %%timeit
    ...: a = df.Results.shift().eq('Sell').cumsum()
    ...: agr = df.groupby(a).Price.agg(['first', 'last'])
    ...: df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values
    ...:
17.6 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %%timeit
    ...: s=df.groupby('Results').cumcount()
    ...: df['Diff']=df.Price.groupby(s).diff().loc[df.Results.isin(['Buy','Sell'])]
    ...:
3.71 s ± 331 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I can't actually run your code; I get a TypingError, so I can't compare.
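Even without a working numba setup, the vectorized result can be cross-checked against the question's logic by running the same loop in plain Python. The sketch below is one way such a check might look. The name get_diffs_py is hypothetical, the @njit decorator is dropped, price_start is initialised up front, and the two complementary flags from the question are folded into a single waiting_for_buy flag. It is meant for correctness checks on small inputs, not for timing.

```python
import numpy as np
import pandas as pd

def get_diffs_py(results, prices):
    """Plain-Python version of the question's loop (0 = Buy, 1 = Sell)."""
    res = np.full(prices.shape, np.nan)
    waiting_for_buy = True
    price_start = np.nan
    for i in range(len(results)):
        if waiting_for_buy and results[i] == 0:
            price_start = prices[i]        # remember the opening Buy price
            waiting_for_buy = False
        elif not waiting_for_buy and results[i] == 1:
            res[i] = prices[i] - price_start  # close the pair on this Sell
            waiting_for_buy = True
    return res

# Sample data from the question
df = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})

# Loop-based reference result
codes = df['Results'].map({'Buy': 0, 'Sell': 1}).to_numpy()
expected = get_diffs_py(codes, df['Price'].to_numpy())

# Vectorized result, as in the answer
a = df['Results'].shift().eq('Sell').cumsum()
agr = df.groupby(a)['Price'].agg(['first', 'last'])
df.loc[df['Results'].eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values

# Compare the two, treating NaN as equal to NaN
same = np.allclose(df['Diff'].to_numpy(), expected, equal_nan=True)
print(same)
```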