Vectorising a loop based on the order of values in a series

Views: 84 Published: 2020/5/18 21:36:12 Tags: python pandas performance numpy dataframe

Problem description

This question is based on a previous question I answered.

The input is as follows:

Index   Results  Price
0       Buy      10
1       Sell     11
2       Buy      12
3       Neutral  13
4       Buy      14
5       Sell     15

I need to find every Buy-Sell sequence (ignoring extra Buy / Sell values out of sequence) and calculate the difference in Price.

Desired output:

Index Results Price Difference
0     Buy     10    
1     Sell    11    1
2     Buy     12    
3     Neutral 13    
4     Buy     14    
5     Sell    15    3

My solution is verbose but seems to work:

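import numpy as np
import pandas as pd
from numba import njit

@njit
def get_diffs(results, prices):
    # results: 0 for Buy, 1 for Sell, NaN for anything else
    res = np.full(prices.shape, np.nan)
    # prev_one: waiting for a Buy; prev_zero: a Buy is open, waiting for a Sell
    prev_one, prev_zero = True, False
    for i in range(len(results)):
        if prev_one and (results[i] == 0):
            # first Buy of a new sequence: remember its price
            price_start = prices[i]
            prev_zero, prev_one = True, False
        elif prev_zero and (results[i] == 1):
            # first Sell after that Buy: record the price difference
            res[i] = prices[i] - price_start
            prev_zero, prev_one = False, True
    return res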

results = df['Results'].map({'Buy': 0, 'Sell': 1})

df['Difference'] = get_diffs(results.values, df['Price'].values)
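One detail of the mapping step that is easy to miss: the dictionary has no entry for 'Neutral', so `map` turns those rows into NaN and the resulting Series is float64 rather than integer. The small sketch below only illustrates that behaviour on the sample labels; it is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell']})

# Labels missing from the dict ('Neutral' here) are mapped to NaN.
results = df['Results'].map({'Buy': 0, 'Sell': 1})

print(results.tolist())
print(results.dtype)  # NaN forces a float dtype

# NaN compares unequal to both 0 and 1, so the loop's two branches skip such rows.
print((results == 0).tolist())
print((results == 1).tolist())
```

Because NaN never equals 0 or 1, 'Neutral' rows fall through both branches of the loop without needing a separate code for them.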

Is there a vectorised method? I'm concerned about code maintainability and performance over a large number of rows.

Benchmarking code:

df = pd.DataFrame.from_dict({'Index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5},
                             'Results': {0: 'Buy', 1: 'Sell', 2: 'Buy', 3: 'Neutral', 4: 'Buy', 5: 'Sell'},
                             'Price': {0: 10, 1: 11, 2: 12, 3: 13, 4: 14, 5: 15}})

df = pd.concat([df]*10**4, ignore_index=True)

def jpp(df):
    results = df['Results'].map({'Buy': 0, 'Sell': 1})    
    return get_diffs(results.values, df['Price'].values)

%timeit jpp(df)  # 7.99 ms ± 142 µs per loop

Recommended answer

I'll write up some alternatives using scipy and numpy later, but here's a clear, straightforward answer just to propose a vectorized alternative, although this will still fall behind numba in terms of performance.

If I'm understanding the problem correctly, a "Buy" will appear, followed by any number of possible alternatives, then finally a "Sell" will appear, and you want to find the difference between the first "Buy" and the "Sell". Then another "Buy" will start, etc.

You can create a Series to group by using cumsum and shift:

a = df.Results.shift().eq('Sell').cumsum()

0    0
1    0
2    1
3    1
4    1
5    1
Name: Results, dtype: int32
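To see why this one-liner yields the group ids shown above, it may help to split it into its three steps. The sketch below rebuilds the sample frame and uses intermediate names (`prev`, `starts_new_group`, `steps`) that are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})

# Step 1: shift the labels down one row, so row i sees the label of row i-1.
prev = df['Results'].shift()

# Step 2: True wherever the previous row was a 'Sell', i.e. a new group starts here.
starts_new_group = prev.eq('Sell')

# Step 3: the running count of those flags gives every row its group id.
a = starts_new_group.cumsum()

# Side-by-side view of the intermediate steps.
steps = pd.DataFrame({
    'Results': df['Results'],
    'prev': prev,
    'starts_new_group': starts_new_group,
    'group': a,
})
print(steps)
```

Shifting before comparing is what keeps each 'Sell' row inside the group it closes, instead of making it the first row of the next group.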

You can next find the first and last values per group using agg:

agr = df.groupby(a).Price.agg(['first', 'last'])
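The answer does not show what `agr` contains. As a quick check on the sample data, the sketch below rebuilds the frame and the grouping key and inspects the aggregated result.

```python
import pandas as pd

df = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})
a = df['Results'].shift().eq('Sell').cumsum()

# One row per group: the price on the group's first row and on its last row.
agr = df.groupby(a)['Price'].agg(['first', 'last'])
print(agr)

# Group 0 spans rows 0-1 (prices 10 -> 11), group 1 spans rows 2-5 (12 -> 15).
print(agr['last'].sub(agr['first']).tolist())  # [1, 3]
```

Note that 'first' here is the first row of the group whatever its label, so this step relies on every group starting with a 'Buy'.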

Finally, we can assign to a new column using loc:

df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values

   Index  Results  Price  Diff
0      0      Buy     10   NaN
1      1     Sell     11   1.0
2      2      Buy     12   NaN
3      3  Neutral     13   NaN
4      4      Buy     14   NaN
5      5     Sell     15   3.0
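The approach above appears to rely on two properties of the data: every group begins with a 'Buy', and the frame ends on a 'Sell' (otherwise the number of groups and the number of 'Sell' rows differ and the assignment lengths no longer match). The following is a hedged sketch of a variant that relaxes both assumptions by looking up the first 'Buy' price inside each group. The function name `diffs_grouped` and the `messy` test frame are hypothetical and not part of the original answer.

```python
import pandas as pd

def diffs_grouped(df):
    """Price difference at each Sell relative to the first Buy in its group."""
    is_buy = df['Results'].eq('Buy')
    is_sell = df['Results'].eq('Sell')
    # A new group starts on the row after every Sell.
    group = is_sell.shift(fill_value=False).cumsum()
    # Keep only Buy prices, then broadcast each group's first Buy price to all
    # of its rows ('first' skips NaN; groups without a Buy stay NaN).
    first_buy = df['Price'].where(is_buy).groupby(group).transform('first')
    # Report the difference on Sell rows only.
    return (df['Price'] - first_buy).where(is_sell)

sample = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})
print(diffs_grouped(sample).tolist())  # [nan, 1.0, nan, nan, nan, 3.0]

# Out-of-sequence rows: a leading Sell, a repeated Sell and a trailing Buy.
messy = pd.DataFrame({
    'Results': ['Sell', 'Buy', 'Buy', 'Neutral', 'Sell', 'Sell', 'Buy'],
    'Price': [5, 10, 12, 13, 14, 15, 16],
})
print(diffs_grouped(messy).tolist())  # only row 4 gets a value: 14 - 10 = 4.0
```

Since every 'Sell' closes a group, the loop in the question is always waiting for a 'Buy' at the start of a group, so taking the first 'Buy' per group should reproduce its behaviour. This has not been benchmarked against the numba version.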

Performance

In [27]: df = pd.concat([df]*10**4, ignore_index=True)

In [28]: %%timeit
    ...: a = df.Results.shift().eq('Sell').cumsum()
    ...: agr = df.groupby(a).Price.agg(['first', 'last'])
    ...: df.loc[df.Results.eq('Sell'), 'Diff'] = agr['last'].sub(agr['first']).values
    ...:
17.6 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %%timeit
    ...: s=df.groupby('Results').cumcount()
    ...: df['Diff']=df.Price.groupby(s).diff().loc[df.Results.isin(['Buy','Sell'])]
    ...:
3.71 s ± 331 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I can't actually run your code, I get a TypingError, so I can't compare.
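The timings above use the IPython `%timeit` magic. For anyone who wants to repeat the comparison in a plain Python script, the standard-library `timeit` module is one option. The sketch below wraps the groupby approach in a function named `groupby_diff` (a name chosen here for illustration) and times it on the same 60,000-row frame; no timing figures are claimed, since they depend on the machine and library versions.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    'Results': ['Buy', 'Sell', 'Buy', 'Neutral', 'Buy', 'Sell'],
    'Price': [10, 11, 12, 13, 14, 15],
})
# Same scale as the benchmark in the question: 6 rows repeated 10**4 times.
big = pd.concat([base] * 10**4, ignore_index=True)

def groupby_diff(df):
    """The groupby approach from the answer, returning a new Series."""
    a = df['Results'].shift().eq('Sell').cumsum()
    agr = df.groupby(a)['Price'].agg(['first', 'last'])
    out = pd.Series(float('nan'), index=df.index)
    out.loc[df['Results'].eq('Sell')] = agr['last'].sub(agr['first']).values
    return out

runs = 5
seconds = timeit.timeit(lambda: groupby_diff(big), number=runs)
print(f"{seconds / runs * 1e3:.1f} ms per call")
```

Returning a new Series instead of writing into the frame keeps repeated timing runs from modifying `big`.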
