Compute *rolling* maximum drawdown of pandas Series

Problem description

It's pretty easy to write a function that computes the maximum drawdown of a time series. It takes a small bit of thinking to write it in O(n) time instead of O(n^2) time. But it's not that bad. This will work:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

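def max_dd(ser):
    # Running maximum up to each point (pd.expanding_max was removed from pandas).
    max2here = ser.expanding().max()
    # Drawdown at each point: how far the series sits below its running maximum.
    dd2here = ser - max2here
    return dd2here.min()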
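
To make the O(n) idea concrete, here is a minimal sketch of the same calculation on a plain NumPy array, with a tiny hand-checkable input. The function name `max_drawdown` is made up for this illustration and is not part of the original post.

```python
import numpy as np


def max_drawdown(values):
    """Largest drop from a running peak, returned as a non-positive number."""
    x = np.asarray(values, dtype=float)
    running_max = np.maximum.accumulate(x)  # peak seen so far at each step
    drawdown = x - running_max              # 0 at new peaks, negative below them
    return drawdown.min()


if __name__ == "__main__":
    # Running max is [1, 3, 3, 5, 5, 5], so the drawdowns are [0, 0, -1, 0, -4, -1].
    print(max_drawdown([1, 3, 2, 5, 1, 4]))
```

A single pass builds the running maximum, so no pair of points is ever compared directly, which is what avoids the O(n^2) cost.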

Let's set up a brief series to play with to try it out:

np.random.seed(0)
n = 100
s = pd.Series(np.random.randn(n).cumsum())
s.plot()
plt.show()

As expected, max_dd(s) comes out right around -17.6. Good, great, grand. Now say I'm interested in computing the rolling drawdown of this Series, i.e. for each step, I want to compute the maximum drawdown over the preceding sub-series of a specified length. This is easy to do with a rolling apply (pd.rolling_apply in older pandas, Series.rolling(...).apply in current versions). It works like so:

# Current equivalent of the removed pd.rolling_apply(s, 10, max_dd, min_periods=0)
rolling_dd = s.rolling(10, min_periods=0).apply(max_dd, raw=False)
df = pd.concat([s, rolling_dd], axis=1)
df.columns = ['s', 'rol_dd_10']
df.plot()

This works perfectly. But it feels very slow. Is there a particularly slick algorithm in pandas or another toolkit to do this fast? I took a shot at writing something bespoke: it keeps track of all sorts of intermediate data (locations of observed maxima, locations of previously found drawdowns) to cut down on lots of redundant calculations. It does save some time, but not a whole lot, and not nearly as much as should be possible.

I think it's because of all the looping overhead in Python/Numpy/Pandas. But I'm not currently fluent enough in Cython to really know how to begin attacking this from that angle. I was hoping someone had tried this before. Or, perhaps, that someone might want to have a look at my "handmade" code and be willing to help me convert it to Cython.

Edit: For anyone who wants a review of all the functions mentioned here (and some others!), have a look at the IPython notebook at: http://nbviewer.ipython.org/gist/8one6/8506455

It shows how some of the approaches to this problem relate, checks that they give the same results, and shows their runtimes on data of various sizes.

If anyone is interested, the "bespoke" algorithm I alluded to in my post is rolling_dd_custom. I think that could be a very fast solution if implemented in Cython.

Recommended answer

Here's a numpy version of the rolling maximum drawdown function. windowed_view is a wrapper of a one-line function that uses numpy.lib.stride_tricks.as_strided to make a memory efficient 2d windowed view of the 1d array (full code below). Once we have this windowed view, the calculation is basically the same as your max_dd, but written for a numpy array, and applied along the second axis (i.e. axis=1).

def rolling_max_dd(x, window_size, min_periods=1):
    """Compute the rolling maximum drawdown of `x`.

    `x` must be a 1d numpy array.
    `min_periods` should satisfy `1 <= min_periods <= window_size`.

    Returns a 1d array with length `len(x) - min_periods + 1`.
    """
    if min_periods < window_size:
        pad = np.empty(window_size - min_periods)
        pad.fill(x[0])
        x = np.concatenate((pad, x))
    y = windowed_view(x, window_size)
    running_max_y = np.maximum.accumulate(y, axis=1)
    dd = y - running_max_y
    return dd.min(axis=1)
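
As a side note, newer NumPy releases (1.20 and later) ship `numpy.lib.stride_tricks.sliding_window_view`, which builds the same kind of windowed view without hand-written strides. The sketch below assumes that function is available. The name `rolling_max_dd_swv` is invented here, and this is a variation on the answer's function rather than the code that was timed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_max_dd_swv(x, window_size, min_periods=1):
    """Rolling maximum drawdown using sliding_window_view (NumPy >= 1.20)."""
    x = np.asarray(x, dtype=float)
    if min_periods < window_size:
        # Pad the front with copies of x[0] so early, partial windows exist.
        # The copies add no drawdown because they equal the first value.
        pad = np.full(window_size - min_periods, x[0])
        x = np.concatenate((pad, x))
    y = sliding_window_view(x, window_size)          # read-only 2d view, no copy
    running_max = np.maximum.accumulate(y, axis=1)   # peak so far within each window
    return (y - running_max).min(axis=1)


if __name__ == "__main__":
    print(rolling_max_dd_swv([1, 3, 2, 5, 1, 4], window_size=3))
```

The view itself is cheap, but `np.maximum.accumulate` still allocates a full (number of windows) x (window size) array, so memory grows with the window length either way.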

Here's a complete script that demonstrates the function:

import numpy as np
from numpy.lib.stride_tricks import as_strided
import pandas as pd
import matplotlib.pyplot as plt


def windowed_view(x, window_size):
    """Create a 2d windowed view of a 1d array.

    `x` must be a 1d numpy array.

    `numpy.lib.stride_tricks.as_strided` is used to create the view.
    The data is not copied.

    Example:

    >>> x = np.array([1, 2, 3, 4, 5, 6])
    >>> windowed_view(x, 3)
    array([[1, 2, 3],
           [2, 3, 4],
           [3, 4, 5],
           [4, 5, 6]])
    """
    y = as_strided(x, shape=(x.size - window_size + 1, window_size),
                   strides=(x.strides[0], x.strides[0]))
    return y


def rolling_max_dd(x, window_size, min_periods=1):
    """Compute the rolling maximum drawdown of `x`.

    `x` must be a 1d numpy array.
    `min_periods` should satisfy `1 <= min_periods <= window_size`.

    Returns a 1d array with length `len(x) - min_periods + 1`.
    """
    if min_periods < window_size:
        pad = np.empty(window_size - min_periods)
        pad.fill(x[0])
        x = np.concatenate((pad, x))
    y = windowed_view(x, window_size)
    running_max_y = np.maximum.accumulate(y, axis=1)
    dd = y - running_max_y
    return dd.min(axis=1)


def max_dd(ser):
    max2here = ser.expanding().max()  # replaces the removed pd.expanding_max
    dd2here = ser - max2here
    return dd2here.min()


if __name__ == "__main__":
    np.random.seed(0)
    n = 100
    s = pd.Series(np.random.randn(n).cumsum())

    window_length = 10

    # Current equivalent of the removed pd.rolling_apply
    rolling_dd = s.rolling(window_length, min_periods=0).apply(max_dd, raw=False)
    df = pd.concat([s, rolling_dd], axis=1)
    df.columns = ['s', 'rol_dd_%d' % window_length]
    df.plot(linewidth=3, alpha=0.4)

    my_rmdd = rolling_max_dd(s.values, window_length, min_periods=1)
    plt.plot(my_rmdd, 'g.')

    plt.show()

The plot shows the curves generated by your code. The green dots are computed by rolling_max_dd.

Timing comparison, with n = 10000 and window_length = 500:

In [2]: %timeit rolling_dd = pd.rolling_apply(s, window_length, max_dd, min_periods=0)
1 loops, best of 3: 247 ms per loop

In [3]: %timeit my_rmdd = rolling_max_dd(s.values, window_length, min_periods=1)
10 loops, best of 3: 38.2 ms per loop

rolling_max_dd is about 6.5 times faster. The speedup is better for smaller window lengths. For example, with window_length = 200, it is almost 13 times faster.
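
The timings above were taken with an older pandas API. For anyone wanting to rerun the comparison on a current install, here is a rough sketch. It assumes `Series.rolling(...).apply` as the stand-in for `pd.rolling_apply`, uses smaller default sizes so it finishes quickly, and the helper name `compare` is invented for this example. The ratios it prints will depend on the machine and library versions, so they are not expected to match the figures quoted above.

```python
import time

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided


def max_dd(ser):
    # Same logic as the question's max_dd, written against current pandas.
    return (ser - ser.expanding().max()).min()


def rolling_max_dd(x, window_size, min_periods=1):
    # Same logic as the answer's function, condensed.
    if min_periods < window_size:
        x = np.concatenate((np.full(window_size - min_periods, x[0]), x))
    y = as_strided(x, shape=(x.size - window_size + 1, window_size),
                   strides=(x.strides[0], x.strides[0]))
    return (y - np.maximum.accumulate(y, axis=1)).min(axis=1)


def compare(n=2000, window_length=100, seed=0):
    """Time both approaches once and check that they agree."""
    rng = np.random.default_rng(seed)
    s = pd.Series(rng.standard_normal(n).cumsum())

    t0 = time.perf_counter()
    slow = s.rolling(window_length, min_periods=1).apply(max_dd, raw=False)
    t1 = time.perf_counter()
    fast = rolling_max_dd(s.values, window_length, min_periods=1)
    t2 = time.perf_counter()

    assert np.allclose(slow.values, fast)  # both methods give the same numbers
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    slow_t, fast_t = compare()
    print("rolling apply: %.4f s, windowed view: %.4f s" % (slow_t, fast_t))
```

A single `perf_counter` measurement is noisy. `%timeit` in IPython, as used above, gives steadier numbers.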

To handle NA's, you could preprocess the Series using the fillna method before passing the array to rolling_max_dd.
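
One possible way to do that preprocessing is sketched below. The choice of forward fill followed by backward fill is an assumption made for this example (the answer only says to use fillna), and the helper name `fill_gaps` is made up. Whether carrying the last value forward is appropriate depends on the data.

```python
import numpy as np
import pandas as pd


def fill_gaps(ser):
    """Forward-fill NaNs, then back-fill any NaNs left at the very start."""
    return ser.ffill().bfill()


if __name__ == "__main__":
    s = pd.Series([1.0, np.nan, 3.0, 2.0, np.nan, 5.0])
    x = fill_gaps(s).to_numpy()          # plain float array with no NaNs
    dd = x - np.maximum.accumulate(x)    # drawdown on the cleaned data
    print(x, dd.min())
```

Forward filling keeps a gap from registering as a new peak or trough, since the filled value simply repeats the last observation.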
