EWMA Covariance Matrix in Pandas - Optimization

Problem description

I would like to calculate the EWMA Covariance Matrix from a DataFrame of stock price returns using Pandas and have followed the methodology in PyPortfolioOpt.

I like the flexibility of using Pandas objects and functions, but as the set of assets grows the function becomes very slow:

import pandas as pd
import numpy as np

def ewma_cov_pairwise_pd(x, y, alpha=0.06):
    x = x.mask(y.isnull(), np.nan)
    y = y.mask(x.isnull(), np.nan)
    # product of demeaned series; rows where either input is NaN are dropped
    covariation = ((x - x.mean()) * (y - y.mean())).dropna()
    return covariation.ewm(alpha=alpha).mean().iloc[-1]

def ewma_cov_pd(rets, alpha=0.06):
    assets = rets.columns
    n = len(assets)
    cov = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            cov[i, j] = cov[j, i] = ewma_cov_pairwise_pd(
                rets.iloc[:, i], rets.iloc[:, j], alpha=alpha)
    return pd.DataFrame(cov, columns=assets, index=assets)

I would like to speed up the code, ideally while still using Pandas, but the bottleneck is the DataFrame.ewm() function, which accounts for 90% of the calculation time.

If using this function is a binding constraint, what is the most efficient way to speed up the code? I was considering a brute-force approach using concurrent.futures.ProcessPoolExecutor, but perhaps there is a better solution.

n = 100  # n is typically 2000
rets = pd.DataFrame(np.random.normal(0, 1., size=(n, n)))
cov_pd = ewma_cov_pd(rets)

The true time-series data can contain leading nulls and potentially missing values after that, although the latter is less likely.
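For testing against data of this shape, a minimal sketch of a generator for returns with leading nulls could look like the following. The helper name `make_rets_with_leading_nans` and its parameters are hypothetical (not part of the question), and it only models leading NaNs, not gaps later in a series.

```python
import numpy as np
import pandas as pd

def make_rets_with_leading_nans(n_rows=100, n_cols=5, seed=0):
    """Hypothetical helper: random returns where each column starts at a random row."""
    rng = np.random.default_rng(seed)
    rets = pd.DataFrame(rng.normal(0, 1.0, size=(n_rows, n_cols)))
    # Draw a start row per column; everything before it becomes NaN.
    starts = rng.integers(0, n_rows // 2, size=n_cols)
    for pos, start in enumerate(starts):
        rets.iloc[:start, pos] = np.nan
    return rets

rets_na = make_rets_with_leading_nans()
# Number of leading NaNs per column
print(rets_na.isnull().sum())
```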

Update I

A potential solution that builds on the answer provided by Quang Hoang and produces the expected results in a far more reasonable time would be something like:

def ewma_cov_frame_qh(rets, alpha=0.06):
    # weight (1 - alpha)**k for the row k steps before the last one
    weights = (1-alpha) ** np.arange(len(rets))[::-1]
    normalized = (rets-rets.mean()).to_numpy()
    out = (weights * normalized.T) @ normalized / weights.sum()
    return pd.DataFrame(out, index=rets.columns, columns=rets.columns)


def ewma_cov_qh(rets, alpha=0.06):
    syms = rets.columns
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()
     
    for date in dates:
        frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
        cov = ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
        covar = covar.fillna(cov)
   
    return covar

cov_qh = ewma_cov_qh(rets)

This violates the requirement that the underlying covariance is calculated using native Pandas/Numpy functions, and the calculation time will depend on the number of leading NaNs in the data set.
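To see why the run time scales with the leading NaNs, it may help to count how many sub-frames the loop has to process. The sketch below uses a toy frame (the name `rets_toy` and its values are made up for illustration) and assumes missing values occur only at the start of a column. Under that assumption, the distinct first-valid dates match the dates found by the delta logic above.

```python
import numpy as np
import pandas as pd

# Toy returns: columns 0-1 start at row 0, column 2 at row 3, column 3 at row 5.
rets_toy = pd.DataFrame(np.random.default_rng(1).normal(size=(10, 4)))
rets_toy.iloc[:3, 2] = np.nan
rets_toy.iloc[:5, 3] = np.nan

# First non-null index label per column.
first_valid = rets_toy.apply(lambda col: col.first_valid_index())

# Each distinct start date corresponds to one sub-frame (one matrix product).
start_dates = sorted(first_valid.unique().tolist())
print(start_dates)  # [0, 3, 5]

# Same dates as the delta-based detection used in ewma_cov_qh.
null_counts = rets_toy.isnull().sum(axis=1)
delta = null_counts.shift(1) - null_counts
print(delta.loc[delta != 0].index.tolist() == start_dates)
```

With only three distinct start dates, the loop performs three matrix products regardless of how many columns share each start date.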

Update II

A potential improvement on the above, which uses (a naive implementation of) multiprocessing and reduces the calculation time by a further 42.5% on my machine, is listed below:

from concurrent.futures import ProcessPoolExecutor, as_completed
from functools import partial
    
def ewma_cov_mp_worker(date, rets, alpha=0.06):
    syms = rets.columns
    frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
    return ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)


def ewma_cov_mp(rets, alpha=0.06):
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()

    func = partial(ewma_cov_mp_worker, rets=rets, alpha=alpha)
    covs = {}

    # "executor" avoids shadowing the built-in exec()
    with ProcessPoolExecutor(max_workers=6) as executor:
        future_to_date = {executor.submit(func, date): date for date in dates}
        covs = {future_to_date[future]: future.result() for future in as_completed(future_to_date)}

    for date in dates:
        covar.fillna(covs[date], inplace=True)

    return covar

[I have not added this as an answer because it does not address the original question, and I am optimistic there is a better solution.]

Solution

Since you don't really need the full ewm output, i.e., you only take the last value, we can try matrix multiplication:

def ewma(df, alpha=0.94):
    # weight (1 - alpha)**k for the row k steps before the last one
    weights = (1-alpha) ** np.arange(len(df))[::-1]

    # fillna with 0 here
    normalized = (df-df.mean()).fillna(0).to_numpy()

    out = (weights * normalized.T) @ normalized / weights.sum()

    return out

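# verify against the pairwise function from the question
# (df is the returns DataFrame, assumed here to contain no NaNs)
out = ewma(df, alpha=0.06)
print(np.isclose(out[0, 1], ewma_cov_pairwise_pd(df[0], df[1], alpha=0.06)))
# True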

This took about 150 ms on my system with df.shape==(2000,2000), while your code did not finish within minutes :-).
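The identity this relies on can be checked on a single series: with pandas' default adjust=True, the last value of ewm(alpha=...).mean() is a weighted average with weights (1 - alpha)**k, where k counts steps back from the most recent row. The following is a small sketch on random data; the names `s`, `last_ewm` and `weighted_avg` are illustrative only.

```python
import numpy as np
import pandas as pd

alpha = 0.06
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=250))

# Last value of the pandas EWMA (default adjust=True).
last_ewm = s.ewm(alpha=alpha).mean().iloc[-1]

# Same quantity as an explicit weighted average: the most recent
# observation gets weight 1, the one before it (1 - alpha), and so on.
weights = (1 - alpha) ** np.arange(len(s))[::-1]
weighted_avg = (weights * s.to_numpy()).sum() / weights.sum()

print(np.isclose(last_ewm, weighted_avg))
```

Because only this final weighted average is needed, the whole covariance matrix reduces to one weighted matrix product instead of one ewm call per pair of columns.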
