Pandas Rolling Apply custom


Question

I have been following a similar answer here, but I am running into problems when combining sklearn with rolling apply. I am trying to create z-scores and run PCA with rolling apply, but I keep getting the error 'only length-1 arrays can be converted to Python scalars'.

Following the previous example, I create a DataFrame:

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
sc=StandardScaler() 
tmp=pd.DataFrame(np.random.randn(2000,2)/10000,index=pd.date_range('2001-01-01',periods=2000),columns=['A','B'])

If I use the rolling command:

 tmp.rolling(window=5,center=False).apply(lambda x: sc.fit_transform(x))
 TypeError: only length-1 arrays can be converted to Python scalars

I get this error. However, I can create functions that use the mean and standard deviation with no problem.

def test(df):
    return np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test(x))

I believe the error occurs when I try to subtract the mean from the current values to compute the z-score.

def test2(df):
    return df-np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test2(x))
only length-1 arrays can be converted to Python scalars

How can I create custom rolling functions with sklearn to first standardize and then run PCA?

I realize my question was not exactly clear, so I will try again. I want to standardize my values and then run PCA to get the amount of variance explained by each factor. Doing this without rolling is fairly straightforward.

from sklearn import decomposition
testing=sc.fit_transform(tmp) #standardize each column
pca=decomposition.PCA() #run pca
pca.fit(testing)
pca.explained_variance_ratio_
array([ 0.50967441,  0.49032559])

I cannot use this same procedure when rolling. Using the rolling z-score function from @piRSquared gives the z-scores. It seems that PCA from sklearn is incompatible with a custom function passed to rolling apply. (In fact, I think this is the case with most sklearn modules.) I am just trying to get the explained variance, which is a one-dimensional item, but the code below returns a bunch of NaNs.

def test3(df):
    pca.fit(df)
    return pca.explained_variance_ratio_
tmp.rolling(window=5,center=False).apply(lambda x: test3(x))

I can also write my own explained-variance function, but this does not work either.

def test4(df):
    cov_mat=np.cov(df.T) #need covariance of features, not observations
    eigen_vals,eigen_vecs=np.linalg.eig(cov_mat)
    tot=sum(eigen_vals)
    var_exp=[(i/tot) for i in sorted(eigen_vals,reverse=True)]
    return var_exp
tmp.rolling(window=5,center=False).apply(lambda x: test4(x))

I get this error: 0-dimensional array given. Array must be at least two-dimensional.

To recap, I would like to run rolling z-scores and then a rolling PCA that outputs the explained variance at each roll. I have the rolling z-scores working, but not the explained variance.

Recommended answer

As @BrenBarn commented, the function passed to rolling apply needs to reduce a vector to a single number. The following is equivalent to what you were trying to do and helps highlight the problem.

zscore = lambda x: (x - x.mean()) / x.std()
tmp.rolling(5).apply(zscore)

TypeError: only length-1 arrays can be converted to Python scalars

In the zscore function, x.mean() reduces to a scalar and x.std() reduces to a scalar, but x itself is an array. Thus the whole expression evaluates to an array, not a single number.
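To make the single-number requirement concrete, here is a minimal sketch of a window function that does reduce to a scalar: it returns the z-score of the newest value in each window. The helper name `last_zscore` and the small seeded sample frame are assumptions for illustration, not part of the original thread.

```python
import numpy as np
import pandas as pd

# Small seeded frame standing in for the question's `tmp` (assumed data).
rng = np.random.default_rng(0)
tmp = pd.DataFrame(
    rng.standard_normal((20, 2)),
    index=pd.date_range("2001-01-01", periods=20),
    columns=["A", "B"],
)

def last_zscore(window):
    """Reduce one window (a 1-D NumPy array) to a single float:
    the z-score of the newest value in that window."""
    # ddof=1 matches the sample standard deviation that pandas uses by default.
    return (window[-1] - window.mean()) / window.std(ddof=1)

# raw=True passes each column's window to the function as a plain ndarray.
rolled = tmp.rolling(5).apply(last_zscore, raw=True)
```

Because `apply` is called once per column and per window, the function only ever sees a 1-D slice of a single column. That is also why a function needing both columns at once (such as a PCA fit) does not fit this interface.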

The way around this is to apply the rolling operation only to the parts of the z-score calculation that require it (the mean and the standard deviation), and not to the part that causes the problem.

(tmp - tmp.rolling(5).mean()) / tmp.rolling(5).std()
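The expression above covers the rolling z-score only. For the rolling explained variance asked about in the question, one possible approach is to loop over the windows explicitly, since each window needs all columns at once. The sketch below is an assumption rather than part of the original answer: the helper `rolling_explained_variance` is hypothetical, and it uses NumPy eigenvalues of the window's covariance matrix in place of sklearn's PCA, which should give the same ratios up to floating-point error.

```python
import numpy as np
import pandas as pd

def rolling_explained_variance(df, window):
    """Hypothetical helper: explained-variance ratios of a PCA on each
    trailing window, after standardizing the columns within that window."""
    values = df.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    out = np.full((n_rows, n_cols), np.nan)  # rows before a full window stay NaN
    for end in range(window, n_rows + 1):
        block = values[end - window:end]
        # Standardize within the window (z-score per column).
        z = (block - block.mean(axis=0)) / block.std(axis=0, ddof=1)
        # Eigenvalues of the symmetric covariance matrix, largest first.
        eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]
        out[end - 1] = eigvals / eigvals.sum()
    columns = [f"PC{i + 1}" for i in range(n_cols)]
    return pd.DataFrame(out, index=df.index, columns=columns)

# Seeded sample data shaped like the question's `tmp` (assumed, shortened to 50 rows).
rng = np.random.default_rng(0)
tmp = pd.DataFrame(
    rng.standard_normal((50, 2)) / 10000,
    index=pd.date_range("2001-01-01", periods=50),
    columns=["A", "B"],
)
ratios = rolling_explained_variance(tmp, window=5)
```

Each row of `ratios` holds the share of variance carried by each principal component for the window ending on that date. A plain Python loop is slower than a built-in rolling method, and a window containing a constant column would produce a zero standard deviation, so real data may need an extra guard.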
