与大 pandas 互相关(时滞相关)? [英] Cross-correlation (time-lag-correlation) with pandas?

查看:186
本文介绍了与大 pandas 互相关(时滞相关)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有各种时间序列,我想相互关联(或更确切地说,相互关联),以找出相关因子在哪个时间滞后最大.

I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.

我发现各种

I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.

修改

所有numpy/scipy方法的问题在于,它们似乎缺乏对我的数据的时间序列性质的认识.当我将一个始于1940年的时间序列与一个始于1970年的时间序列相关联时,大熊猫corr知道这一点,而np.correlate只是产生1020个条目(较长序列的长度),该数组充满nan.

The issue I am having with all the numpy/scipy methods, is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in say 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020 entries (length of the longer series) array full of nan.

关于该主题的各种Q表示应该有一种方法来解决不同长度的问题,但是到目前为止,我还没有迹象表明如何在特定时间段使用它.我只需要以1为增量以12个月为间隔,以查看一年内最大的相关时间.

The various Q's on this subject indicate that there should be a way to solve the different length issue, but so far, I have seen no indication on how to use it for specific time periods. I just need to shift by 12 months in increments of 1, for seeing the time of maximum correlation within one year.

Edit2

一些最小样本数据:

import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.random_integers(-30,30,(len(dfdates1)))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.random_integers(-30,30,(len(dfdates2)))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)

由于各种处理步骤,这些df最终变成了从1940年到2015年建立索引的df.这应该重现此内容:

Due to various processing steps, those dfs end up changed into df that are indexed from 1940 to 2015. this should reproduce this:

bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)

这是我与大熊猫关联并移动一个数据集时得到的:

This is what I get when I correlate with pandas and shift one dataset:

In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523

并尝试scipy:

In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]: 
array([[ nan],
       [ nan],
       [ nan],
       ..., 
       [ nan],
       [ nan],
       [ nan]])

根据whos

scicorr               ndarray                       1801x1: 1801 elems, type `float64`, 14408 bytes

但是我只想输入12个条目. /Edit2

But I'd just like to have 12 entries. /Edit2

我想到的想法是自己实现时滞相关,就像这样:

The idea I have come up with, is to implement a time-lag-correlation myself, like so:

corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on

但这可能很慢,我可能正在尝试在这里重新发明轮子. 编辑上面的方法似乎可行,并且我将其循环使用,可以遍历一年的所有12个月,但我仍然希望使用内置方法.

But this is probably slow, and I am probably trying to reinvent the wheel here. Edit The above approach seems to work, and I have put it into a loop, to go through all 12 months of a year, but I still would prefer a built in method.

推荐答案

据我所知,没有内置的方法可以准确地 您所要的内容.但是,如果您查看pandas Series方法autocorr的源代码,您会发现您有正确的主意:

As far as I can tell, there isn't a built in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:

def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))

因此,一个简单的时滞交叉协方差函数将是

So a simple timelagged cross covariance function would be

def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation. 
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))

然后,如果您想查看每个月的互相关,则可以

Then if you wanted to look at the cross correlations at each month, you could do

 xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]

这篇关于与大 pandas 互相关(时滞相关)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆