Cross-correlation (time-lag-correlation) with pandas?
Question
I have various time series that I want to correlate, or rather cross-correlate, with each other, to find out at which time lag the correlation coefficient is greatest.
I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.
Edit
The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time-series nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry array (the length of the longer series) full of nan.
The various questions on this subject indicate that there should be a way to solve the different-length issue, but so far I have seen no indication of how to use it for specific time periods. I just need to shift by up to 12 months, in increments of 1, to see the time of maximum correlation within one year.
Edit2
Some minimal sample data:
import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
# np.random.random_integers is deprecated; randint excludes the upper bound, hence 31
dfdata1 = np.random.randint(-30, 31, len(dfdates1)) / 10.0  # My real data is from measurements, but random values between -3 and 3 are a fitting stand-in
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = np.random.randint(-30, 31, len(dfdates2)) / 10.0
df2 = pd.DataFrame(dfdata2, index = dfdates2)
Due to various processing steps, those DataFrames end up reindexed to run from 1940 to 2015. This should reproduce that:
bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)
This is what I get when I correlate with pandas and shift one dataset:
In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523
And trying scipy:
In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]:
array([[ nan],
[ nan],
[ nan],
...,
[ nan],
[ nan],
[ nan]])
which, according to whos, is
scicorr ndarray 1801x1: 1801 elems, type `float64`, 14408 bytes
But I'd just like to have 12 entries. /Edit2
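A minimal sketch of why the array comes back full of nan, using small made-up series rather than the data above. np.correlate computes a sliding sum of products (it is not a normalized correlation coefficient), so any nan inside an overlap makes that output nan, while Series.corr uses only the index positions where both series have values. The example uses np.correlate; it is an assumption here that scipy.signal.correlate treats nan inputs the same way.

```python
import numpy as np
import pandas as pd

# Two short monthly series padded with NaN, mimicking series that cover
# different periods but share one long index (values are invented).
dates = pd.date_range("2000-01-01", periods=6, freq="MS")
a = pd.Series([np.nan, 1.0, 2.0, 3.0, 4.0, np.nan], index=dates)
b = pd.Series([np.nan, np.nan, 2.0, 1.0, 4.0, 3.0], index=dates)

# Sliding dot product: every overlap here touches at least one NaN.
raw = np.correlate(a.to_numpy(), b.to_numpy(), mode="full")

# Pearson correlation over the months where both series have data.
pairwise = a.corr(b)

print(np.isnan(raw).all())   # every entry of the raw result is NaN
print(np.isfinite(pairwise)) # the pandas coefficient is a usable number
```

Because the padding sits at the ends of both series, every shift in "full" mode includes at least one nan, which matches the all-nan output shown above.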
The idea I have come up with is to implement a time-lag correlation myself, like so:
corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on
But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.
Recommended answer
As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:
def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))
So a simple time-lagged cross-correlation function would be
def crosscorr(datax, datay, lag=0):
    """Lag-N cross correlation.

    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    -------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))
Then, if you wanted to look at the cross-correlations at each monthly lag, you could do
xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
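To read off the lag with the highest correlation, the list can be wrapped in a Series indexed by lag. The following is a sketch on synthetic data, not the asker's measurements: the series names, the random seed, and the built-in 3-month delay are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def crosscorr(datax, datay, lag=0):
    """Correlation of datax with datay shifted forward by `lag` periods."""
    return datax.corr(datay.shift(lag))


# Synthetic monthly data (illustrative assumption): x is y delayed by
# 3 months, plus a little noise.
rng = np.random.default_rng(0)
dates = pd.date_range("1980-01-01", periods=240, freq="MS")
y = pd.Series(rng.normal(size=len(dates)), index=dates)
x = y.shift(3) + 0.1 * pd.Series(rng.normal(size=len(dates)), index=dates)

# One coefficient per lag, indexed by the lag itself.
lags = list(range(12))
xcorr = pd.Series([crosscorr(x, y, lag=i) for i in lags], index=lags)

# Lag (in months) at which the correlation is greatest.
best_lag = xcorr.idxmax()
print(best_lag)
```

With this construction the peak lands at the 3-month delay that was built in. Negative lags can be scanned the same way, for example with range(-12, 13), when it is not known in advance which series leads.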