填充滚动相关矩阵的缺失值 [英] Filling in missing values of a rolling correlation matrix

查看:45
本文介绍了填充滚动相关矩阵的缺失值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题部分地与这个问题有关.

This question partially relates to this question.

可以在此处找到我的数据文件.我使用了从 2008 年 1 月 1 日到 2013 年 12 月 31 日的样本期.数据文件没有缺失值.

My datafile can be found here. I use a sample period from 01 Jan 2008 to 31 Dec 2013. The datafile has no missing values.

以下代码使用前一年值的滚动窗口生成从 2008 年 1 月 1 日到 2013 年 12 月 31 日的每一天的滚动相关矩阵.例如,2008 年 1 月 1 日 AUTBEL 之间的相关性是使用 2007 年 1 月 1 日到 2008 年 1 月 1 日的系列值计算的,对于所有其他对也是如此.

The following code generates the rolling correlation matrix on each day from 01 Jan 2008 to 31 Dec 2013 using a rolling window of the previous 1 year worth of values. E.g., the correlation between AUT and BEL on 01 Jan 2008 is calculated using the series of values from 01 Jan 2007 to 01 Jan 2008, and likewise for all other pairs.

data work.rolling;
set mm.rolling;
run;

%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;


/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile 
                nobs= numDates;
format toDate fromDate date9.;

toDate=date;
fromDate = intnx('year', toDate, -1, 's');

call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );

/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;

run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set  &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
    /*the output dataset from each PROC CORR run will be 
correlation_DDMMMYYY<from date>_DDMMMYY<start date>*/
proc corr data = windowedDataview 
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))

        noprint;
run;

%end;

/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
     indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;


%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)

可以在此处找到输出的摘录.

An excerpt of the output can be found here.

可以看出,第 2 行到第 53 行显示了 2008 年 4 月 1 日的相关矩阵.但是,2009 年 4 月 1 日的相关矩阵出现了一个问题: 的相关系数存在缺失值ALPHA 及其对.这是因为如果查看数据文件,从 2008 年 4 月 1 日到 2009 年 4 月 1 日的 ALPHA 值都为零,因此导致除以零.这种情况也发生在其他一些数据值上,例如,HSBC 从 1 Apr 08 到 1 Apr 09 的所有值也都为 0.

As can be seen row 2 to row 53 presents the correlation matrix for the day 1 Apr 2008. However, a problem arises for the correlation matrix for the day 1 Apr 2009: there are missing values for correlation coefficients for ALPHA and its pairs. This is because if one looks at the datafile, the values for ALPHA from 1 Apr 2008 to 1 Apr 2009 are all zero, hence causing a division by zero. This situation happens with a few other data values too, for example, HSBC also has all values as 0 from 1 Apr 08 to 1 Apr 09.

为了解决这个问题,我想知道如何修改上面的代码,以便在发生这种情况的情况下(即,两个特定日期之间的所有值都是 0),那么两对数据值之间的相关性是使用整个样本期简单计算.例如,ALPHAAUT 之间的相关性在 09 年 4 月 1 日缺失,因此应使用 2008 年 1 月 1 日至 2013 年 12 月 31 日的值计算此相关性,而不是使用从 1 Apr 08 到 1 Apr 09

To resolve this issue, I was wondering how the above code can be modified so that in cases where this situation happens (i.e., all values are 0 between 2 certain dates), then the correlation between the two pairs of data values are simply calculated using the WHOLE sample period. E.g., the correlation between ALPHA and AUT is missing on 1 Apr 09, thus this correlation should be calculated using the values from 1 JAN 2008 to 31 DEC 2013, rather than using the values from 1 Apr 08 to 1 Apr 09

推荐答案

一旦您运行上述宏并获得了 all_correlations 数据集,您将需要运行另一个 PROC CORR这次使用了所有的数据,即

Once you run the above macro and have got your all_correlations dataset, you would need to run another PROC CORR this time using all of the data i.e.,

/*first filter the data to be between "01JAN2008"d and "31DEC2013"d*/
data work.all_data_01JAN2008_31DEC2013;
set mm.rolling;
where date between "01JAN2008"d and "31DEC2013"d;
drop date ;
run;

然后将上面的数据集传递给PROC CORR:

Then pass the above dataset to PROC CORR:

proc corr data =  work.all_data_01JAN2008_31DEC2013
outp = correlations_01JAN2008_31DEC2013
 (where=(_type_ = 'CORR'))

        noprint;
run;
data correlations_01JAN2008_31DEC2013;
length id 8;
set correlations_01JAN2008_31DEC2013;
/*add a column identifier to make sure the order of the correlation matrix is preserved when joined with other tables*/
id = _n_;
run;

您将获得一个数据集,该数据集在 _name_ 列中是唯一的.然后,您必须将 correlations_01JAN2008_31DEC2013 加入 all_correlations,这样如果 all_correlations 中缺少一个值,那么 中的相应值相关性_01JAN2008_31DEC2013 被插入到它的位置.为此,我们可以使用 PROC SQL &COALESCE 函数.

You would get a dataset which is unique by the _name_ column. Then you would have to join correlations_01JAN2008_31DEC2013 to all_correlations in such a way that if a value is missing in all_correlations then a corresponding value from correlations_01JAN2008_31DEC2013 is inserted in its place. For this we can use PROC SQL & the COALESCE function.

PROC SQL;
CREATE TABLE MISSING_VALUES_IMPUTED AS 
SELECT
A.FROM
,A.TO
,b.id
,a._name_
,coalesce(a.AUT,b.AUT) as AUT
,coalesce(a.BEL,b.BEL) as BEL
,coalesce(a.DEN,b.DEN) as DEN
,coalesce(a.FRA,b.FRA) as FRA
,coalesce(a.GER,b.GER) as GER
,coalesce(a.GRE,b.GRE) as GRE
,coalesce(a.IRE,b.IRE) as IRE
,coalesce(a.ITA,b.ITA) as ITA
,coalesce(a.NOR,b.NOR) as NOR
,coalesce(a.POR,b.POR) as POR
,coalesce(a.SPA,b.SPA) as SPA
,coalesce(a.SWE,b.SWE) as SWE
,coalesce(a.NL,b.NL) as NL
,coalesce(a.ERS,b.ERS) as ERS
,coalesce(a.RZB,b.RZB) as RZB
,coalesce(a.DEX,b.DEX) as DEX
,coalesce(a.KBD,b.KBD) as KBD
,coalesce(a.DAB,b.DAB) as DAB
,coalesce(a.BNP,b.BNP) as BNP
,coalesce(a.CRDA,b.CRDA) as CRDA
,coalesce(a.KN,b.KN) as KN
,coalesce(a.SGE,b.SGE) as SGE
,coalesce(a.CBK,b.CBK) as CBK
,coalesce(a.DBK,b.DBK) as DBK
,coalesce(a.IKB,b.IKB) as IKB
,coalesce(a.ALPHA,b.ALPHA) as ALPHA
,coalesce(a.ALBK,b.ALBK) as ALBK
,coalesce(a.IPM,b.IPM) as IPM
,coalesce(a.BKIR,b.BKIR) as BKIR
,coalesce(a.BMPS,b.BMPS) as BMPS
,coalesce(a.PMI,b.PMI) as PMI
,coalesce(a.PLO,b.PLO) as PLO
,coalesce(a.BINS,b.BINS) as BINS
,coalesce(a.MB,b.MB) as MB
,coalesce(a.UC,b.UC) as UC
,coalesce(a.BCP,b.BCP) as BCP
,coalesce(a.BES,b.BES) as BES
,coalesce(a.BBV,b.BBV) as BBV
,coalesce(a.SCHSPS,b.SCHSPS) as SCHSPS
,coalesce(a.NDA,b.NDA) as NDA
,coalesce(a.SEA,b.SEA) as SEA
,coalesce(a.SVK,b.SVK) as SVK
,coalesce(a.SPAR,b.SPAR) as SPAR
,coalesce(a.CSGN,b.CSGN) as CSGN
,coalesce(a.UBSN,b.UBSN) as UBSN
,coalesce(a.ING,b.ING) as ING
,coalesce(a.SNS,b.SNS) as SNS
,coalesce(a.BARC,b.BARC) as BARC
,coalesce(a.HBOS,b.HBOS) as HBOS
,coalesce(a.HSBC,b.HSBC) as HSBC
,coalesce(a.LLOY,b.LLOY) as LLOY
,coalesce(a.STANBS,b.STANBS) as STANBS
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by
A.FROM
,A.TO
,b.id
;
quit;
/*verify that no missing values are left. NMISS column should be 0 from all variables*/
proc means data = MISSING_VALUES_IMPUTED n nmiss;
run;

这篇关于填充滚动相关矩阵的缺失值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆