How to get the correlation between two timeseries using Pandas


Question

I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data.

I've been playing with Pandas to try to do this. I've created two timeseries and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two timeseries do not match up exactly (they're generally off by seconds), I get Null as an answer. I could get a decent answer if I could do one of the following:

a) Interpolate/fill missing times in each timeseries (I know this is possible in Pandas, I just don't know how to do it)

b) Strip the seconds out of the Python datetime objects (set seconds to 00, without changing minutes). I'd lose a degree of accuracy, but not a huge amount

c) Use something else in Pandas to get the correlation between two timeseries

d) Use something in Python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking into account the time.

Does anyone have any suggestions?

Recommended answer

You have a number of options using pandas, but you have to decide how it makes sense to align the data, given that the readings don't occur at the same instants.
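
To see why alignment matters, here is a minimal sketch on made-up data (the series `a` and `b` and their values are purely illustrative). `Series.corr` matches observations by index label, so two series with no timestamps in common leave nothing to correlate and the result is NaN, even though the values themselves are strongly related:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two daily series whose timestamps differ by 30 seconds.
idx_a = pd.date_range("2000-01-03", periods=5, freq="D")
idx_b = idx_a + pd.Timedelta(seconds=30)
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx_a)
b = pd.Series([2.0, 4.1, 5.9, 8.2, 9.9], index=idx_b)

# corr() pairs values by index label; no labels are shared, so this is NaN.
misaligned = a.corr(b)

# Ignoring the timestamps and pairing purely by position gives a real number,
# but it silently assumes the i-th readings belong together.
positional = np.corrcoef(a.to_numpy(), b.to_numpy())[0, 1]

print(misaligned, positional)
```

Pairing by position only works when both series have the same length and sampling pattern, which is why an explicit alignment step is usually the safer choice.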

Use the "as of" values at the times of one of the time series. Here's an example:

    In [15]: ts
    Out[15]: 
    2000-01-03 00:00:00    -0.722808451504
    2000-01-04 00:00:00    0.0125041039477
    2000-01-05 00:00:00    0.777515530539
    2000-01-06 00:00:00    -0.35714026263
    2000-01-07 00:00:00    -1.55213541118
    2000-01-10 00:00:00    -0.508166334892
    2000-01-11 00:00:00    0.58016097981
    2000-01-12 00:00:00    1.50766289013
    2000-01-13 00:00:00    -1.11114968643
    2000-01-14 00:00:00    0.259320239297



    In [16]: ts2
    Out[16]: 
    2000-01-03 00:00:30    1.05595278907
    2000-01-04 00:00:30    -0.568961755792
    2000-01-05 00:00:30    0.660511172645
    2000-01-06 00:00:30    -0.0327384421979
    2000-01-07 00:00:30    0.158094407533
    2000-01-10 00:00:30    -0.321679671377
    2000-01-11 00:00:30    0.977286027619
    2000-01-12 00:00:30    -0.603541295894
    2000-01-13 00:00:30    1.15993249209
    2000-01-14 00:00:30    -0.229379534767

You can see these are off by 30 seconds. The reindex function lets you align the data while filling values forward (getting the "as of" value):

    In [17]: ts.reindex(ts2.index, method='pad')
    Out[17]: 
    2000-01-03 00:00:30    -0.722808451504
    2000-01-04 00:00:30    0.0125041039477
    2000-01-05 00:00:30    0.777515530539
    2000-01-06 00:00:30    -0.35714026263
    2000-01-07 00:00:30    -1.55213541118
    2000-01-10 00:00:30    -0.508166334892
    2000-01-11 00:00:30    0.58016097981
    2000-01-12 00:00:30    1.50766289013
    2000-01-13 00:00:30    -1.11114968643
    2000-01-14 00:00:30    0.259320239297

    In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
    Out[18]: -0.31004148593302283
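
The same idea as a self-contained script, as a sketch on made-up numbers rather than the data shown above. The `tolerance` argument is an optional addition that is not part of the original answer: it caps how far back a value may be carried, so readings that are too stale become NaN instead of being paired silently.

```python
import pandas as pd

# Hypothetical toy data: ts2 is stamped 30 seconds after ts.
idx = pd.date_range("2000-01-03", periods=5, freq="D")
ts = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0], index=idx)
ts2 = pd.Series([1.5, 2.5, 2.0, 4.0, 4.5], index=idx + pd.Timedelta(seconds=30))

# Take, for each timestamp in ts2, the most recent value of ts ("as of" value).
# tolerance limits the look-back to one minute (an illustrative choice).
aligned = ts.reindex(ts2.index, method="ffill", tolerance=pd.Timedelta("1min"))

# Both series now share ts2's index, so corr() has pairs to work with.
r = ts2.corr(aligned)
print(r)
```

Other values of `method` work the same way: `'bfill'` takes the next observation and `'nearest'` takes whichever is closest in time.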

Note that 'pad' is also aliased by 'ffill' (at the time this answer was written, only in the very latest version of pandas on GitHub).

Strip the seconds out of all your datetimes. The best way to do this is to use rename:

    In [25]: ts2.rename(lambda date: date.replace(second=0))
    Out[25]: 
    2000-01-03 00:00:00    1.05595278907
    2000-01-04 00:00:00    -0.568961755792
    2000-01-05 00:00:00    0.660511172645
    2000-01-06 00:00:00    -0.0327384421979
    2000-01-07 00:00:00    0.158094407533
    2000-01-10 00:00:00    -0.321679671377
    2000-01-11 00:00:00    0.977286027619
    2000-01-12 00:00:00    -0.603541295894
    2000-01-13 00:00:00    1.15993249209
    2000-01-14 00:00:00    -0.229379534767

Note that if rename results in duplicate dates, an Exception will be thrown.
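
In more recent pandas versions, a vectorized alternative is to floor the index to the minute. The sketch below uses made-up data and is not the method from the original answer. Note one difference: `floor` also discards sub-second parts, whereas `replace(second=0)` leaves microseconds untouched. Checking for duplicate timestamps afterwards is a cheap safeguard before correlating.

```python
import pandas as pd

# Hypothetical toy data with non-zero seconds in the timestamps.
idx = pd.to_datetime(
    ["2000-01-03 00:00:30", "2000-01-04 00:00:30", "2000-01-05 00:00:59"]
)
ts2 = pd.Series([1.0, 2.0, 3.0], index=idx)

# Round every timestamp down to the start of its minute.
floored = ts2.copy()
floored.index = ts2.index.floor("min")

# Two readings in the same minute would collapse onto one label; detect that.
has_duplicates = floored.index.has_duplicates

print(floored)
print(has_duplicates)
```

If duplicates do show up, aggregating per minute (for example with a mean) is one way to resolve them.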

For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per second):

    In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()

    In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()

    In [33]: ts_mean.corr(ts2_mean)
    Out[33]: -0.31004148593302283

These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object as shown above, try .agg(np.mean) instead.
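
With a DatetimeIndex, `resample` is another way to compute per-minute means in current pandas. This is a sketch on deterministic made-up data, not the original author's code; `b` is constructed as a linear function of `a`, so the correlation should come out very close to 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one reading every 10 seconds for 10 minutes,
# with the second sensor stamped 3 seconds later than the first.
idx_a = pd.date_range("2000-01-03", periods=60, freq=pd.Timedelta(seconds=10))
idx_b = idx_a + pd.Timedelta(seconds=3)
signal = np.sin(np.linspace(0.0, 6.0, 60))
a = pd.Series(signal, index=idx_a)
b = pd.Series(2.0 * signal + 1.0, index=idx_b)

# Average all readings that fall within the same minute.
a_mean = a.resample("1min").mean()
b_mean = b.resample("1min").mean()

# Both results are labelled by the start of each minute, so they align.
r = a_mean.corr(b_mean)
print(r)
```

Unlike the groupby version, `resample` also emits a row for minutes that contain no readings (filled with NaN); `corr` skips those pairs.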

Hope this helps!
