pandas 绘制时间序列,差距最小 [英] pandas plot time-series with minimized gaps

查看:56
本文介绍了 pandas 绘制时间序列,差距最小的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近开始探索大熊猫的深度,并想可视化一些包含缺口的时间序列数据,其中一些缺口很大.一个示例 mydf :

 时间戳 val0 2016-07-25 00:00:00 0.7404421 2016-07-25 01:00:00 0.8429112 2016-07-25 02:00:00 -0.8739923 2016-07-25 07:00:00 -0.4749934 2016-07-25 08:00:00 -0.9839635 2016-07-25 09:00:00 0.5970116 2016-07-25 10:00:00 -2.0430237 2016-07-25 12:00:00 0.3046688 2016-07-25 13:00:00 1.1859979 2016-07-25 14:00:00 0.92085010 2016-07-25 15:00:00 0.20142311 2016-07-25 16:00:00 0.84297012 2016-07-25 21:00:00 1.06120713 2016-07-25 22:00:00 0.23218014 2016-07-25 23:00:00 0.453964

现在我可以通过 df1.plot(x='timestamp').get_figure().show() 绘制我的 DataFrame 并且沿 x 轴的数据将被插值(显示为一行):

我想拥有的是:

  • 带有数据的部分之间的可见间隙
  • 为不同的间隙长度提供一致的间隙宽度
  • 也许在轴上有某种形式的标记,这有助于弄清执行了一些时间跳跃这一事实.

研究这个问题我遇到过

  • I recently started to explore into the depths of pandas and would like to visualize some time-series data which contains gaps, some of them rather large. an example mydf:

                 timestamp       val
    0  2016-07-25 00:00:00  0.740442
    1  2016-07-25 01:00:00  0.842911
    2  2016-07-25 02:00:00 -0.873992
    3  2016-07-25 07:00:00 -0.474993
    4  2016-07-25 08:00:00 -0.983963
    5  2016-07-25 09:00:00  0.597011
    6  2016-07-25 10:00:00 -2.043023
    7  2016-07-25 12:00:00  0.304668
    8  2016-07-25 13:00:00  1.185997
    9  2016-07-25 14:00:00  0.920850
    10 2016-07-25 15:00:00  0.201423
    11 2016-07-25 16:00:00  0.842970
    12 2016-07-25 21:00:00  1.061207
    13 2016-07-25 22:00:00  0.232180
    14 2016-07-25 23:00:00  0.453964
    

    now i could plot my DataFrame through df1.plot(x='timestamp').get_figure().show() and data along the x-axis would be interpolated (appearing as one line):

    what i would like to have instead is:

    • visible gaps between sections with data
    • a consistent gap-width for differing gaps-legths
    • perhaps some form of marker in the axis which helps to clarify the fact that some jumps in time are performed.

    researching in this matter i've come across

    which generally come close to what i'm after but the former approach would yield in simply leaving the gaps out of the plotted figure and the latter in large gaps that i would like to avoid (think of gaps that may even span a few days).

    as the second approach may be closer i tried to use my timestamp-column as an index through:

    mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf[0])
    

    which allows me to fill the gaps with NaN through reindexing (wondering if there is a more simple solution to achive this):

    mydf3 = mydf2.reindex(pd.date_range('25/7/2016', periods=24, freq='H'))
    

    leading to:

                              val
    2016-07-25 00:00:00  0.740442
    2016-07-25 01:00:00  0.842911
    2016-07-25 02:00:00 -0.873992
    2016-07-25 03:00:00       NaN
    2016-07-25 04:00:00       NaN
    2016-07-25 05:00:00       NaN
    2016-07-25 06:00:00       NaN
    2016-07-25 07:00:00 -0.474993
    2016-07-25 08:00:00 -0.983963
    2016-07-25 09:00:00  0.597011
    2016-07-25 10:00:00 -2.043023
    2016-07-25 11:00:00       NaN
    2016-07-25 12:00:00  0.304668
    2016-07-25 13:00:00  1.185997
    2016-07-25 14:00:00  0.920850
    2016-07-25 15:00:00  0.201423
    2016-07-25 16:00:00  0.842970
    2016-07-25 17:00:00       NaN
    2016-07-25 18:00:00       NaN
    2016-07-25 19:00:00       NaN
    2016-07-25 20:00:00       NaN
    2016-07-25 21:00:00  1.061207
    2016-07-25 22:00:00  0.232180
    2016-07-25 23:00:00  0.453964
    

    from here on i might need to reduce consecutive entries over a certain limit with missing data to a fix number (representing my gap-width) and do something to the index-value of these entries so they are plotted differently but i got lost here i guess as i don't know how to achieve something like that.

    while tinkering around i wondered if there might be a more direct and elegant approach and would be thankful if anyone knowing more about this could point me towards the right direction.

    thanks for any hints and feedback in advance!

    ### ADDENDUM ###

    After posting my question I've come across another interesting idea postend by Andy Hayden that seems helpful. He's using a column to hold the results of a comparison of the difference with a time-delta. After performing a cumsum() on the int-representation of the boolean results he uses groupby() to cluster entries of each ungapped-series into a DataFrameGroupBy-object.

    As this was written some time ago pandas now returns timedelta-objects so the comparison should be done with another timedelta-object like so (based on the mydf from above or on the reindexed df2 after copying its index to a now column through mydf2['timestamp'] = mydf2.index):

    from datetime import timedelta
    myTD = timedelta(minutes=60)
    mydf['nogap'] = mydf['timestamp'].diff() > myTD
    mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum() 
    ## btw.: why not "... .apply(lambda x: int(x)) ..."?
    dfg = mydf.groupby('nogap')
    

    We now could iterate over the DataFrameGroup getting the ungapped series and do something with them. My pandas/mathplot-skills are way too immature but could we plot the group-elements into sub-plots? maybe that way the discontinuity along the time-axis could be represented in some way (in form of an interrupted axis-line or such)?

    piRSquared's answer already leads to a quite usable result with the only thing kind of missing being a more striking visual feedback along the time-axis that a gap/time-jump has occurred between two values.

    Maybe with the grouped Sections the width of the gap-representation could be more configurable?

    解决方案

    I built a new series and plotted it. This is not super elegant! But I believe gets you what you wanted.

    Setup

    Do this to get to your starting point

    from StringIO import StringIO
    import pandas as pd
    
    text = """          timestamp       val
    2016-07-25 00:00:00   0.740442
    2016-07-25 01:00:00   0.842911
    2016-07-25 02:00:00  -0.873992
    2016-07-25 07:00:00  -0.474993
    2016-07-25 08:00:00  -0.983963
    2016-07-25 09:00:00   0.597011
    2016-07-25 10:00:00  -2.043023
    2016-07-25 12:00:00   0.304668
    2016-07-25 13:00:00   1.185997
    2016-07-25 14:00:00   0.920850
    2016-07-25 15:00:00   0.201423
    2016-07-25 16:00:00   0.842970
    2016-07-25 21:00:00   1.061207
    2016-07-25 22:00:00   0.232180
    2016-07-25 23:00:00   0.453964"""
    
    s1 = pd.read_csv(StringIO(text),
                     index_col=0,
                     parse_dates=[0],
                     engine='python',
                     sep='\s{2,}').squeeze()
    
    s1
    
    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Name: val, dtype: float64
    

    Resample hourly. resample is a deferred method, meaning it expects you to pass another method afterwards so it knows what to do. I used mean. For your example, it doesn't matter because we are sampling to a higher frequency. Look it up if you care.

    s2 = s1.resample('H').mean()
    
    s2
    
    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 03:00:00         NaN
    2016-07-25 04:00:00         NaN
    2016-07-25 05:00:00         NaN
    2016-07-25 06:00:00         NaN
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 11:00:00         NaN
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 17:00:00         NaN
    2016-07-25 18:00:00         NaN
    2016-07-25 19:00:00         NaN
    2016-07-25 20:00:00         NaN
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Freq: H, Name: val, dtype: float64
    

    Ok, so you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to fill in only one space of each gap. Then I took the slice of s2 where this forward filled thing was not null. This gives me a single null for each gap.

    s3 = s2[s2.ffill(limit=1).notnull()]
    
    s3
    
    timestamp
    2016-07-25 00:00:00    0.740442
    2016-07-25 01:00:00    0.842911
    2016-07-25 02:00:00   -0.873992
    2016-07-25 03:00:00         NaN
    2016-07-25 07:00:00   -0.474993
    2016-07-25 08:00:00   -0.983963
    2016-07-25 09:00:00    0.597011
    2016-07-25 10:00:00   -2.043023
    2016-07-25 11:00:00         NaN
    2016-07-25 12:00:00    0.304668
    2016-07-25 13:00:00    1.185997
    2016-07-25 14:00:00    0.920850
    2016-07-25 15:00:00    0.201423
    2016-07-25 16:00:00    0.842970
    2016-07-25 17:00:00         NaN
    2016-07-25 21:00:00    1.061207
    2016-07-25 22:00:00    0.232180
    2016-07-25 23:00:00    0.453964
    Name: val, dtype: float64
    

    Lastly, if I plotted this, I still get irregular gaps. I need str indices so that matplotlib doesn't try to expand out my dates.

    s3.reindex(s3.index.strftime('%H:%M'))
    
    timestamp
    00:00    0.740442
    01:00    0.842911
    02:00   -0.873992
    03:00         NaN
    07:00   -0.474993
    08:00   -0.983963
    09:00    0.597011
    10:00   -2.043023
    11:00         NaN
    12:00    0.304668
    13:00    1.185997
    14:00    0.920850
    15:00    0.201423
    16:00    0.842970
    17:00         NaN
    21:00    1.061207
    22:00    0.232180
    23:00    0.453964
    Name: val, dtype: float64
    

    I'll plot them together so we can see the difference.

    f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
    s2.plot(ax=a[0])
    s3.reindex(s3.index.strftime('%H:%M')).plot(ax=a[1])
    

    这篇关于 pandas 绘制时间序列,差距最小的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆