pandas plot time-series with minimized gaps

Views: 56 · Posted: 2021/6/1 19:10:02 · Tags: python, pandas, matplotlib, plot


Question

I recently started to explore the depths of pandas and would like to visualize some time-series data that contains gaps, some of them rather large. An example, mydf:

             timestamp       val
0  2016-07-25 00:00:00  0.740442
1  2016-07-25 01:00:00  0.842911
2  2016-07-25 02:00:00 -0.873992
3  2016-07-25 07:00:00 -0.474993
4  2016-07-25 08:00:00 -0.983963
5  2016-07-25 09:00:00  0.597011
6  2016-07-25 10:00:00 -2.043023
7  2016-07-25 12:00:00  0.304668
8  2016-07-25 13:00:00  1.185997
9  2016-07-25 14:00:00  0.920850
10 2016-07-25 15:00:00  0.201423
11 2016-07-25 16:00:00  0.842970
12 2016-07-25 21:00:00  1.061207
13 2016-07-25 22:00:00  0.232180
14 2016-07-25 23:00:00  0.453964

Now I can plot my DataFrame with mydf.plot(x='timestamp').get_figure().show(), but the points on either side of each gap are joined by a straight segment, so the data appears as one continuous line.

What I would like to have instead is:

visible gaps between sections with data
a consistent gap width for differing gap lengths
perhaps some form of marker on the axis that makes it clear that a jump in time has occurred.

Researching this, I came across two approaches that generally come close to what I'm after. However, the former would simply leave the gaps out of the plotted figure, and the latter would produce large gaps that I would like to avoid (think of gaps that may even span a few days).

As the second approach may be closer, I tried to use my timestamp column as an index:

mydf2 = pd.DataFrame(data={'val': list(mydf['val'])}, index=mydf['timestamp'])

This allows me to fill the gaps with NaN through reindexing (I wonder whether there is a simpler way to achieve this):

mydf3 = mydf2.reindex(pd.date_range('2016-07-25', periods=24, freq='H'))

leading to:

                          val
2016-07-25 00:00:00  0.740442
2016-07-25 01:00:00  0.842911
2016-07-25 02:00:00 -0.873992
2016-07-25 03:00:00       NaN
2016-07-25 04:00:00       NaN
2016-07-25 05:00:00       NaN
2016-07-25 06:00:00       NaN
2016-07-25 07:00:00 -0.474993
2016-07-25 08:00:00 -0.983963
2016-07-25 09:00:00  0.597011
2016-07-25 10:00:00 -2.043023
2016-07-25 11:00:00       NaN
2016-07-25 12:00:00  0.304668
2016-07-25 13:00:00  1.185997
2016-07-25 14:00:00  0.920850
2016-07-25 15:00:00  0.201423
2016-07-25 16:00:00  0.842970
2016-07-25 17:00:00       NaN
2016-07-25 18:00:00       NaN
2016-07-25 19:00:00       NaN
2016-07-25 20:00:00       NaN
2016-07-25 21:00:00  1.061207
2016-07-25 22:00:00  0.232180
2016-07-25 23:00:00  0.453964
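
On the question of a simpler route to the NaN-filled frame: one possibility, sketched here under the assumption that the timestamp column already holds datetimes, is set_index followed by asfreq. The five-row mydf below is a stand-in built from the first rows of the sample data, and "60min" is just one way to spell an hourly frequency.

```python
import pandas as pd

# Stand-in for the question's mydf (first five rows of the sample data).
mydf = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-25 00:00", "2016-07-25 01:00", "2016-07-25 02:00",
        "2016-07-25 07:00", "2016-07-25 08:00",
    ]),
    "val": [0.740442, 0.842911, -0.873992, -0.474993, -0.983963],
})

# set_index keeps the 'val' column name; asfreq then inserts a NaN row for
# every hourly slot between the first and the last timestamp.
hourly = mydf.set_index("timestamp").asfreq("60min")
print(hourly)
```

Unlike the explicit date_range above, asfreq only spans the range from the first to the last existing timestamp, so it will not pad the start or end of the day.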

From here on I might need to reduce runs of consecutive missing entries above a certain limit to a fixed number (representing my gap width) and change the index values of these entries so that they are plotted differently. This is where I got lost, as I don't know how to achieve something like that.

While tinkering around I wondered whether there might be a more direct and elegant approach, and I would be thankful if anyone who knows more about this could point me in the right direction.

Thanks in advance for any hints and feedback!

### ADDENDUM ###

After posting my question I came across another interesting idea, posted by Andy Hayden, that seems helpful. He uses a column to hold the result of comparing the difference between consecutive timestamps with a time delta. After performing a cumsum() on the int representation of the boolean results, he uses groupby() to cluster the entries of each ungapped series into a DataFrameGroupBy object.

As this was written some time ago, pandas now returns timedelta objects, so the comparison should be done against another timedelta object, like so (based on the mydf from above, or on the reindexed mydf2 after copying its index to a new column through mydf2['timestamp'] = mydf2.index):

from datetime import timedelta
myTD = timedelta(minutes=60)
mydf['nogap'] = mydf['timestamp'].diff() > myTD
mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum() 
## btw.: why not "... .apply(lambda x: int(x)) ..."?
dfg = mydf.groupby('nogap')

We could now iterate over the DataFrameGroupBy object to get each ungapped series and do something with it. My pandas/matplotlib skills are way too immature, but could we plot the group elements into subplots? Maybe that way the discontinuity along the time axis could be represented in some way (as an interrupted axis line or similar)?
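
The subplot idea could be sketched roughly as follows. This is only one possible layout and not part of either answer referenced here: each ungapped segment gets its own axes, the axes share the y-axis, and the white space between neighbouring axes acts as a gap of constant width. The names segment_id and segments, and the Agg backend, are assumptions made for the example. It also shows that cumsum() can be called directly on the boolean column, without the apply/lambda step.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the sample mydf from the question.
hours = [0, 1, 2, 7, 8, 9, 10, 12, 13, 14, 15, 16, 21, 22, 23]
vals = [0.740442, 0.842911, -0.873992, -0.474993, -0.983963, 0.597011,
        -2.043023, 0.304668, 1.185997, 0.920850, 0.201423, 0.842970,
        1.061207, 0.232180, 0.453964]
mydf = pd.DataFrame({
    "timestamp": pd.Timestamp("2016-07-25") + pd.to_timedelta(hours, unit="h"),
    "val": vals,
})

# True marks the first row after a gap; cumsum on the booleans yields a segment id.
is_gap = mydf["timestamp"].diff() > pd.Timedelta(hours=1)
segment_id = is_gap.cumsum().rename("segment")
segments = [seg for _, seg in mydf.groupby(segment_id)]

# One axes per segment, widths proportional to the number of points in each.
fig, axes = plt.subplots(
    1, len(segments), sharey=True, figsize=(10, 3),
    gridspec_kw={"width_ratios": [len(seg) for seg in segments]},
)
for ax, seg in zip(axes, segments):
    ax.plot(seg["timestamp"], seg["val"], marker="o")
    ax.tick_params(axis="x", labelrotation=45)
fig.subplots_adjust(wspace=0.05)  # the constant visual gap between segments
```

The value passed to wspace controls how wide the break between two segments appears, independent of how long the actual time jump was.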

piRSquared's answer already leads to a quite usable result. The only thing missing is a more striking visual cue along the time axis that a gap/time jump has occurred between two values.

Maybe with the grouped sections the width of the gap representation could be made more configurable?

Answer

I built a new series and plotted it. This is not super elegant, but I believe it gets you what you wanted.

Setup

Do this to get to your starting point

from io import StringIO  # Python 3 location of StringIO
import pandas as pd
import matplotlib.pyplot as plt  # used for the final plot

text = """          timestamp       val
2016-07-25 00:00:00   0.740442
2016-07-25 01:00:00   0.842911
2016-07-25 02:00:00  -0.873992
2016-07-25 07:00:00  -0.474993
2016-07-25 08:00:00  -0.983963
2016-07-25 09:00:00   0.597011
2016-07-25 10:00:00  -2.043023
2016-07-25 12:00:00   0.304668
2016-07-25 13:00:00   1.185997
2016-07-25 14:00:00   0.920850
2016-07-25 15:00:00   0.201423
2016-07-25 16:00:00   0.842970
2016-07-25 21:00:00   1.061207
2016-07-25 22:00:00   0.232180
2016-07-25 23:00:00   0.453964"""

s1 = pd.read_csv(StringIO(text),
                 index_col=0,
                 parse_dates=[0],
                 engine='python',
                 sep=r'\s{2,}').squeeze()

s1

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64

Resample hourly. resample is a deferred method: it only defines the hourly bins and expects you to call an aggregation afterwards so that it knows what to do with each bin. I used mean. For your example the choice doesn't matter, because we are sampling to a higher frequency: every bin holds at most one value, which mean returns unchanged, and empty bins become NaN.

s2 = s1.resample('H').mean()

s2

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 04:00:00         NaN
2016-07-25 05:00:00         NaN
2016-07-25 06:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 18:00:00         NaN
2016-07-25 19:00:00         NaN
2016-07-25 20:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Freq: H, Name: val, dtype: float64
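
To illustrate the deferred behaviour, here is a small sketch on a made-up three-point series (the data and the names resampler and upsampled are assumptions for the example). It uses "60min" as the frequency string, which should behave the same as the hourly alias above.

```python
import pandas as pd

idx = pd.to_datetime(["2016-07-25 00:00", "2016-07-25 01:00", "2016-07-25 04:00"])
s = pd.Series([1.0, 2.0, 3.0], index=idx, name="val")

resampler = s.resample("60min")  # only defines the hourly bins, no result yet
upsampled = resampler.mean()     # the aggregation produces the actual Series

print(type(resampler).__name__)
print(upsampled)
```

The hours 02:00 and 03:00 have no observations, so their bins come back as NaN, while the three existing values pass through mean untouched.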

OK, so you also wanted equally sized gaps. This was a tad tricky. I used ffill(limit=1) to fill only the first slot of each gap, then kept the rows of s2 where that forward-filled series is not null. Every real value survives, plus exactly one NaN per gap.

s3 = s2[s2.ffill(limit=1).notnull()]

s3

timestamp
2016-07-25 00:00:00    0.740442
2016-07-25 01:00:00    0.842911
2016-07-25 02:00:00   -0.873992
2016-07-25 03:00:00         NaN
2016-07-25 07:00:00   -0.474993
2016-07-25 08:00:00   -0.983963
2016-07-25 09:00:00    0.597011
2016-07-25 10:00:00   -2.043023
2016-07-25 11:00:00         NaN
2016-07-25 12:00:00    0.304668
2016-07-25 13:00:00    1.185997
2016-07-25 14:00:00    0.920850
2016-07-25 15:00:00    0.201423
2016-07-25 16:00:00    0.842970
2016-07-25 17:00:00         NaN
2016-07-25 21:00:00    1.061207
2016-07-25 22:00:00    0.232180
2016-07-25 23:00:00    0.453964
Name: val, dtype: float64
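
The same trick can be generalised to make the gap width configurable, as asked in the addendum. The helper shrink_gaps and its gap_width parameter are hypothetical names introduced for this sketch, and the eight-point series is made up.

```python
import numpy as np
import pandas as pd

def shrink_gaps(series, gap_width=1):
    """Keep at most `gap_width` consecutive NaNs per gap (hypothetical helper)."""
    # ffill(limit=n) fills only the first n slots of each NaN run, so notna()
    # is True for real values and for the first n slots of every gap.
    keep = series.ffill(limit=gap_width).notna()
    return series[keep]

idx = pd.date_range("2016-07-25", periods=8, freq="60min")
s2 = pd.Series([1.0, np.nan, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0], index=idx)

narrow = shrink_gaps(s2, gap_width=1)  # one NaN per gap
wide = shrink_gaps(s2, gap_width=2)    # up to two NaNs per gap
print(narrow)
print(wide)
```

Gaps shorter than gap_width keep their original length, so very short gaps still appear narrower than long ones. NaNs at the very start of the series are dropped entirely, because there is nothing before them to forward fill from.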

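Lastly, if I plotted s3 as it is, I would still get irregular gaps, because matplotlib places the points at their true positions on a datetime axis. I need str labels so that the points are spaced evenly. set_axis swaps in the formatted labels while leaving the values in place.

s3.set_axis(s3.index.strftime('%H:%M'))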

timestamp
00:00    0.740442
01:00    0.842911
02:00   -0.873992
03:00         NaN
07:00   -0.474993
08:00   -0.983963
09:00    0.597011
10:00   -2.043023
11:00         NaN
12:00    0.304668
13:00    1.185997
14:00    0.920850
15:00    0.201423
16:00    0.842970
17:00         NaN
21:00    1.061207
22:00    0.232180
23:00    0.453964
Name: val, dtype: float64

I'll plot them together so we can see the difference.

# top: hourly series on a real time axis; bottom: one evenly spaced slot per gap
f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5))
s2.plot(ax=a[0])
s3.set_axis(s3.index.strftime('%H:%M')).plot(ax=a[1])
plt.show()
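
For the more striking cue along the time axis that the question's addendum asks for, one option is to draw a dashed vertical line at each gap slot and replace its tick label with a break symbol. This is a rough sketch on a shortened, made-up series, not part of the original answer. The "//" label, the grey styling and the Agg backend are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Shortened stand-in for the string-labelled series, one NaN per gap.
labels = ["01:00", "02:00", "03:00", "07:00", "08:00", "11:00", "12:00"]
s = pd.Series([0.84, -0.87, np.nan, -0.47, -0.98, np.nan, 0.30], index=labels)

positions = np.arange(len(s))                   # evenly spaced x positions
gap_positions = positions[s.isna().to_numpy()]  # slots that stand for a time jump

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(positions, s.to_numpy(), marker="o")    # the NaN slots break the line
for x in gap_positions:
    ax.axvline(x, color="grey", linestyle="--") # visual marker for the jump

# Show the time on real points and a break symbol on the gap slots.
tick_text = ["//" if pd.isna(v) else lab for lab, v in zip(s.index, s)]
ax.set_xticks(positions)
ax.set_xticklabels(tick_text, rotation=45)
```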
