pandas 绘制时间序列,差距最小 [英] pandas plot time-series with minimized gaps
问题描述
我最近开始探索大熊猫的深度,并想可视化一些包含缺口的时间序列数据,其中一些缺口很大.一个示例 mydf
:
时间戳 val0 2016-07-25 00:00:00 0.7404421 2016-07-25 01:00:00 0.8429112 2016-07-25 02:00:00 -0.8739923 2016-07-25 07:00:00 -0.4749934 2016-07-25 08:00:00 -0.9839635 2016-07-25 09:00:00 0.5970116 2016-07-25 10:00:00 -2.0430237 2016-07-25 12:00:00 0.3046688 2016-07-25 13:00:00 1.1859979 2016-07-25 14:00:00 0.92085010 2016-07-25 15:00:00 0.20142311 2016-07-25 16:00:00 0.84297012 2016-07-25 21:00:00 1.06120713 2016-07-25 22:00:00 0.23218014 2016-07-25 23:00:00 0.453964
现在我可以通过 df1.plot(x='timestamp').get_figure().show()
绘制我的 DataFrame 并且沿 x 轴的数据将被插值(显示为一行):
我想拥有的是:
- 带有数据的部分之间的可见间隙
- 为不同的间隙长度提供一致的间隙宽度
- 也许在轴上有某种形式的标记,这有助于弄清执行了一些时间跳跃这一事实.
研究这个问题我遇到过
-
I recently started to explore into the depths of pandas and would like to visualize some time-series data which contains gaps, some of them rather large. an example
mydf
:timestamp val 0 2016-07-25 00:00:00 0.740442 1 2016-07-25 01:00:00 0.842911 2 2016-07-25 02:00:00 -0.873992 3 2016-07-25 07:00:00 -0.474993 4 2016-07-25 08:00:00 -0.983963 5 2016-07-25 09:00:00 0.597011 6 2016-07-25 10:00:00 -2.043023 7 2016-07-25 12:00:00 0.304668 8 2016-07-25 13:00:00 1.185997 9 2016-07-25 14:00:00 0.920850 10 2016-07-25 15:00:00 0.201423 11 2016-07-25 16:00:00 0.842970 12 2016-07-25 21:00:00 1.061207 13 2016-07-25 22:00:00 0.232180 14 2016-07-25 23:00:00 0.453964
now i could plot my DataFrame through
df1.plot(x='timestamp').get_figure().show()
and data along the x-axis would be interpolated (appearing as one line):what i would like to have instead is:
- visible gaps between sections with data
- a consistent gap-width for differing gaps-legths
- perhaps some form of marker in the axis which helps to clarify the fact that some jumps in time are performed.
researching in this matter i've come across
which generally come close to what i'm after but the former approach would yield in simply leaving the gaps out of the plotted figure and the latter in large gaps that i would like to avoid (think of gaps that may even span a few days).
as the second approach may be closer i tried to use my timestamp-column as an index through:
mydf2 = pd.DataFrame(data=list(mydf['val']), index=mydf[0])
which allows me to fill the gaps with
NaN
through reindexing (wondering if there is a more simple solution to achive this):mydf3 = mydf2.reindex(pd.date_range('25/7/2016', periods=24, freq='H'))
leading to:
val 2016-07-25 00:00:00 0.740442 2016-07-25 01:00:00 0.842911 2016-07-25 02:00:00 -0.873992 2016-07-25 03:00:00 NaN 2016-07-25 04:00:00 NaN 2016-07-25 05:00:00 NaN 2016-07-25 06:00:00 NaN 2016-07-25 07:00:00 -0.474993 2016-07-25 08:00:00 -0.983963 2016-07-25 09:00:00 0.597011 2016-07-25 10:00:00 -2.043023 2016-07-25 11:00:00 NaN 2016-07-25 12:00:00 0.304668 2016-07-25 13:00:00 1.185997 2016-07-25 14:00:00 0.920850 2016-07-25 15:00:00 0.201423 2016-07-25 16:00:00 0.842970 2016-07-25 17:00:00 NaN 2016-07-25 18:00:00 NaN 2016-07-25 19:00:00 NaN 2016-07-25 20:00:00 NaN 2016-07-25 21:00:00 1.061207 2016-07-25 22:00:00 0.232180 2016-07-25 23:00:00 0.453964
from here on i might need to reduce consecutive entries over a certain limit with missing data to a fix number (representing my gap-width) and do something to the index-value of these entries so they are plotted differently but i got lost here i guess as i don't know how to achieve something like that.
while tinkering around i wondered if there might be a more direct and elegant approach and would be thankful if anyone knowing more about this could point me towards the right direction.
thanks for any hints and feedback in advance!
### ADDENDUM ###
After posting my question I've come across another interesting idea postend by Andy Hayden that seems helpful. He's using a column to hold the results of a comparison of the difference with a time-delta. After performing a
cumsum()
on the int-representation of the boolean results he usesgroupby()
to cluster entries of each ungapped-series into aDataFrameGroupBy
-object.As this was written some time ago pandas now returns
timedelta
-objects so the comparison should be done with anothertimedelta
-object like so (based on themydf
from above or on the reindexeddf2
after copying its index to a now column throughmydf2['timestamp'] = mydf2.index
):from datetime import timedelta myTD = timedelta(minutes=60) mydf['nogap'] = mydf['timestamp'].diff() > myTD mydf['nogap'] = mydf['nogap'].apply(lambda x: 1 if x else 0).cumsum() ## btw.: why not "... .apply(lambda x: int(x)) ..."? dfg = mydf.groupby('nogap')
We now could iterate over the DataFrameGroup getting the ungapped series and do something with them. My pandas/mathplot-skills are way too immature but could we plot the group-elements into sub-plots? maybe that way the discontinuity along the time-axis could be represented in some way (in form of an interrupted axis-line or such)?
piRSquared's answer already leads to a quite usable result with the only thing kind of missing being a more striking visual feedback along the time-axis that a gap/time-jump has occurred between two values.
Maybe with the grouped Sections the width of the gap-representation could be more configurable?
解决方案I built a new series and plotted it. This is not super elegant! But I believe gets you what you wanted.
Setup
Do this to get to your starting point
from StringIO import StringIO import pandas as pd text = """ timestamp val 2016-07-25 00:00:00 0.740442 2016-07-25 01:00:00 0.842911 2016-07-25 02:00:00 -0.873992 2016-07-25 07:00:00 -0.474993 2016-07-25 08:00:00 -0.983963 2016-07-25 09:00:00 0.597011 2016-07-25 10:00:00 -2.043023 2016-07-25 12:00:00 0.304668 2016-07-25 13:00:00 1.185997 2016-07-25 14:00:00 0.920850 2016-07-25 15:00:00 0.201423 2016-07-25 16:00:00 0.842970 2016-07-25 21:00:00 1.061207 2016-07-25 22:00:00 0.232180 2016-07-25 23:00:00 0.453964""" s1 = pd.read_csv(StringIO(text), index_col=0, parse_dates=[0], engine='python', sep='\s{2,}').squeeze() s1 timestamp 2016-07-25 00:00:00 0.740442 2016-07-25 01:00:00 0.842911 2016-07-25 02:00:00 -0.873992 2016-07-25 07:00:00 -0.474993 2016-07-25 08:00:00 -0.983963 2016-07-25 09:00:00 0.597011 2016-07-25 10:00:00 -2.043023 2016-07-25 12:00:00 0.304668 2016-07-25 13:00:00 1.185997 2016-07-25 14:00:00 0.920850 2016-07-25 15:00:00 0.201423 2016-07-25 16:00:00 0.842970 2016-07-25 21:00:00 1.061207 2016-07-25 22:00:00 0.232180 2016-07-25 23:00:00 0.453964 Name: val, dtype: float64
Resample hourly.
resample
is a deferred method, meaning it expects you to pass another method afterwards so it knows what to do. I usedmean
. For your example, it doesn't matter because we are sampling to a higher frequency. Look it up if you care.s2 = s1.resample('H').mean() s2 timestamp 2016-07-25 00:00:00 0.740442 2016-07-25 01:00:00 0.842911 2016-07-25 02:00:00 -0.873992 2016-07-25 03:00:00 NaN 2016-07-25 04:00:00 NaN 2016-07-25 05:00:00 NaN 2016-07-25 06:00:00 NaN 2016-07-25 07:00:00 -0.474993 2016-07-25 08:00:00 -0.983963 2016-07-25 09:00:00 0.597011 2016-07-25 10:00:00 -2.043023 2016-07-25 11:00:00 NaN 2016-07-25 12:00:00 0.304668 2016-07-25 13:00:00 1.185997 2016-07-25 14:00:00 0.920850 2016-07-25 15:00:00 0.201423 2016-07-25 16:00:00 0.842970 2016-07-25 17:00:00 NaN 2016-07-25 18:00:00 NaN 2016-07-25 19:00:00 NaN 2016-07-25 20:00:00 NaN 2016-07-25 21:00:00 1.061207 2016-07-25 22:00:00 0.232180 2016-07-25 23:00:00 0.453964 Freq: H, Name: val, dtype: float64
Ok, so you also wanted equally sized gaps. This was a tad tricky. I used
ffill(limit=1)
to fill in only one space of each gap. Then I took the slice ofs2
where this forward filled thing was not null. This gives me a single null for each gap.s3 = s2[s2.ffill(limit=1).notnull()] s3 timestamp 2016-07-25 00:00:00 0.740442 2016-07-25 01:00:00 0.842911 2016-07-25 02:00:00 -0.873992 2016-07-25 03:00:00 NaN 2016-07-25 07:00:00 -0.474993 2016-07-25 08:00:00 -0.983963 2016-07-25 09:00:00 0.597011 2016-07-25 10:00:00 -2.043023 2016-07-25 11:00:00 NaN 2016-07-25 12:00:00 0.304668 2016-07-25 13:00:00 1.185997 2016-07-25 14:00:00 0.920850 2016-07-25 15:00:00 0.201423 2016-07-25 16:00:00 0.842970 2016-07-25 17:00:00 NaN 2016-07-25 21:00:00 1.061207 2016-07-25 22:00:00 0.232180 2016-07-25 23:00:00 0.453964 Name: val, dtype: float64
Lastly, if I plotted this, I still get irregular gaps. I need
str
indices so thatmatplotlib
doesn't try to expand out my dates.s3.reindex(s3.index.strftime('%H:%M')) timestamp 00:00 0.740442 01:00 0.842911 02:00 -0.873992 03:00 NaN 07:00 -0.474993 08:00 -0.983963 09:00 0.597011 10:00 -2.043023 11:00 NaN 12:00 0.304668 13:00 1.185997 14:00 0.920850 15:00 0.201423 16:00 0.842970 17:00 NaN 21:00 1.061207 22:00 0.232180 23:00 0.453964 Name: val, dtype: float64
I'll plot them together so we can see the difference.
f, a = plt.subplots(2, 1, sharey=True, figsize=(10, 5)) s2.plot(ax=a[0]) s3.reindex(s3.index.strftime('%H:%M')).plot(ax=a[1])
这篇关于 pandas 绘制时间序列,差距最小的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!