Describing gaps in a time series in pandas
Problem description
I'm trying to write a function that takes a continuous time series and returns a data structure which describes any missing gaps in the data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.
It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations -- hasn't there? Thanks for any help, folks.
import pandas as pd

def get_gaps(series):
    """
    @param series: a continuous time series of data with the index's freq set
    @return: a series where the index is the start of gaps, and the values are
             the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()

    # any row not missing while the last was is a gap end
    gap_ends = series[~missing & different_from_last].index

    # count the start as different from the last
    different_from_last[0] = True

    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index

    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
        gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)

    return pd.Series(index=gap_starts, data=gap_ends)
For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7
Recommended answer
This problem can be reduced to finding runs of consecutive numbers in a list. Find all the positions where the series is null; if a run of positions (3, 4, 5, 6) is all null, you only need to extract its start and end, (3, 6).
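import numpy as np
import pandas as pd
from operator import itemgetter
from itertools import groupby

def find_gap(s):
    """Treat the null positions as a plain list and group consecutive runs."""
    # integer positions (not index labels) where the series is null
    nullindex = np.where(s.isnull())[0]
    ranges = []
    # within a run of consecutive positions, (rank - position) is constant,
    # so it serves as the grouping key
    for k, g in groupby(enumerate(nullindex), lambda pair: pair[0] - pair[1]):
        group = list(map(itemgetter(1), g))
        ranges.append((group[0], group[-1]))
    # note: zip(*ranges) raises if the series has no nulls at all
    startgap, endgap = zip(*ranges)
    return pd.Series(endgap, index=startgap)

# create an example: positions 0-1 and 6-11 end up null after reindexing
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data)
s = s.reindex(range(18))
print(find_gap(s))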
Reference: Identify groups of continuous numbers in a list
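Since the question asks specifically for vectorized operations, the same run-detection idea can also be sketched without a Python-level loop. This is only an illustration, not part of the original answer: `find_gap_vectorized` is a hypothetical name, and like `find_gap` it reports integer positions rather than index labels. It looks for places where neighbouring null positions are not adjacent and treats those as run boundaries.

```python
import numpy as np
import pandas as pd

def find_gap_vectorized(s):
    """Map the first null position of each gap to its last null position."""
    # integer positions of the null entries, in increasing order
    null_pos = np.flatnonzero(s.isnull().values)
    if null_pos.size == 0:
        # no gaps: return an empty series instead of failing
        return pd.Series(dtype="int64")
    # a run ends wherever the next null position is not adjacent
    breaks = np.flatnonzero(np.diff(null_pos) != 1)
    starts = null_pos[np.r_[0, breaks + 1]]
    ends = null_pos[np.r_[breaks, null_pos.size - 1]]
    return pd.Series(ends, index=starts)

# same example as in the answer: positions 0-1 and 6-11 are null
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data).reindex(range(18))
gaps = find_gap_vectorized(s)
print(gaps)
```

To get index labels (for example timestamps) instead of positions, one option would be to look the positions up in `s.index`, e.g. `s.index[gaps.index]` and `s.index[gaps.values]`.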