Describing gaps in a time series pandas

Problem description

I'm trying to write a function that takes a continuous time series and returns a data structure which describes any missing gaps in the data (e.g. a DF with columns 'start' and 'end'). It seems like a fairly common issue for time series, but despite messing around with groupby, diff, and the like -- and exploring SO -- I haven't been able to come up with much better than the below.

It's a priority for me that this use vectorized operations to remain efficient. There has got to be a more obvious solution using vectorized operations -- hasn't there? Thanks for any help, folks.

import pandas as pd


def get_gaps(series):
    """
    @param series: a continuous time series of data with the index's freq set
    @return: a series where the index is the start of gaps, and the values are
         the ends
    """
    missing = series.isnull()
    different_from_last = missing.diff()

    # any row not missing while the last was is a gap end        
    gap_ends = series[~missing & different_from_last].index

    # count the start as different from the last
    different_from_last[0] = True

    # any row missing while the last wasn't is a gap start
    gap_starts = series[missing & different_from_last].index        

    # check and remedy if series ends with missing data
    if len(gap_starts) > len(gap_ends):
         gap_ends = gap_ends.append(series.index[-1:] + series.index.freq)

    return pd.Series(index=gap_starts, data=gap_ends)

For the record, Pandas==0.13.1, Numpy==1.8.1, Python 2.7

Recommended answer

This problem can be reduced to finding runs of consecutive numbers in a list. Find all the positions where the series is null; if a run such as (3, 4, 5, 6) is entirely null, you only need to extract its start and end, (3, 6).

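import numpy as np
import pandas as pd
from itertools import groupby


def find_gap(s):
    """Return a Series mapping the start of each run of nulls to its end.

    Works on integer positions, treating the series like a list.
    """
    # positions (not index labels) of the missing values
    nullindex = np.where(s.isnull())[0]
    ranges = []
    # within a run of consecutive positions, (rank - position) is constant
    for _, g in groupby(enumerate(nullindex), lambda pair: pair[0] - pair[1]):
        group = [x for _, x in g]
        ranges.append((group[0], group[-1]))
    if not ranges:
        # no missing values, so no gaps
        return pd.Series([], dtype="int64")
    startgap, endgap = zip(*ranges)
    return pd.Series(list(endgap), index=list(startgap))


# create an example: positions 0-1 and 6-11 are missing after reindexing
data = [2, 3, 4, 5, 12, 13, 14, 15, 16, 17]
s = pd.Series(data, index=data)
s = s.reindex(range(18))
print(find_gap(s))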
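
Since the question asks for vectorized operations, here is a minimal sketch of the same run-finding idea without a Python-level loop. It is not part of the original answer: the name `find_gaps_vectorized`, the DataFrame layout with 'start' and 'end' columns, and the sample data are assumptions for illustration. Unlike the function in the question, it reports the last missing row of each gap as the end (inclusive), matching the answer above.

```python
import numpy as np
import pandas as pd


def find_gaps_vectorized(s):
    """Return a DataFrame with one row per run of missing values.

    'start' and 'end' are the index labels of the first and last
    missing row of each run (both inclusive). Hypothetical helper.
    """
    missing = np.asarray(s.isnull())
    # Pad with False on both sides so that runs touching either edge
    # of the series still produce a rising and a falling edge.
    padded = np.concatenate(([False], missing, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)      # first missing position of each run
    ends = np.flatnonzero(edges == -1) - 1   # last missing position of each run
    return pd.DataFrame({"start": s.index[starts], "end": s.index[ends]})


# Illustrative daily series with three gaps, one of them at the very end.
idx = pd.date_range("2014-01-01", periods=8, freq="D")
ts = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0, np.nan], index=idx)
gaps = find_gaps_vectorized(ts)
print(gaps)
```

Because the result uses index labels rather than positions, it works unchanged with a DatetimeIndex. If an exclusive end is preferred, as in the question's version, one option is to add the index frequency to the 'end' column afterwards.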

Reference: Identify groups of continuous numbers in a list