How to combine records based on date using Python connected components?

Problem description

I have a list of records (person_id, start_date, end_date) as follows:

person_records = [['1', '08/01/2011', '08/31/2011'],
                 ['1', '09/01/2011', '09/30/2011'],
                 ['1', '11/01/2011', '11/30/2011'],
                 ['1', '12/01/2011', '12/31/2011'],
                 ['1', '01/01/2012', '01/31/2012'],
                 ['1', '03/01/2012', '03/31/2012']]

The records for each person are sorted in ascending order of start_date. Periods are consolidated by combining records based on their dates, taking the start_date of the first period as the start date and the end_date of the last period as the end date. However, if the time between the end of one period and the start of the next is 32 days or less, we should treat them as one continuous period. Otherwise, we treat them as two periods:

consolidated_person_records = [['1', '08/01/2011', '09/30/2011'],
                               ['1', '11/01/2011', '03/31/2012']]

Is there any way to do this using connected components in Python?

Solution

I thought about your question, and I originally wrote a routine that would map the date intervals into a 1D binary array, where each entry in the array is a day, and consecutive days are consecutive entries. With this data structure, you can perform dilation and erosion to fill in small gaps, thus merging the intervals, and then map the consolidated intervals back into date ranges. Thus we use standard raster connected components logic to solve your problem, as per your idea (a graph-based connected components could work as well...)
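The dilation-and-erosion idea can be illustrated on a toy array. The following is a minimal sketch of a textbook one-dimensional closing, not the code used for the timings in this answer; `close_gaps` and `max_gap` are names made up for the illustration, and `max_gap` is assumed to count the empty days strictly between two runs.

```python
import numpy as np

def close_gaps(occ, max_gap):
    """Fill runs of zeros no longer than max_gap that lie between two runs of ones."""
    occ = np.asarray(occ, dtype=int)
    n = len(occ)
    # One-sided dilation: extend every occupied day forward by max_gap days.
    # The array is padded by max_gap so the extension never falls off the end.
    dilated = np.zeros(n + max_gap, dtype=int)
    for i in np.flatnonzero(occ):
        dilated[i:i + max_gap + 1] = 1
    # One-sided erosion: keep a day only if it and the next max_gap days are all set.
    return np.array([dilated[i:i + max_gap + 1].min() for i in range(n)])

# The 2-day gap is filled, the 3-day gap is left open.
print(close_gaps([1, 1, 0, 0, 1, 0, 0, 0, 1], 2).tolist())
# [1, 1, 1, 1, 1, 0, 0, 0, 1]
```

After closing, each remaining run of ones is one consolidated interval.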

This works fine, and I can post the code if you are really interested, but then I wondered what advantage that approach has over the simple routine of iterating through the (pre-sorted) date ranges and merging the next range into the current one if the gap is small.

Here is the code for the simple routine. It takes about 120 microseconds to run on the sample data. If you expand the sample data by repeating it 10,000 times, the routine takes about 1 second on my computer.

When I timed the morphology-based solution, it was about 2x slower. It might work better under certain circumstances, but I would suggest trying the simple approach first and seeing whether there is a real problem that requires a different algorithm.

from datetime import datetime
from datetime import timedelta
import numpy as np

The sample data provided in the question:

SAMPLE_DATA = [['1', '08/01/2011', '08/31/2011'],
               ['1', '09/01/2011', '09/30/2011'],
               ['1', '11/01/2011', '11/30/2011'],
               ['1', '12/01/2011', '12/31/2011'],
               ['1', '01/01/2012', '01/31/2012'],
               ['1', '03/01/2012', '03/31/2012'],
               ['2', '11/11/2011', '11/30/2011'],
               ['2', '12/11/2011', '12/31/2011'],
               ['2', '01/11/2014', '01/31/2014'],
               ['2', '03/11/2014', '03/31/2014']]

The simple approach:

def simple_method(in_data=SAMPLE_DATA, person='1', fill_gap_days=31, printit=False):
    date_format_str = "%m/%d/%Y"
    dat = np.array(in_data)
    dat = dat[dat[:, 0] == person, 1:]  # just this person's data
    # assume date intervals are already sorted by start date
    new_intervals = []
    cur_start = None
    cur_end = None
    gap_days = timedelta(days=fill_gap_days)
    for (s_str, e_str) in dat:
        dt_start = datetime.strptime(s_str, date_format_str)
        dt_end = datetime.strptime(e_str, date_format_str)
        if cur_end is None:
            cur_start = dt_start
            cur_end = dt_end
            continue
        else:
            if cur_end + gap_days >= dt_start:
                # merge, keep existing cur_start, extend cur_end
                cur_end = dt_end
            else:
                # new interval, save previous and reset current to this
                new_intervals.append((cur_start, cur_end))
                cur_start = dt_start
                cur_end = dt_end
    # make sure the final interval is saved (skipped when the person has no records)
    if cur_start is not None:
        new_intervals.append((cur_start, cur_end))

    if printit:
        print_it(person, new_intervals, date_format_str)

    return new_intervals

And here's the simple pretty printing function to print the ranges.

def print_it(person, consolidated_ranges, fmt):
    for (s, e) in consolidated_ranges:
        print(person, s.strftime(fmt), e.strftime(fmt))

Running it in IPython gives the following. Note that printing the result can be turned off when timing the computation.

In [10]: _ = simple_method(printit=True)
1 08/01/2011 09/30/2011
1 11/01/2011 03/31/2012
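To see why a threshold of 31 days produces exactly these two ranges, it can help to list the gaps between consecutive records. This is a small sketch using only the standard library; `gaps_in_days` and `records` are illustrative names, not part of the code above.

```python
from datetime import datetime

FMT = "%m/%d/%Y"

# Person '1' records from the question, already sorted by start date.
records = [('08/01/2011', '08/31/2011'),
           ('09/01/2011', '09/30/2011'),
           ('11/01/2011', '11/30/2011'),
           ('12/01/2011', '12/31/2011'),
           ('01/01/2012', '01/31/2012'),
           ('03/01/2012', '03/31/2012')]

def gaps_in_days(rows, fmt=FMT):
    """Days from each record's end date to the next record's start date."""
    parsed = [(datetime.strptime(s, fmt), datetime.strptime(e, fmt)) for s, e in rows]
    return [(nxt[0] - cur[1]).days for cur, nxt in zip(parsed, parsed[1:])]

print(gaps_in_days(records))  # [1, 32, 1, 1, 30]
```

Only the 32-day gap between 09/30/2011 and 11/01/2011 exceeds 31, so that is the single place where the ranges split.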

Timing it in IPython with the %timeit magic:

In [8]: %timeit simple_method(in_data=SAMPLE_DATA)
10000 loops, best of 3: 118 µs per loop

In [9]: %timeit simple_method(in_data=SAMPLE_DATA*10000)
1 loops, best of 3: 1.06 s per loop
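Outside IPython, the standard `timeit` module can take a similar measurement. The sketch below times a numpy-free restatement of the merge loop; `merge_sorted` and `ROWS` are hypothetical names for this illustration, and the numbers it prints will differ from the ones quoted above.

```python
import timeit
from datetime import datetime, timedelta

FMT = "%m/%d/%Y"
ROWS = [('08/01/2011', '08/31/2011'), ('09/01/2011', '09/30/2011'),
        ('11/01/2011', '11/30/2011'), ('12/01/2011', '12/31/2011'),
        ('01/01/2012', '01/31/2012'), ('03/01/2012', '03/31/2012')]

def merge_sorted(rows, fill_gap_days=31):
    """Merge (start, end) string pairs that are pre-sorted by start date."""
    gap = timedelta(days=fill_gap_days)
    merged = []
    for s_str, e_str in rows:
        start = datetime.strptime(s_str, FMT)
        end = datetime.strptime(e_str, FMT)
        if merged and merged[-1][1] + gap >= start:
            merged[-1][1] = end            # small gap: extend the current range
        else:
            merged.append([start, end])    # large gap: open a new range
    return merged

# Best of 3 repeats, 1000 calls each, reported per call.
best = min(timeit.repeat(lambda: merge_sorted(ROWS), number=1000, repeat=3)) / 1000
print(f"{best * 1e6:.1f} microseconds per call")
```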

[EDIT 8 Feb 2016: To make a long answer longer...] As I said earlier in this answer, I did create a morphological / 1D connected components version, and in my timing it was about 2x slower. For the sake of completeness I'll show the morphological method, and maybe others will spot a part of it that could be sped up substantially.

# using the same imports as the previous code, plus one more
import calendar as cal

def make_occupancy_array(start_year, end_year):
    """
    Represents the time between the start and end years, inclusively, as a 1-D array
    of 'pixels', where each pixel corresponds to a day. Consecutive days are thus
    mapped to consecutive pixels. We can perform morphology on this 1D array to
    close small gaps between date ranges.
    """
    years_days = [(yr, 366 if cal.isleap(yr) else 365) for yr in range(start_year, end_year+1)]
    YD = np.array(years_days)  # like [ (2011, 365), (2012, 366), ... ] in ndarray form
    total_num_days = YD[:, 1].sum()
    occupancy = np.zeros((total_num_days,), dtype='int')
    return YD, occupancy

With the occupancy array to represent the time intervals, we need two functions to map from dates to positions in the array and the inverse.

def map_date_to_position(dt, YD):
    """
    Maps the datetime value to a position in the occupancy array
    """
    # the position is the offset of January 1 of dt's year from the start of the array,
    # plus the day of year minus 1 (day of year is 1-based)
    yr = dt.year
    assert yr in YD[:, 0]  # guard...YD should include all years for this person's dates
    position = YD[YD[:, 0] < yr, 1].sum()  # the sum of the days in year before this year
    position += dt.timetuple().tm_yday - 1
    return position


def map_position_to_date(pos, YD):
    """
    Inverse of map_date_to_position, this maps a position in the
    occupancy array back to a datetime value
    """
    yr_offsets = np.cumsum(YD[:, 1])
    day_offsets = yr_offsets - pos
    idx = np.flatnonzero(day_offsets > 0)[0]
    year = YD[idx, 0]
    day_of_year = pos if idx == 0 else pos - yr_offsets[idx-1]
    # construct datetime as first of year plus day offset in year; the extra +1
    # compensates for get_occupancy_intervals returning positions one day early
    dt = datetime.strptime(str(year), "%Y")
    dt += timedelta(days=int(day_of_year)+1)
    return dt
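As an alternative sketch (not the method used in this answer), the standard library's `date.toordinal()` already numbers days consecutively, so the same two mappings can be written without a per-year table. `date_to_position`, `position_to_date`, and `origin` are names assumed for this illustration.

```python
from datetime import date

def date_to_position(d, origin):
    """0-based day index of d, counted from origin."""
    return d.toordinal() - origin.toordinal()

def position_to_date(pos, origin):
    """Inverse of date_to_position."""
    return date.fromordinal(origin.toordinal() + pos)

origin = date(2011, 1, 1)
pos = date_to_position(date(2012, 3, 31), origin)
print(pos)                            # 455 (365 days of 2011 + 90 days into 2012)
print(position_to_date(pos, origin))  # 2012-03-31
```

Leap years are handled by the calendar arithmetic inside `toordinal`, so no `calendar.isleap` check is needed in this variant.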

The following function fills the relevant part of the occupancy array given start and end dates (inclusive) and optionally extends the end of the range by a gap-filling margin (like 1-sided dilation).

def set_occupancy(dt1, dt2, YD, occupancy, fill_gap_days=0):
    """
    For a date range starting dt1 and ending, inclusively, dt2,
    sets the corresponding 'pixels' in occupancy vector to 1.
    If fill_gap_days > 0, then the end 'pixel' is extended
    (dilated) by this many positions, so that we can fill
    the gaps between intervals that are close to each other.
    """
    pos1 = map_date_to_position(dt1, YD)
    pos2 = map_date_to_position(dt2, YD) + fill_gap_days
    occupancy[pos1:pos2] = 1

Once we have the consolidated intervals in the occupancy array, we need to read them back out into date intervals, optionally performing 1-sided erosion if we'd previously done gap filling.

def get_occupancy_intervals(OCC, fill_gap_days=0):
    """
    Find the runs in the OCC array corresponding
    to the 'dilated' consecutive positions, and then
    'erode' back to the correct end dates by subtracting
    the fill_gap_days.
    """
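    # np.diff marks transitions: a rising edge at index i means the run starts
    # at i + 1, and a falling edge at index i means the run's last pixel is i.
    # Runs touching either end of the array have no edge and are not detected.
    starts = np.flatnonzero(np.diff(OCC) > 0)  # index just before each run starts
    ends = np.flatnonzero(np.diff(OCC) < 0)  # last index of each dilated run
    ends -= fill_gap_days  # erode back to original length prior to dilation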
    return [(s, e) for (s, e) in zip(starts, ends)]
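Run detection with `np.diff` only sees a run when there is a 0 on both sides of it. A common way to also catch runs that touch the first or last element is to pad the array with a zero at each end first. The sketch below shows that variant on a toy array; `find_runs` is a hypothetical helper and uses a different index convention (inclusive first and last positions) from the function above.

```python
import numpy as np

def find_runs(occ):
    """Return inclusive (first, last) index pairs for each run of ones."""
    # Pad with a zero on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([0], np.asarray(occ, dtype=int), [0]))
    d = np.diff(padded)
    starts = np.flatnonzero(d == 1)      # index in occ where a run begins
    ends = np.flatnonzero(d == -1) - 1   # index in occ where a run ends
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

print(find_runs([1, 1, 0, 0, 1, 0, 1, 1]))  # [(0, 1), (4, 4), (6, 7)]
```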

Putting it all together...

def morphology_method(in_data=SAMPLE_DATA, person='1', fill_gap_days=31, printit=False):
    date_format_str = "%m/%d/%Y"
    dat = np.array(in_data)
    dat = dat[dat[:, 0] == person, 1:]  # just this person's data

    # for the intervals of this person, get starting and ending years
    # we assume the data is already sorted
    #start_year = datetime.strptime(dat[0, 0], date_format_str)
    #end_year = datetime.strptime(dat[-1, 1], date_format_str)
    start_times = [datetime.strptime(d, date_format_str) for d in dat[:, 0]]
    end_times = [datetime.strptime(d, date_format_str) for d in dat[:, 1]]
    start_year = start_times[0].year
    end_year = end_times[-1].year

    # create the occupancy array, dilated so that each interval
    # is extended by fill_gap_days to 'fill in' the small gaps
    # between intervals
    YD, OCC = make_occupancy_array(start_year, end_year)
    for (s, e) in zip(start_times, end_times):
        set_occupancy(s, e, YD, OCC, fill_gap_days)

    # return the intervals from OCC after having filled gaps,
    # and trim end dates back to original position.
    consolidated_pos = get_occupancy_intervals(OCC, fill_gap_days)

    # map positions back to date-times
    consolidated_ranges = [(map_position_to_date(s, YD), map_position_to_date(e, YD)) for
                           (s, e) in consolidated_pos]

    if printit:
        print_it(person, consolidated_ranges, date_format_str)

    return consolidated_ranges
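The graph-based connected components variant mentioned in passing earlier can be sketched with a small union-find. This is an illustration under stated assumptions rather than code that was timed for this answer: each record is a node, an edge joins two consecutive records of the same person when the gap is at most `max_gap_days`, and records of one person are assumed not to nest inside each other. `consolidate`, `find`, and `sample` are names invented for the sketch.

```python
from datetime import datetime

FMT = "%m/%d/%Y"

def find(parent, i):
    """Return the root of i's component, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def consolidate(rows, max_gap_days=31, fmt=FMT):
    """rows: (person, start, end) string triples, in any order."""
    recs = [(p, datetime.strptime(s, fmt), datetime.strptime(e, fmt)) for p, s, e in rows]
    order = sorted(range(len(recs)), key=lambda i: (recs[i][0], recs[i][1]))
    parent = list(range(len(recs)))
    # Add an edge between neighbouring records of one person when the gap is small.
    for a, b in zip(order, order[1:]):
        same_person = recs[a][0] == recs[b][0]
        if same_person and (recs[b][1] - recs[a][2]).days <= max_gap_days:
            parent[find(parent, b)] = find(parent, a)
    # Each connected component becomes one range: earliest start, latest end.
    groups = {}
    for i in order:
        root = find(parent, i)
        p, s, e = recs[i]
        if root in groups:
            _, gs, ge = groups[root]
            groups[root] = (p, min(gs, s), max(ge, e))
        else:
            groups[root] = (p, s, e)
    return [[p, s.strftime(fmt), e.strftime(fmt)] for p, s, e in groups.values()]

sample = [['1', '08/01/2011', '08/31/2011'], ['1', '09/01/2011', '09/30/2011'],
          ['1', '11/01/2011', '11/30/2011'], ['1', '12/01/2011', '12/31/2011'],
          ['1', '01/01/2012', '01/31/2012'], ['1', '03/01/2012', '03/31/2012'],
          ['2', '11/11/2011', '11/30/2011'], ['2', '12/11/2011', '12/31/2011'],
          ['2', '01/11/2014', '01/31/2014'], ['2', '03/11/2014', '03/31/2014']]

for row in consolidate(sample):
    print(row)
```

Because edges only ever join neighbours in sorted order, the components are the same runs the simple loop finds, which is one reason the plain iteration is usually enough for this problem.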
