Python: Count the frequency of messages within a date range per dynamic interval


Problem description


Count the number of messages within a date range per interval. I'm using Python 2.6.5 only.

For example:
Start date: 12/11/2014
End date: 12/12/2014
Start time: 02:00
End time: 02:05
Interval: per 1 min

So this translates to: how many messages fall into each one-minute interval from start date 12/11 to end date 12/12? My output would look like this (the strings "min" and "messages" are not needed):

datetime(2014, 12, 11, 2, 0) min: 0 messages,
datetime(2014, 12, 11, 2, 1) min: 1 message,
datetime(2014, 12, 11, 2, 2) min: 2 messages,
datetime(2014, 12, 11, 2, 3) min: 1 message,
datetime(2014, 12, 11, 2, 4) min: 0 messages,
datetime(2014, 12, 11, 2, 5) min: 0 messages

I believe the code below accomplishes this, but it is very slow with large datasets. I think that is because it uses two nested loops: the inner loop walks the whole table once for every iteration of the outer loop, so a large table takes a very long time. Is there a better procedure or algorithm for this?

Edit: I need to include zero for intervals that do not have messages. I'm also trying to find the peak, minimum and average.

from datetime import date,datetime, timedelta, time

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta


def rdata(table, fromDate, toDate, fromTime, toTime, interval): 
    date_to_alert = {}
    start_date = datetime(fromDate.year, fromDate.month, fromDate.day, fromTime.hour, fromTime.minute)
    end_date = datetime(toDate.year, toDate.month, toDate.day, toTime.hour, toTime.minute)

    list_range_of_dates = []
    for date_range in perdelta(start_date ,end_date ,interval):
        list_range_of_dates.append(date_range)
    print list_range_of_dates
    index = 0
    for date_range in list_range_of_dates:
        for row in table:    

            print('first_alerted_time 1: %s index: %s len: %s' % ( row['first_alerted_time'], index, len(list_range_of_dates)-1))          
            if row['first_alerted_time'] and row['first_alerted_time'] >= list_range_of_dates[index] and row['first_alerted_time'] < list_range_of_dates[index + 1]:
                print('Start date: %s' % list_range_of_dates[index] )
                print('first_alerted_time: %s' % row['first_alerted_time'])
                print('end date: %s' % list_range_of_dates[index + 1])
                if list_range_of_dates[index] in date_to_alert:
                    date_to_alert[list_range_of_dates[index]].append(row)                                     
                else:
                    date_to_alert[list_range_of_dates[index]] = [row]                       

            elif row['first_alerted_time']:
                print('first_alerted_time 2: %s' % row['first_alerted_time'])        
        index = index + 1  

        print   date_to_alert    
    # Replace each list of rows with its length (the per-interval count).
    for key, value in date_to_alert.items():
        date_to_alert[key] = len(value)
    print   date_to_alert
    t1 = []
    if date_to_alert:
        avg = sum(date_to_alert.values())/len(date_to_alert.keys())
        for date_period, num_of_alerts in date_to_alert.items():
            #[date_period] = date_to_alert[date_period]
            t1.append( [ date_period, num_of_alerts, avg] )
    print t1
    return t1

def main():
    example_table = [ 
                {'first_alerted_time':datetime(2014, 12, 11, 2, 1,45)},
                {'first_alerted_time':datetime(2014, 12, 11, 2, 2,33)},
                {'first_alerted_time':datetime(2014, 12, 11, 2, 2,45)},
                {'first_alerted_time':datetime(2014, 12, 11, 2, 3,45)},
                ]
    example_table.sort()     
    print example_table
    print rdata(example_table, date(2014,12,11), date(2014,12,12), time(00,00,00), time(00,00,00), timedelta(minutes=1)) 

Update: first attempt at an improvement:

Default Dictionary approach

def default_dict_approach(table, fromDate, toDate, fromTime, toTime, interval):
    from collections import defaultdict

    t1 = []
    start_date = datetime.combine(fromDate, fromTime)
    end_date = datetime.combine(toDate, toTime)+ interval


    times = (d['first_alerted_time'] for d in table)
    counter = defaultdict(int)
    for dt in times:
        if start_date <= dt < end_date:
            counter[to_s(dt - start_date) // to_s(interval)] += 1

    date_to_alert = {}
    date_to_alert = dict((ts*interval + start_date, count) for ts, count in counter.iteritems())

    max_num,min_num,avg = 0,0,0
    list_of_dates = list(perdelta(start_date,end_date,interval))
    if date_to_alert:
        freq_values = date_to_alert.values()
        size_freq_values = len(freq_values)
        avg = sum(freq_values)/ size_freq_values
        max_num = max(freq_values)
        if size_freq_values == len(list_of_dates):
            min_num = min(freq_values)
        else:
            min_num = 0
        for date_period in list_of_dates:
            if date_period in date_to_alert:
                t1.append([ date_period.strftime("%Y-%m-%d %H:%M"), date_to_alert[date_period], avg, max_num, min_num])
            else:
                t1.append([ date_period.strftime("%Y-%m-%d %H:%M"), 0, avg, max_num, min_num])

    return (t1,max_num,min_num,avg)

numpy approach

def numpy_approach(table, fromDate, toDate, fromTime, toTime, interval):
    date_to_alert = {}
    start_date = datetime.combine(fromDate, fromTime)
    end_date = datetime.combine(toDate, toTime)+ interval

    list_range_of_dates = []
    for date_range in perdelta(start_date ,end_date ,interval):
        list_range_of_dates.append(date_range)
    #print list_range_of_dates

    index = 0
    times = np.fromiter((d['first_alerted_time'] for d in table),
                     dtype='datetime64[us]', count=len(table))

    print times
    bins = np.fromiter(list_range_of_dates,
                       dtype=times.dtype)                
    print bins
    a, bins = np.histogram(times, bins)  
    print(dict(zip(bins[a.nonzero()].tolist(), a[a.nonzero()])))

Solution

You want to implement numpy.histogram() for dates:

import numpy as np

times = np.fromiter((d['first_alerted_time'] for d in example_table),
                     dtype='datetime64[us]', count=len(example_table))
bins = np.fromiter(date_range(start_date, end_date + step, step),
                   dtype=times.dtype)
a, bins = np.histogram(times, bins)
print(dict(zip(bins[a.nonzero()].tolist(), a[a.nonzero()])))

Output

{datetime.datetime(2014, 12, 11, 2, 0): 3,
 datetime.datetime(2014, 12, 11, 2, 3): 1}

numpy.histogram() works even if the step is not constant and the times array is unsorted. If the step is constant or the input is sorted, the call can be optimized further, should you decide to use numpy.
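To make that binning rule concrete without numpy, here is a minimal standard-library sketch that counts unsorted timestamps against irregularly spaced, sorted bin edges using bisect. It is not part of the original answer: the function name count_in_bins and the edge values are hypothetical, and every bin here is half-open, whereas numpy.histogram() also includes the right edge of its last bin. It was written with Python 3 in mind and should also run on Python 2.6, though that is untested.

```python
from bisect import bisect_right
from datetime import datetime


def count_in_bins(times, edges):
    """Count timestamps per half-open bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for dt in times:
        i = bisect_right(edges, dt) - 1  # index of the last edge <= dt
        if 0 <= i < len(counts):         # ignore values outside the edges
            counts[i] += 1
    return counts


# Hypothetical, irregularly spaced bin edges (they must be sorted).
edges = [datetime(2014, 12, 11, 2, 0),
         datetime(2014, 12, 11, 2, 1),
         datetime(2014, 12, 11, 2, 3),
         datetime(2014, 12, 11, 2, 10)]

# Timestamps from the question's example table, deliberately unsorted.
times = [datetime(2014, 12, 11, 2, 3, 45),
         datetime(2014, 12, 11, 2, 1, 45),
         datetime(2014, 12, 11, 2, 2, 45),
         datetime(2014, 12, 11, 2, 2, 33)]

counts = count_in_bins(times, edges)
print(counts)  # [0, 3, 1]
```

Each lookup costs O(log number_of_bins), so the input never needs sorting; only the edges do.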

There are two general approaches that you could use on Python 2.6 to implement the equivalent of numpy.histogram():

  • itertools.groupby-based: the input must be sorted, but it allows a single-pass, constant-memory algorithm
  • collections.defaultdict-based: the input may be unsorted; it is also a linear-time algorithm, but it uses O(number_of_nonempty_bins) memory

groupby()-based solution:

from itertools import groupby

times = (d['first_alerted_time'] for d in example_table)
bins = date_range(start_date, end_date + step, step)
def key(dt, end=[next(bins)]):
    while end[0] <= dt:
        end[0] = next(bins)
    return end[0]
print dict((end-step, sum(1 for _ in g)) for end, g in groupby(times, key=key))

It produces the same output as the histogram()-based approach.

Note: all dates that are less than start_date are put in the first bin.
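If out-of-range timestamps should be dropped rather than folded into the first bin, one option is to filter the input before grouping. The following is a sketch under that assumption, not the answer's code: count_sorted and bin_start are hypothetical names, the input is assumed to be sorted, and the result is keyed by the start of each bin rather than its end.

```python
from datetime import datetime, timedelta
from itertools import groupby


def count_sorted(times, start, end, step):
    """Count sorted timestamps in [start, end) per bin of width step."""
    # Drop anything outside the requested range before grouping.
    in_range = (dt for dt in times if start <= dt < end)
    state = [start]  # start of the current bin, kept in a list so it is mutable

    def bin_start(dt):
        # Advance the current bin until it contains dt (relies on sorted input).
        while state[0] + step <= dt:
            state[0] += step
        return state[0]

    return [(b, sum(1 for _ in g)) for b, g in groupby(in_range, key=bin_start)]


times = [datetime(2014, 12, 11, 1, 59, 0),   # before start: filtered out
         datetime(2014, 12, 11, 2, 1, 45),
         datetime(2014, 12, 11, 2, 2, 33),
         datetime(2014, 12, 11, 2, 2, 45),
         datetime(2014, 12, 11, 2, 3, 45)]

result = count_sorted(times,
                      datetime(2014, 12, 11, 2, 0),
                      datetime(2014, 12, 11, 2, 5),
                      timedelta(minutes=1))
print(result)
```

Like the groupby() version above, this reports only non-empty bins and makes a single pass over the data.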

defaultdict()-based solution

from collections import defaultdict

def to_s(td): # for Python 2.6
    return td.days*86400 + td.seconds #NOTE: ignore microseconds

times = (d['first_alerted_time'] for d in example_table)
counter = defaultdict(int)
for dt in times:
    if start_date <= dt < end_date:
        counter[to_s(dt - start_date) // to_s(step)] += 1

print dict((ts*step + start_date, count) for ts, count in counter.iteritems())

The output is the same as the other two solutions.
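The question also asks for zero counts in empty intervals, plus the peak, minimum and average. A possible way to build that on top of the defaultdict approach is sketched below. The function name frequency_with_stats is hypothetical, and the choice to average over all bins (including empty ones) is an assumption; the question's own code averages over non-empty bins only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def to_s(td):
    # Whole seconds in a timedelta; microseconds are ignored, as in the answer.
    return td.days * 86400 + td.seconds


def frequency_with_stats(times, start, end, step):
    """Return (rows, peak, low, avg) for bins of width step over [start, end)."""
    counter = defaultdict(int)
    for dt in times:
        if start <= dt < end:
            counter[to_s(dt - start) // to_s(step)] += 1

    n_bins = -(-to_s(end - start) // to_s(step))  # ceiling division
    # One row per bin, with 0 for bins that received no messages.
    rows = [(start + i * step, counter.get(i, 0)) for i in range(n_bins)]
    counts = [c for _, c in rows]
    if not counts:
        return rows, 0, 0, 0.0
    return rows, max(counts), min(counts), float(sum(counts)) / len(counts)


times = [datetime(2014, 12, 11, 2, 1, 45),
         datetime(2014, 12, 11, 2, 2, 33),
         datetime(2014, 12, 11, 2, 2, 45),
         datetime(2014, 12, 11, 2, 3, 45)]

rows, peak, low, avg = frequency_with_stats(
    times,
    datetime(2014, 12, 11, 2, 0),
    datetime(2014, 12, 11, 2, 6),   # exclusive end, so the 02:05 bin is included
    timedelta(minutes=1))

for bin_start, count in rows:
    print(bin_start, count)
print(peak, low, avg)
```

With the example table this yields the per-minute counts 0, 1, 2, 1, 0, 0 that the question expects. Memory grows with the total number of bins here, because every bin is materialized in order to report the zeros.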
