Hours, Date, Day Count Calculation

This article discusses how to compute hourly, daily, and day-of-week counts from timestamped log data; the question and answer below may serve as a useful reference for anyone facing a similar problem.

Problem Description


I have this huge dataset which has dates and timestamps for several days. The datetime format is UNIX time. The dataset consists of login logs.

The code is supposed to group start and end time logs and provide log counts and unique id counts.

I am trying to get some stats like:

total log counts per hour & unique login ids per hour. 

log counts for a chosen window size, i.e. 24 hrs, 12 hrs, 6 hrs, 1 hr, etc., as well as per day of the week and similar options.

I am able to split the data with start and end hours but I am not able to get the stats of counts of logs and unique ids.

Code:

from datetime import datetime, time

# This splits data from start to end time
start = time(8, 0, 0)
end = time(20, 0, 0)

with open('input', 'r') as infile, open('output', 'w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print(t1 >= start and t2 <= end)

Input data format: The data has no headers but the fields are given below. The number of days is not known in input.

UserID, StartTime, StopTime, GPS1, GPS2
00022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d9064bc,1073260801,1073260803,819251,440006
00022dba8f51,1073260801,1073260803,819251,440006
00022de1c6c1,1073260801,1073260803,819251,440006
003065f30f37,1073260801,1073260803,819251,440006
00904b48a3b6,1073260801,1073260803,819251,440006
00904b83a0ea,1073260803,1073260810,819213,439954
00904b85d3cf,1073260803,1073261920,817526,439458
00904b14b494,1073260804,1073265410,817558,439525
00904b99499c,1073260804,1073262625,817558,439525
00904bb96e83,1073260804,1073265163,817558,439525
00904bf91b75,1073260804,1073263786,817558,439525

Expected Output: Example Output

StartTime, EndTime, Day, LogCount, UniqueIDCount

00:00:00, 01:00:00, Mon, 349, 30  

StartTime and Endtime = Human readable format

Separating the data by a range of time is already achieved, but I am trying to round the times off to bucket boundaries and calculate the counts of logs and unique IDs. A solution with Pandas is also welcome.
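One simple way to get this "round off" bucketing with pandas is to floor each StartTime to its hour and group. This is only a sketch (the inline sample rows are taken from the question's data, and it assigns each log to the hour of its StartTime alone, ignoring logs that span bucket boundaries):

```python
import io
import pandas as pd

# A few sample rows in the question's CSV format (no header)
data = io.StringIO(
    "00022d9064bc,1073260801,1073260803,819251,440006\n"
    "00904b4557d3,1073260803,1073261920,817526,439458\n"
    "00022d9064bc,1073260801,1073260803,819251,440006\n"
)
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(data, header=None, names=cols)

# Round each log's StartTime down to the hour bucket it falls in
df['Start'] = pd.to_datetime(df.StartTime, unit='s')
df['Bucket'] = df.Start.dt.floor('h')

# Count rows and distinct users per hourly bucket
stats = df.groupby('Bucket').agg(
    LogCount=('UserID', 'size'),
    UniqueIDCount=('UserID', 'nunique'),
)
print(stats)
```

All three sample rows start within the same hour, so this prints a single bucket with LogCount 3 and UniqueIDCount 2.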

Edit One: more details

StartTime         --> EndTIime
1/5/2004, 5:30:01 --> 1/5/2004, 5:30:03

But that falls between 5:00:00 --> 6:00:00, so I want it counted in that bucket. The count of all the logs in each time range is what I am trying to find. Similarly for the others, like

5:00:00 --> 6:00:00 Hourly Count 
00:00:00 --> 6:00:00 Every 6 hours 
00:00:00 --> 12:00:00 Every 12 hours 

5 Jan 2004, Mon --> count 
6 Jan 2004, Tue --> Count

And so on. I am looking for a generic program where I can change the time/hours range as needed.
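A generic version of the hourly grouping above could take the frequency as a parameter via `pd.Grouper`. This is a sketch under the same assumption as before (buckets are keyed by StartTime only; the `bucket_counts` helper and the inline sample rows are illustrative, not part of the question's code):

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "00022d9064bc,1073260801,1073260803,819251,440006\n"
    "00904b4557d3,1073280000,1073281000,817526,439458\n"
)
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(csv_data, header=None, names=cols)
df['Start'] = pd.to_datetime(df.StartTime, unit='s')

def bucket_counts(df, freq):
    # Count logs and unique IDs per `freq` bucket, e.g. 'h', '6h', '12h', 'D'
    g = df.groupby(pd.Grouper(key='Start', freq=freq))
    out = g.agg(LogCount=('UserID', 'size'),
                UniqueIDCount=('UserID', 'nunique'))
    return out[out.LogCount > 0]      # drop empty buckets

print(bucket_counts(df, '6h'))        # every 6 hours
print(bucket_counts(df, 'D'))         # per day
```

Changing the frequency string is all that is needed to switch between hourly, 6-hour, 12-hour, or daily counts.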

Solution

Unfortunately I couldn't find any elegant solution.

Here is my attempt:

import numpy as np
import pandas as pd

fn = r'D:\temp\.data\dart_small.csv'
cols = ['UserID', 'StartTime', 'StopTime', 'GPS1', 'GPS2']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`,
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1h'  # 1 hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r['start'] = (r.index - pd.Timestamp('1970-01-01')).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # the calculations of m and d are slightly simplified
    # by getting rid of the division by 2,
    # which can be done by eliminating common terms
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

print(r[r.LogCount > 0])
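The condition inside the loop is the interval-overlap test mentioned in the comment. A quick standalone sanity check of the algebra (this is a sketch for verification, not part of the answer's code) shows it matches the plain `start < bucket_end and stop > bucket_start` test:

```python
def overlaps_plain(s, e, b_start, b_end):
    # strict overlap of the log interval [s, e] with the bucket [b_start, b_end]
    return s < b_end and e > b_start

def overlaps_answer(s, e, b_start, interval):
    # the answer's condition: |m - 2*b_start - interval| < d + interval,
    # with m = stop + start and d = stop - start
    m, d = e + s, e - s
    return abs(m - 2 * b_start - interval) < d + interval

b_start = 3600
interval = 60 * 60 - 1            # bucket covers seconds [3600, 7199]
cases = [(3500, 3700), (3700, 4000), (7000, 8000), (100, 200), (8000, 9000)]
for s, e in cases:
    assert overlaps_plain(s, e, b_start, b_start + interval) == \
           overlaps_answer(s, e, b_start, interval)
print("overlap algebra verified")
```

Expanding the absolute value gives `s < b_end` and `e > b_start` after cancelling common terms, which is why the division by 2 in the textbook form can be dropped.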

PS the fewer periods you have in the report DF - r, the faster it will count. So you may want to get rid of rows (times) if you know beforehand that those timeframes won't contain any data (for example during weekends, holidays, etc.)
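For instance, pruning weekend buckets from the reporting index before the loop might look like this (a hypothetical sketch with an arbitrary one-week range):

```python
import pandas as pd

# Hourly reporting index for one week starting Mon 2004-01-05
idx = pd.date_range('2004-01-05', '2004-01-12', freq='h')

# Keep only weekday buckets (Monday=0 .. Friday=4)
idx_weekdays = idx[idx.weekday < 5]
print(len(idx), '->', len(idx_weekdays))
```

The pruned index is then used to build `r`, so the overlap loop skips the weekend hours entirely.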

Result:

                          start  LogCount  UniqueIDCount  Day StartTime   EndTime
2004-01-05 00:00:00  1073260800        24             15  Mon  00:00:00  01:00:00
2004-01-05 01:00:00  1073264400         5              5  Mon  01:00:00  02:00:00
2004-01-05 02:00:00  1073268000         3              3  Mon  02:00:00  03:00:00
2004-01-05 03:00:00  1073271600         3              3  Mon  03:00:00  04:00:00
2004-01-05 04:00:00  1073275200         2              2  Mon  04:00:00  05:00:00
2004-01-06 12:00:00  1073390400        22             12  Tue  12:00:00  13:00:00
2004-01-06 13:00:00  1073394000         3              2  Tue  13:00:00  14:00:00
2004-01-06 14:00:00  1073397600         3              2  Tue  14:00:00  15:00:00
2004-01-06 15:00:00  1073401200         3              2  Tue  15:00:00  16:00:00
2004-01-10 16:00:00  1073750400        20             11  Sat  16:00:00  17:00:00
2004-01-14 23:00:00  1074121200       218             69  Wed  23:00:00  00:00:00
2004-01-15 00:00:00  1074124800        12             11  Thu  00:00:00  01:00:00
2004-01-15 01:00:00  1074128400         1              1  Thu  01:00:00  02:00:00
