How to include dynamic time?


Problem description


I am trying to pull logs by time slot. The program below works well when a number of hours is given, and the logs in that range get extracted.

But now I also want the start and end times to be given dynamically, e.g. between 8 am and 8 pm, or 6 am and 8 am, and so on.

How can I do that? An edit to the current program or a separate program would both be fine.

Input: Mini Version of INPUT

Code:

import pandas as pd
from datetime import datetime,time
import numpy as np

fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']
df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
# pd.Timestamp is used because the pd.datetime alias was removed from later pandas versions
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # the calculations of m and d are slightly simplified
    # by getting rid of the division by 2,
    # because it cancels out as a common factor
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .loc replaces the .ix indexer, which was removed in pandas 1.0
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

#r.to_csv('results.csv', index=False)
#print(r[r.LogCount > 0])
#print (r['StartTime'], r['EndTime'], r['Day'], r['LogCount'], r['UniqueIDCount'])

rout =  r[['Date', 'StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount'] ]
#print rout
rout.to_csv('one_hour.csv', index=False, header=False)

Edit:

In simple words, I should be able to give StartTime and EndTime in the program. The code below is very close to what I am trying to do, but how do I convert it to pandas?

from datetime import datetime,time

start = time(8,0,0)
end =   time(20,0,0)

with open('USC28days_0_20', 'r') as infile, open('USC28days_0_20_time','w') as outfile:
    for row in infile:
        col = row.split()
        t1 = datetime.fromtimestamp(float(col[2])).time()
        t2 = datetime.fromtimestamp(float(col[3])).time()
        print (t1 >= start and t2 <= end)

Edit Two: Working answer in Pandas

Taking a part from @MaxU's selected answer, the code below extracts the required group of logs between the given StartTime and StopTime.

import pandas as pd
from datetime import datetime,time
import numpy as np

fn = r'00_Dart.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']

df = pd.read_csv(fn, header=None, names=cols)

#df['m'] = df.StopTime + df.StartTime
#df['d'] = df.StopTime - df.StartTime

# filter input data set ... 
start_hour = 8
end_hour = 9
df = df[(pd.to_datetime(df.StartTime, unit='s').dt.hour >= start_hour) & (pd.to_datetime(df.StopTime, unit='s').dt.hour <= end_hour)]

print(df)

df.to_csv('time_hour.csv', index=False, header=False)

But: it would be a great solution if it were also possible to control the minutes and seconds.

At present this also extracts logs whose StopTime falls anywhere within the end hour, i.e. including the minutes and seconds up to the next hour.

Something like:

start_hour = 8:0:0
end_hour = 9:0:0 - 1 # -1 to get the logs until 8:59:59

But this gives me an error
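The error is expected, since `8:0:0` is not a valid Python literal. A minimal sketch of one way to express such boundaries, assuming `datetime.time` values are acceptable: `time` objects do not support subtraction, so the hypothetical helper `one_second_before` attaches an arbitrary date, subtracts a `timedelta`, and takes the time of day back out.

```python
from datetime import date, datetime, time, timedelta


def one_second_before(t):
    """Return the time of day one second before `t` (hypothetical helper)."""
    # time objects cannot be subtracted directly, so attach an arbitrary date first
    anchor = datetime.combine(date(2000, 1, 1), t)
    return (anchor - timedelta(seconds=1)).time()


start_time = time(8, 0, 0)
end_time = one_second_before(time(9, 0, 0))
print(start_time, end_time)  # 08:00:00 08:59:59
```

Note that the helper wraps around midnight: one second before `time(0, 0, 0)` is `23:59:59`.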

Solution

Try this:

import pandas as pd
from datetime import datetime,time
import numpy as np

fn = r'D:\data\gDrive\data\.stack.overflow\2016-07\dart_small.csv'
cols = ['UserID','StartTime','StopTime', 'gps1', 'gps2']

df = pd.read_csv(fn, header=None, names=cols)

df['m'] = df.StopTime + df.StartTime
df['d'] = df.StopTime - df.StartTime

# filter input data set ... 
start_hour = 8
end_hour = 20
df = df[(pd.to_datetime(df.StartTime, unit='s').dt.hour >= start_hour) & (pd.to_datetime(df.StartTime, unit='s').dt.hour <= end_hour)]


# 'start' and 'end' for the reporting DF: `r`
# which will contain equal intervals (1 hour in this case)
start = pd.to_datetime(df.StartTime.min(), unit='s').date()
end = pd.to_datetime(df.StopTime.max(), unit='s').date() + pd.Timedelta(days=1)

# building reporting DF: `r`
freq = '1H'  # 1 Hour frequency
idx = pd.date_range(start, end, freq=freq)
r = pd.DataFrame(index=idx)
r = r[(r.index.hour >= start_hour) & (r.index.hour <= end_hour)]
# pd.Timestamp is used because the pd.datetime alias was removed from later pandas versions
r['start'] = (r.index - pd.Timestamp(1970, 1, 1)).total_seconds().astype(np.int64)

# 1 hour in seconds, minus one second (so that we will not count it twice)
interval = 60*60 - 1

r['LogCount'] = 0
r['UniqueIDCount'] = 0

for i, row in r.iterrows():
    # intervals overlap test
    # https://en.wikipedia.org/wiki/Interval_tree#Overlap_test
    # the calculations of m and d are slightly simplified
    # by getting rid of the division by 2,
    # because it cancels out as a common factor
    u = df[np.abs(df.m - 2*row.start - interval) < df.d + interval].UserID
    # .loc replaces the .ix indexer, which was removed in pandas 1.0
    r.loc[i, ['LogCount', 'UniqueIDCount']] = [len(u), u.nunique()]

r['Date'] = pd.to_datetime(r.start, unit='s').dt.date
r['Day'] = pd.to_datetime(r.start, unit='s').dt.day_name().str[:3]
r['StartTime'] = pd.to_datetime(r.start, unit='s').dt.time
r['EndTime'] = pd.to_datetime(r.start + interval + 1, unit='s').dt.time

#r.to_csv('results.csv', index=False)
#print(r[r.LogCount > 0])
#print (r['StartTime'], r['EndTime'], r['Day'], r['LogCount'], r['UniqueIDCount'])

rout =  r[['Date', 'StartTime', 'EndTime', 'Day', 'LogCount', 'UniqueIDCount'] ]
#print rout
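The overlap test inside the loop may be easier to follow on a tiny example. The sketch below uses made-up log rows and a single one-hour bin starting at second 0. It checks that the doubled midpoint form used in the answer (`|m - 2*start - interval| < d + interval`) selects the same rows as the more direct comparison "the log starts before the bin ends and stops after the bin starts". The name `bin_start` and the sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up logs: start and stop are seconds since the epoch
logs = pd.DataFrame({
    "UserID": [1, 2, 3, 4],
    "StartTime": [100, 3500, 3600, 4000],
    "StopTime": [200, 3700, 3700, 5000],
})
logs["m"] = logs.StopTime + logs.StartTime  # twice the midpoint of each log
logs["d"] = logs.StopTime - logs.StartTime  # twice the half-length of each log

bin_start = 0
interval = 60 * 60 - 1  # the bin covers seconds 0..3599

# Form used in the answer: midpoints closer than the sum of the half-lengths
doubled = np.abs(logs.m - 2 * bin_start - interval) < logs.d + interval

# Direct form of the same strict overlap condition
direct = (logs.StartTime < bin_start + interval) & (logs.StopTime > bin_start)

print(doubled.tolist())
print(bool((doubled == direct).all()))
```

Because the inequality is strict, a log that only touches an endpoint of the bin is not counted.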

OLD answer:

from_time = '08:00'
to_time = '18:00'
rout.between_time(from_time, to_time).to_csv('one_hour.csv', index=False, header=False)
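For control over minutes and seconds as well, one option is to compare the time-of-day part of each timestamp against `datetime.time` boundaries instead of comparing `.dt.hour`. The following is a minimal sketch under a few assumptions: the helper names `filter_by_time_of_day` and `at` are hypothetical, the sample rows are made up, the epoch seconds are interpreted as UTC (unlike `datetime.fromtimestamp`, which uses local time), and the window does not cross midnight.

```python
from datetime import time

import pandas as pd


def filter_by_time_of_day(df, start, end):
    """Keep rows whose StartTime and StopTime both fall within [start, end].

    Hypothetical helper; epoch seconds are interpreted as UTC here.
    """
    start_t = pd.to_datetime(df["StartTime"], unit="s").dt.time
    stop_t = pd.to_datetime(df["StopTime"], unit="s").dt.time
    return df[(start_t >= start) & (stop_t <= end)]


def at(hour, minute, second):
    """Epoch seconds for a time of day on 1970-01-01 (made-up sample data)."""
    return hour * 3600 + minute * 60 + second


logs = pd.DataFrame({
    "UserID": [1, 2, 3, 4],
    "StartTime": [at(8, 0, 0), at(8, 15, 0), at(7, 59, 59), at(8, 59, 0)],
    "StopTime": [at(8, 30, 0), at(9, 0, 0), at(8, 10, 0), at(8, 59, 59)],
})

# Keep only logs that start at or after 08:00:00 and stop by 08:59:59
selected = filter_by_time_of_day(logs, time(8, 0, 0), time(8, 59, 59))
print(selected["UserID"].tolist())
```

In this sample, the log that stops at exactly 09:00:00 and the log that starts at 07:59:59 are both excluded, which the hour-only comparison cannot express.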
