Python: Filter DataFrame in Pandas by hour, day and month grouped by year
Being new to Pandas, I had to dig a lot to find a solution to this problem. I would like to know a better way to solve it, bearing in mind that I still need to resolve the border problems.
I have a set of 10-minute measurements of "Power" from 2009 to 2012 and want to select a window of hours and days around a given day and month in every year (i.e. filter by hour, day and month, grouped by year).
This is what I have come up with so far:
import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, daysWindow, hoursWindow):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    df_filtered = None
    # Split by year, then by (month, day) within each year
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        groupedMonthDay = groupYear.groupby(lambda x : (x.month, x.day))
        for monthDay, groupMonthDay in groupedMonthDay:
            # Keep only the days inside the window around (month, day)
            if monthDay >= (month,day - daysWindow) and monthDay <= (month,day + daysWindow):
                # Within each kept day, keep only the rows inside the hour window
                new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
                if df_filtered is None:
                    df_filtered = new_df
                else:
                    df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=8, daysWindow=1, hoursWindow=1)
print(len(df))
print(len(df_filtered))
Which returns as output:
>>>
157825
117
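The count of 117 can be checked by hand. The sketch below assumes, as the call above implies, a window from 07:00 to 09:00 with both endpoints included, three days (7, 8 and 9 October), and three years that contain an October (2009, 2010 and 2011, since the data stops on 1 August 2012). The variable names are illustrative only.

```python
import datetime

import pandas as pd

# One day of 10-minute timestamps (144 samples).
day = pd.date_range("2009-10-08", periods=144, freq="10min")

# indexer_between_time includes both endpoints by default,
# so 07:00, 07:10, ..., 09:00 are all selected.
per_day = len(day.indexer_between_time(datetime.time(7), datetime.time(9)))

days_in_window = 3       # 7, 8 and 9 October
years_with_october = 3   # 2009, 2010, 2011
total = per_day * days_in_window * years_with_october

print(per_day, total)    # 13 117
```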
Of course, this code still needs to handle border cases, for example when selecting hour 1 with an hoursWindow of 2:
>>> filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "D:\tmp\test_filtro.py", line 40, in filter
    new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
ValueError: hour must be in 0..23
A similar issue would arise when selecting a day close to a month boundary, such as 1 or 30.
How could this code be improved?
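The failure comes from passing an out-of-range integer (here 1 - 2 = -1) straight to datetime.time. One possible way around it, shown here only as a sketch, is to wrap the hour with modulo 24 and to let datetime.timedelta handle the day arithmetic. Wrapping is an assumption about the desired behaviour: it turns "hour 1 with a window of 2" into 23:00 to 03:00, whereas clamping would give 00:00 to 03:00. As far as I know, pandas' between_time treats a start time later than the end time as a window spanning midnight, which is what makes the wrapped bounds usable.

```python
import datetime

hour, hours_window = 1, 2

# Reproduces the failure: -1 is not a valid hour.
try:
    datetime.time(hour - hours_window)
except ValueError as exc:
    print("ValueError:", exc)

# Wrapping with modulo 24 keeps both bounds valid.
start = datetime.time((hour - hours_window) % 24)
end = datetime.time((hour + hours_window) % 24)
print(start, end)                            # 23:00:00 03:00:00

# Day borders: timedelta arithmetic rolls over month ends on its own.
first = datetime.date(2010, 10, 1)
print(first - datetime.timedelta(days=1))    # 2010-09-30
```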
Update: this version of the filter function ensures there are no border issues:
import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, minute=0, daysWindow=1, hoursWindow=1):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values
    @type day: int
    @param day: Day to focus on
    @type month: int
    @param month: Month to focus on
    @type hour: int
    @param hour: Hour to focus on
    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection
    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection
    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        # Build the day window with date arithmetic so month borders are handled
        date = datetime.date(year, month, day)
        dateStart = date - datetime.timedelta(days=daysWindow)
        dateEnd = date + datetime.timedelta(days=daysWindow+1)
        df_filtered_days = df[dateStart:dateEnd]
        # Clamp the hour window to the valid range 0..23
        timeStart = datetime.time(0 if hour-hoursWindow < 0 else hour-hoursWindow, minute)
        timeEnd = datetime.time(23 if hour+hoursWindow > 23 else hour+hoursWindow, minute)
        new_df = df_filtered_days.ix[df_filtered_days.index.indexer_between_time(timeStart, timeEnd)]
        if df_filtered is None:
            df_filtered = new_df
        else:
            df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
print(len(df))
print(len(df_filtered))
Output is:
>>>
157825
174
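For readers on a current pandas, where .ix and DataFrame.append are no longer available, the same idea can be sketched with boolean masks, between_time and pd.concat. This is not the code from the post: filter_window and its parameter names are hypothetical, the minute argument is left out, and the hour window is clamped to 00:00-23:00 as in the update above. A day that does not exist in every year (29 February) would raise an error in this sketch.

```python
import datetime

import numpy as np
import pandas as pd


def filter_window(df, day, month, hour, days_window=1, hours_window=1):
    """Hypothetical rewrite of the post's filter(); names are illustrative.

    Keeps rows within +/- days_window days of (month, day) in every year
    and within +/- hours_window hours of `hour`, clamped to 00:00-23:00.
    """
    start_time = datetime.time(max(hour - hours_window, 0))
    end_time = datetime.time(min(hour + hours_window, 23))

    pieces = []
    for year in df.index.year.unique():
        centre = pd.Timestamp(year=int(year), month=month, day=day)
        first = centre - pd.Timedelta(days=days_window)
        # Exclusive upper bound: midnight after the last day of the window.
        last = centre + pd.Timedelta(days=days_window + 1)
        days = df[(df.index >= first) & (df.index < last)]
        if not days.empty:
            # between_time includes both endpoints by default.
            pieces.append(days.between_time(start_time, end_time))

    # A single pd.concat replaces the repeated DataFrame.append calls.
    return pd.concat(pieces) if pieces else df.iloc[0:0]


dates = pd.date_range(start="2009-08-01", end="2012-08-01", freq="10min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Power": rng.random(len(dates)) * 1500}, index=dates)

print(len(df))                                                          # 157825
print(len(filter_window(df, day=8, month=10, hour=8)))                  # 117
print(len(filter_window(df, day=8, month=10, hour=1, hours_window=2)))  # 171
```

The last call returns 171 rows rather than the 174 shown above. The difference appears to come from the inclusive slice df[dateStart:dateEnd] in the post, which also picks up the single 00:00 sample of 10 October in each of the three years. The exclusive upper bound used here drops it.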