Python: Filter DataFrame in Pandas by hour, day and month grouped by year


Question

Being new to Pandas, I had to dig a lot to find a solution to this problem. I would like to know a better way to solve it, taking into account that I still need to handle the border problems.

I have a set of 10-minute measurements of "Power" from 2009 to 2012 and want to select a window of hours and of day/month across all the years (i.e. filter by hour, day and month, grouped by year).

What I have come up with is as follows:

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, daysWindow, hoursWindow):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values

    @type day: int
    @param day: Day to focus on

    @type month: int
    @param month: Month to focus on

    @type hour: int
    @param hour: Hour to focus on

    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection

    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection

    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    df_filtered = None
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        groupedMonthDay = groupYear.groupby(lambda x : (x.month, x.day))
        for monthDay, groupMonthDay in groupedMonthDay:
            if monthDay >= (month,day - daysWindow) and monthDay <= (month,day + daysWindow):
                new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
                if df_filtered is None:
                    df_filtered = new_df
                else:
                    df_filtered = df_filtered.append(new_df)
    return df_filtered

df_filtered = filter(df,day=8, month=10, hour=8, daysWindow=1, hoursWindow=1)
print len(df)
print len(df_filtered)

This returns as output:

>>> 
157825
117

Of course, this code still needs to handle border cases, for example an hour of 1 with an hoursWindow of 2:

>>> filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "D:\tmp\test_filtro.py", line 40, in filter
    new_df = groupMonthDay.ix[groupMonthDay.index.indexer_between_time(datetime.time(hour - hoursWindow), datetime.time(hour + hoursWindow))]
ValueError: hour must be in 0..23

A similar issue would happen when selecting a day like 1 or 30.
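Both border cases can be reproduced without any data. The snippet below is only a minimal sketch using the standard library; the concrete values (hour 1, day 1, October 2010) are illustrative assumptions, not taken from the dataset above.

```python
import datetime

hour, hours_window = 1, 2
day, month, days_window = 1, 10, 1

# Hour border: 1 - 2 = -1 is not a valid hour, so datetime.time raises ValueError
try:
    datetime.time(hour - hours_window)
    hour_error = None
except ValueError as exc:
    hour_error = str(exc)

# Clamping to the 0..23 range avoids the exception
clamped_start = datetime.time(max(0, hour - hours_window))

# Day border: the tuple comparison uses (10, 0) as its lower bound, which is not
# a real date, so 30 September, i.e. (9, 30), is never matched
naive_lower = (month, day - days_window)

# Date arithmetic rolls over the month border correctly
real_lower = datetime.date(2010, month, day) - datetime.timedelta(days=days_window)

print(hour_error)
print(clamped_start, naive_lower, real_lower)
```

Working with real `datetime.date` values and `timedelta` offsets, rather than comparing `(month, day)` tuples, is what lets the window cross month boundaries.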

How could this code be improved?

Solution

The updated filter function below avoids the border issues:

import pandas as pd
import numpy as np
import datetime

dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])

def filter(df, day, month, hour, minute=0, daysWindow=1, hoursWindow=1):
    """
    Filter a Dataframe by a date window and hour window grouped by years

    @type df: DataFrame
    @param df: DataFrame with dates and values

    @type day: int
    @param day: Day to focus on

    @type month: int
    @param month: Month to focus on

    @type hour: int
    @param hour: Hour to focus on

    @type daysWindow: int
    @param daysWindow: Number of days to perform the days window selection

    @type hoursWindow: int
    @param hoursWindow: Number of hours to perform the hours window selection

    @rtype: DataFrame
    @return: Returns a DataFrame with the
    """
    pieces = []
    grouped = df.groupby(lambda x : x.year)
    for year, groupYear in grouped:
        # Use real dates so the day window rolls over month and year borders
        date = pd.Timestamp(datetime.date(int(year), month, day))
        dateStart = date - datetime.timedelta(days=daysWindow)
        dateEnd = date + datetime.timedelta(days=daysWindow+1)
        df_filtered_days = df.loc[dateStart:dateEnd]
        # Clamp the hour window to the valid 0..23 range
        timeStart = datetime.time(max(0, hour-hoursWindow), minute)
        timeEnd = datetime.time(min(23, hour+hoursWindow), minute)
        # indexer_between_time returns positions, so select with .iloc (.ix was removed in pandas 1.0)
        new_df = df_filtered_days.iloc[df_filtered_days.index.indexer_between_time(timeStart, timeEnd)]
        pieces.append(new_df)
    # DataFrame.append was removed in pandas 2.0; concatenate the pieces once instead
    return pd.concat(pieces) if pieces else None

df_filtered = filter(df,day=8, month=10, hour=1, daysWindow=1, hoursWindow=2)
print(len(df))
print(len(df_filtered))

Output is:

>>> 
157825
174
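
The count of 174 appears to include one extra sample per year: because the date slice is inclusive at both ends and the clamped hour window starts at 00:00, midnight of the day after the window is selected as well (3 days x 19 samples + 1 = 58 per year, over three years with October data). If that extra sample is not wanted, a half-open day interval removes it. The following is only a sketch for a recent pandas version; `filter_window` and its snake_case parameter names are hypothetical and not part of the answer above.

```python
import datetime

import numpy as np
import pandas as pd


def filter_window(df, day, month, hour, minute=0, days_window=1, hours_window=1):
    """Hypothetical variant of filter(): half-open day window, clamped hours."""
    # Clamp the hour window to 0..23, as the answer does
    time_start = datetime.time(max(0, hour - hours_window), minute)
    time_end = datetime.time(min(23, hour + hours_window), minute)

    pieces = []
    for year in sorted(set(df.index.year)):
        center = pd.Timestamp(year=int(year), month=month, day=day)
        start = center - pd.Timedelta(days=days_window)
        end = center + pd.Timedelta(days=days_window + 1)
        # Half-open interval [start, end): midnight of the following day is excluded
        in_days = df[(df.index >= start) & (df.index < end)]
        piece = in_days.between_time(time_start, time_end)
        if not piece.empty:
            pieces.append(piece)
    # Return an empty frame with the same columns when nothing matches
    return pd.concat(pieces) if pieces else df.iloc[0:0]


dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1) * 1500, index=dates, columns=["Power"])

result = filter_window(df, day=8, month=10, hour=1, days_window=1, hours_window=2)
print(len(df), len(result))
```

With the same arguments as above this selects 171 rows instead of 174. The hour window is still clamped rather than wrapped past midnight, and a day such as 29 February would still raise a `ValueError` for non-leap years; both cases are left out of this sketch.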
