Filtering a pandas DataFrame by day

Question

I have a pandas DataFrame with forex data at one-minute resolution, covering one year (371,635 rows):

                           O        H        L        C
0                                                      
2017-01-02 02:00:00  1.05155  1.05197  1.05155  1.05190
2017-01-02 02:01:00  1.05209  1.05209  1.05177  1.05179
2017-01-02 02:02:00  1.05177  1.05198  1.05177  1.05178
2017-01-02 02:03:00  1.05188  1.05200  1.05188  1.05200
2017-01-02 02:04:00  1.05196  1.05204  1.05196  1.05203

I want to filter each day's data down to a range of hours:

from datetime import datetime

dt = datetime(2017, 1, 1)
df_day = df[df.index.date == dt.date()]  # compares the date of every row
df_day_t = df_day.between_time('08:30', '09:30')

If I run a for loop over 200 days, it takes minutes. I suspect that at every step this line

df_day = df[df.index.date == dt.date()]

is checking for equality against every row in the data set (even though the data set is ordered).

Is there any way to speed up the filtering, or should I just write a plain imperative for loop from January to December?

Recommended answer

Avoid Python datetime

First, you should avoid combining Python datetime with pandas operations. There are many pandas/NumPy-friendly ways to create datetime objects for comparison, e.g. pd.Timestamp and pd.to_datetime. Your performance issues here are partly due to this behaviour described in the docs:

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes the benefits of vectorisation, because the operations then require Python-level loops.
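To make the dtype point concrete, here is a minimal sketch on synthetic data. The frame `df`, its single column `C` and the chosen dates are made up for illustration and are not the asker's data. It compares the object array returned by `.date` with the normalized index, which keeps a datetime64 dtype, and checks that both masks select the same rows.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level data standing in for the forex frame (assumption):
# three full days starting 2017-01-02 00:00.
idx = pd.date_range("2017-01-02 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.linspace(1.05, 1.06, len(idx))}, index=idx)

# .date yields a NumPy array of Python datetime.date objects (object dtype).
dates_obj = df.index.date
print(dates_obj.dtype)

# normalize() sets the time to midnight but keeps a datetime64 dtype,
# so the comparison below stays vectorised.
days = df.index.normalize()
print(days.dtype)

day = pd.Timestamp("2017-01-03")
slow = df[dates_obj == day.date()]  # element-wise comparison of Python objects
fast = df[days == day]              # datetime64 comparison
```

Both selections return the same rows; only the second avoids the Python-level comparison of `datetime.date` objects.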

Pandas already has functionality to group by date by normalizing the time:

for day, df_day in df.groupby(df.index.floor('D')):
    df_day_t = df_day.between_time('08:30', '09:30')
    # do something
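
A self-contained sketch of the loop above, run on synthetic data (the frame and the `per_day` / `per_day_alt` dictionaries are hypothetical names for illustration). It also shows one variation that is not part of the original answer: since `between_time` only looks at the time of day, it can be applied once to the whole frame before grouping.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level data (assumption): three full days.
idx = pd.date_range("2017-01-02 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.arange(len(idx), dtype=float)}, index=idx)

# As in the answer: group by calendar day, then cut the time window per day.
per_day = {}
for day, df_day in df.groupby(df.index.floor("D")):
    per_day[day] = df_day.between_time("08:30", "09:30")

# Variation (not from the original answer): filter once, then group.
window = df.between_time("08:30", "09:30")
per_day_alt = {day: grp for day, grp in window.groupby(window.index.floor("D"))}

for day, df_day_t in per_day.items():
    print(day.date(), len(df_day_t))
```

Either way, the date comparison happens once for the whole frame instead of once per loop iteration.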

As another example, you can access the slice for a particular day in this way:

g = df.groupby(df.index.floor('D'))
my_day = pd.Timestamp('2017-01-02')  # must be a day present in the index, otherwise get_group raises KeyError
df_slice = g.get_group(my_day)
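
A runnable sketch of the single-day lookup on synthetic data (the frame and dates are made up for illustration). As an alternative that is not in the original answer, it also selects the same day with partial string indexing via `.loc`, which a DatetimeIndex supports and which, on a sorted index, should be able to take advantage of the ordering the asker mentions.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level data (assumption): three full days.
idx = pd.date_range("2017-01-02 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.arange(len(idx), dtype=float)}, index=idx)

# As in the answer: build the groups once, then look up one day.
g = df.groupby(df.index.floor("D"))
my_day = pd.Timestamp("2017-01-03")  # must be a day present in the data
df_slice = g.get_group(my_day)

# Alternative (not from the original answer): partial string indexing.
df_loc = df.loc["2017-01-03"]

print(len(df_slice), len(df_loc))
```

The groupby object is worth keeping around when many days are looked up; for a one-off selection the `.loc` form is shorter.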
