Python - Vectorized difference of dates in 1 million row table


Problem description

I have the following pandas DataFrame:

Date                    
2018-04-10 21:05:00        
2018-04-10 21:05:00        
2018-04-10 21:10:00        
2018-04-10 21:15:00     
2018-04-10 21:35:00     

My goal is to compute the number of rows that fall within 20 minutes before and 20 minutes after each time (rows with the exact same time count toward both). For example, for 21:10 the "before" count is 3 (the two 21:05 rows plus the row itself) and the "after" count is 2 (the row itself plus 21:15; 21:35 is 25 minutes away). Something like the following:

Date                   nr_20_min_bef    nr_20_min_after   
2018-04-10 21:05:00          2                 4                                 
2018-04-10 21:05:00          2                 4  
2018-04-10 21:10:00          3                 2
2018-04-10 21:15:00          4                 2
2018-04-10 21:35:00          2                 1

I have tried a for loop iterating over all rows; the problem is that the whole series has more than a million rows, so I was looking for a more efficient solution. My current approach uses pandas functions:

import pandas as pd

df = pd.DataFrame(pd.to_datetime(['2018-04-10 21:05:00',
                                  '2018-04-10 21:05:00',
                                  '2018-04-10 21:10:00',
                                  '2018-04-10 21:15:00',
                                  '2018-04-10 21:35:00']), columns=['Date'])

nr_20_min_bef = []
nr_20_min_after = []

for i in range(0, len(df)):
    nr_20_min_bef.append(df.Date.between(
        df.Date[i] - pd.offsets.DateOffset(minutes=20), df.Date[i], inclusive=True).sum())
    nr_20_min_after.append(df.Date.between(
        df.Date[i], df.Date[i] + pd.offsets.DateOffset(minutes=20), inclusive=True).sum())

A vectorized solution would probably be ideal for this case; however, I do not really know how to write one.

Thank you.

Answer

The good news is that it is possible to vectorize this. The bad news is... it's not exactly simple.

Here is the code benchmarking orig against alt with perfplot:

import numpy as np
import pandas as pd
import perfplot

def orig(df):
    nr_20_min_bef = []
    nr_20_min_after = []

    for i in range(0, len(df)):
        nr_20_min_bef.append(df.Date.between(
            df.Date[i] - pd.offsets.DateOffset(minutes=20), df.Date[i], inclusive = True).sum())
        nr_20_min_after.append(df.Date.between(
            df.Date[i], df.Date[i] + pd.offsets.DateOffset(minutes=20), inclusive = True).sum())
    df['nr_20_min_bef'] = nr_20_min_bef
    df['nr_20_min_after'] = nr_20_min_after
    return df

def alt(df):
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df['num'] = 1
    df = df.set_index('Date')

    # Collapse duplicate dates into per-timestamp counts, then take a
    # 20-minute rolling sum over them for the "before" column
    dup_count = df.groupby(level=0)['num'].count()
    result = dup_count.rolling('20T', closed='both').sum()
    df['nr_20_min_bef'] = result.astype(int)

    # For the "after" column, roll over a reversed pseudo-datetime index
    # (explained below), then flip the result back into original order
    max_date = df.index.max()
    min_date = df.index.min()
    dup_count_reversed = df.groupby((max_date - df.index)[::-1] + min_date)['num'].count()
    result = dup_count_reversed.rolling('20T', closed='both').sum()
    result = pd.Series(result.values[::-1], dup_count.index)
    df['nr_20_min_after'] = result.astype(int)
    df = df.drop('num', axis=1)
    df = df.reset_index()
    return df

def make_df(N):
    dates = (np.array(['2018-04-10'], dtype='M8[m]') 
             + (np.random.randint(10, size=N).cumsum()).astype('<i8').astype('<m8[m]'))
    df = pd.DataFrame({'Date': dates})
    return df

def check(df1, df2):
    return df1.equals(df2)

perfplot.show(
    setup=make_df,
    kernels=[orig, alt],
    n_range=[2**k for k in range(4,10)],
    logx=True,
    logy=True,
    xlabel='N',
    equality_check=check)

which shows alt to be significantly faster than orig:

[perfplot benchmark plot: runtime vs. N, log-log scale]

In addition to benchmarking orig and alt, perfplot.show also checks that the DataFrames returned by orig and alt are equal. Given the complexity of alt, this at least gives us some assurance that it behaves the same as orig.

It's a little difficult to make a perfplot for large N since orig starts taking quite a long time and each benchmark is repeated hundreds of times. So here are a few spot %timeit comparisons for larger N:

| N     | orig (ms) | alt (ms) |
|-------+-----------+----------|
| 2**10 |      3040 |     9.32 |
| 2**12 |     12600 |     10.8 |
| 2**20 |         ? |      909 |

In [300]: df = make_df(2**10)
In [301]: %timeit orig(df)
1 loop, best of 3: 3.04 s per loop
In [302]: %timeit alt(df)
100 loops, best of 3: 9.32 ms per loop
In [303]: df = make_df(2**12)
In [304]: %timeit orig(df)
1 loop, best of 3: 12.6 s per loop
In [305]: %timeit alt(df)
100 loops, best of 3: 10.8 ms per loop
In [306]: df = make_df(2**20)
In [307]: %timeit alt(df)
1 loop, best of 3: 909 ms per loop


Now what is alt doing? Perhaps it is easiest to look at a small example using the df you posted:

df = pd.DataFrame(pd.to_datetime(['2018-04-10 21:05:00',        
                                  '2018-04-10 21:05:00',        
                                  '2018-04-10 21:10:00',        
                                  '2018-04-10 21:15:00',     
                                  '2018-04-10 21:35:00']),columns = ['Date'])

The main idea is to use Series.rolling to perform a rolling sum. When the Series has a DatetimeIndex, Series.rolling can accept a time offset for the window size, so we can calculate rolling sums over variable-length windows covering a fixed time span. The first step is therefore to make the dates a DatetimeIndex:

df['Date'] = pd.to_datetime(df['Date'])
df['num'] = 1
df = df.set_index('Date')

Since df has duplicate dates, group by the DatetimeIndex values and count the number of duplicates:

dup_count = df.groupby(level=0)['num'].count()
# Date
# 2018-04-10 21:05:00    2
# 2018-04-10 21:10:00    1
# 2018-04-10 21:15:00    1
# 2018-04-10 21:35:00    1
# Name: num, dtype: int64

Now compute the rolling sum on dup_count:

result = dup_count.rolling('20T', closed='both').sum()
# Date
# 2018-04-10 21:05:00    2.0
# 2018-04-10 21:10:00    3.0
# 2018-04-10 21:15:00    4.0
# 2018-04-10 21:35:00    2.0
# Name: num, dtype: float64

Voilà, that's nr_20_min_bef. '20T' specifies the window size to be 20 minutes long. closed='both' specifies that each window includes both its left and right endpoints.
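
If the effect of closed is not obvious, here is a tiny sketch of the difference (my own illustration, not part of the original answer):

import pandas as pd

s = pd.Series(1, index=pd.to_datetime(['2018-04-10 21:05:00',
                                       '2018-04-10 21:10:00',
                                       '2018-04-10 21:25:00']))

# Default for offset-based windows is closed='right': each window is
# (t - 20min, t], so at 21:25 the 21:05 row falls out -> 1.0, 2.0, 2.0
print(s.rolling('20T').sum().tolist())

# closed='both': each window is [t - 20min, t], so 21:05 is still
# inside the 21:25 window -> 1.0, 2.0, 3.0
print(s.rolling('20T', closed='both').sum().tolist())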

Now if only computing nr_20_min_after were as simple. In theory, all we need to do is reverse the order of the rows in dup_count and compute another rolling sum. Unfortunately, Series.rolling demands that the DatetimeIndex be monotonically increasing:

In [275]: dup_count[::-1].rolling('20T', closed='both').sum()
ValueError: index must be monotonic

Since the obvious way is blocked, we take a detour:

max_date = df.index.max()
min_date = df.index.min()
dup_count_reversed = df.groupby((max_date - df.index)[::-1] + min_date)['num'].count()
# Date
# 2018-04-10 21:05:00    1
# 2018-04-10 21:25:00    1
# 2018-04-10 21:30:00    1
# 2018-04-10 21:35:00    2
# Name: num, dtype: int64

This generates a new pseudo-datetime DatetimeIndex to group by:

In [288]: (max_date - df.index)[::-1] + min_date
Out[288]: 
DatetimeIndex(['2018-04-10 21:05:00', '2018-04-10 21:25:00',
               '2018-04-10 21:30:00', '2018-04-10 21:35:00',
               '2018-04-10 21:35:00'],
              dtype='datetime64[ns]', name='Date', freq=None)

These values may not be in df.index -- but that's okay. The only things we need are that the values be monotonically increasing and that the differences between the datetimes equal the differences in df.index, reversed.
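
As a quick sanity check (my own addition, reusing max_date, min_date and the indexed df from above), the gaps in the pseudo-index are exactly the gaps in df.index read backwards:

import numpy as np

pseudo = (max_date - df.index)[::-1] + min_date
# Consecutive differences of the pseudo-dates equal the consecutive
# differences of the real dates, in reverse order
assert (np.diff(pseudo.values) == np.diff(df.index.values)[::-1]).all()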

Now using this reversed dup_count, we can enjoy the big win (in performance) by taking the rolling sum:

result = dup_count_reversed.rolling('20T', closed='both').sum()
# Date
# 2018-04-10 21:05:00    1.0
# 2018-04-10 21:25:00    2.0
# 2018-04-10 21:30:00    2.0
# 2018-04-10 21:35:00    4.0
# Name: num, dtype: float64

result has the values we desire for nr_20_min_after, but in reversed order and with the wrong index. Here is how we can correct that:

result = pd.Series(result.values[::-1], dup_count.index)
# Date
# 2018-04-10 21:05:00    4.0
# 2018-04-10 21:10:00    2.0
# 2018-04-10 21:15:00    2.0
# 2018-04-10 21:35:00    1.0
# dtype: float64

And that's basically all there is to alt.
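
To tie everything together, calling alt on the small example DataFrame as originally constructed (the one with the 'Date' column, before it was set as the index) should reproduce the table from the question:

print(alt(df))
#                  Date  nr_20_min_bef  nr_20_min_after
# 0 2018-04-10 21:05:00              2                4
# 1 2018-04-10 21:05:00              2                4
# 2 2018-04-10 21:10:00              3                2
# 3 2018-04-10 21:15:00              4                2
# 4 2018-04-10 21:35:00              2                1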
