Python - Vectorized difference of dates in 1 million row table
Question

I have the following pandas DataFrame:
Date
2018-04-10 21:05:00
2018-04-10 21:05:00
2018-04-10 21:10:00
2018-04-10 21:15:00
2018-04-10 21:35:00
My goal is to compute the number of rows that are 20 minutes before and 20 minutes after each time (including rows with the same time both before and after). Something like the following:
Date nr_20_min_bef nr_20_min_after
2018-04-10 21:05:00 2 4
2018-04-10 21:05:00 2 4
2018-04-10 21:10:00 3 2
2018-04-10 21:15:00 4 2
2018-04-10 21:35:00 2 1
I have tried to perform a for loop to iterate over all rows; the problem is that the whole series has more than a million rows, so I was looking for a more efficient solution. My current approach uses pandas functions:
import pandas as pd

df = pd.DataFrame(pd.to_datetime(['2018-04-10 21:05:00',
                                  '2018-04-10 21:05:00',
                                  '2018-04-10 21:10:00',
                                  '2018-04-10 21:15:00',
                                  '2018-04-10 21:35:00']), columns=['Date'])

nr_20_min_bef = []
nr_20_min_after = []
for i in range(0, len(df)):
    nr_20_min_bef.append(df.Date.between(
        df.Date[i] - pd.offsets.DateOffset(minutes=20),
        df.Date[i], inclusive=True).sum())
    nr_20_min_after.append(df.Date.between(
        df.Date[i],
        df.Date[i] + pd.offsets.DateOffset(minutes=20), inclusive=True).sum())
Probably a vectorized solution would be ideal for this case, however, I do not really know how to do it.
Thank you.
Answer
The good news is that it is possible to vectorize this. The bad news is... it's not exactly simple.
Here is the code that benchmarks both approaches with perfplot:
import numpy as np
import pandas as pd
import perfplot

def orig(df):
    nr_20_min_bef = []
    nr_20_min_after = []
    for i in range(0, len(df)):
        nr_20_min_bef.append(df.Date.between(
            df.Date[i] - pd.offsets.DateOffset(minutes=20),
            df.Date[i], inclusive=True).sum())
        nr_20_min_after.append(df.Date.between(
            df.Date[i],
            df.Date[i] + pd.offsets.DateOffset(minutes=20), inclusive=True).sum())
    df['nr_20_min_bef'] = nr_20_min_bef
    df['nr_20_min_after'] = nr_20_min_after
    return df

def alt(df):
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df['num'] = 1
    df = df.set_index('Date')
    dup_count = df.groupby(level=0)['num'].count()
    result = dup_count.rolling('20T', closed='both').sum()
    df['nr_20_min_bef'] = result.astype(int)
    max_date = df.index.max()
    min_date = df.index.min()
    dup_count_reversed = df.groupby((max_date - df.index)[::-1] + min_date)['num'].count()
    result = dup_count_reversed.rolling('20T', closed='both').sum()
    result = pd.Series(result.values[::-1], dup_count.index)
    df['nr_20_min_after'] = result.astype(int)
    df = df.drop('num', axis=1)
    df = df.reset_index()
    return df

def make_df(N):
    dates = (np.array(['2018-04-10'], dtype='M8[m]')
             + (np.random.randint(10, size=N).cumsum()).astype('<i8').astype('<m8[m]'))
    df = pd.DataFrame({'Date': dates})
    return df

def check(df1, df2):
    return df1.equals(df2)

perfplot.show(
    setup=make_df,
    kernels=[orig, alt],
    n_range=[2**k for k in range(4, 10)],
    logx=True,
    logy=True,
    xlabel='N',
    equality_check=check)
[Plot omitted: the perfplot results show that alt is significantly faster than orig.]
In addition to benchmarking orig and alt, perfplot.show also checks that the DataFrames returned by orig and alt are equal. Given the complexity of alt, this at least gives us some assurance that it behaves the same as orig.
It's a little difficult to make a perfplot for large N since orig starts taking quite a long time and each benchmark is repeated hundreds of times. So here are a few spot %timeit comparisons for larger N:
| N | orig (ms) | alt (ms) |
|-------+-----------+----------|
| 2**10 | 3040 | 9.32 |
| 2**12 | 12600 | 10.8 |
| 2**20 | ? | 909 |
In [300]: df = make_df(2**10)
In [301]: %timeit orig(df)
1 loop, best of 3: 3.04 s per loop
In [302]: %timeit alt(df)
100 loops, best of 3: 9.32 ms per loop
In [303]: df = make_df(2**12)
In [304]: %timeit orig(df)
1 loop, best of 3: 12.6 s per loop
In [305]: %timeit alt(df)
100 loops, best of 3: 10.8 ms per loop
In [306]: df = make_df(2**20)
In [307]: %timeit alt(df)
1 loop, best of 3: 909 ms per loop
Now what is alt doing? Perhaps it is easiest to look at a small example using the df you posted:
df = pd.DataFrame(pd.to_datetime(['2018-04-10 21:05:00',
                                  '2018-04-10 21:05:00',
                                  '2018-04-10 21:10:00',
                                  '2018-04-10 21:15:00',
                                  '2018-04-10 21:35:00']), columns=['Date'])
The main idea is to use Series.rolling to perform a rolling sum. When the Series has a DatetimeIndex, Series.rolling can accept a time frequency for the window size. So we can calculate rolling sums over variable windows of a fixed time span. The first step is therefore to make the dates a DatetimeIndex:
df['Date'] = pd.to_datetime(df['Date'])
df['num'] = 1
df = df.set_index('Date')
Since df has duplicate dates, group by the DatetimeIndex values and count the number of duplicates:
dup_count = df.groupby(level=0)['num'].count()
# Date
# 2018-04-10 21:05:00 2
# 2018-04-10 21:10:00 1
# 2018-04-10 21:15:00 1
# 2018-04-10 21:35:00 1
# Name: num, dtype: int64
Now compute the rolling sum on dup_count:
result = dup_count.rolling('20T', closed='both').sum()
# Date
# 2018-04-10 21:05:00 2.0
# 2018-04-10 21:10:00 3.0
# 2018-04-10 21:15:00 4.0
# 2018-04-10 21:35:00 2.0
# Name: num, dtype: float64
Voilà, that's nr_20_min_bef. 20T specifies the window size to be 20 minutes long. closed='both' specifies that each window includes both its left and right endpoints.
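To see what closed='both' changes, here is a small sketch of my own (not from the original answer) that compares it with the default, closed='right', on the dup_count values above; '20min' is the modern spelling of the '20T' offset alias:

```python
import pandas as pd

# the unique timestamps and duplicate counts from dup_count above
idx = pd.DatetimeIndex(['2018-04-10 21:05:00', '2018-04-10 21:10:00',
                        '2018-04-10 21:15:00', '2018-04-10 21:35:00'])
dup_count = pd.Series([2, 1, 1, 1], index=idx)

# closed='both' keeps a row sitting exactly 20 minutes back in the window
both = dup_count.rolling('20min', closed='both').sum()
# the default, closed='right', drops the left endpoint of each window
right = dup_count.rolling('20min', closed='right').sum()

print(both.tolist())   # [2.0, 3.0, 4.0, 2.0]
print(right.tolist())  # [2.0, 3.0, 4.0, 1.0]
```

At 21:35 the row at 21:15 sits exactly 20 minutes back, so only closed='both' counts it, matching the requirement that the 20-minute boundary itself be included.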
Now if only computing nr_20_min_after were as simple. In theory, all we need to do is reverse the order of the rows in dup_count and compute another rolling sum. Unfortunately, Series.rolling demands that the DatetimeIndex be monotonically increasing:
In [275]: dup_count[::-1].rolling('20T', closed='both').sum()
ValueError: index must be monotonic
Since the obvious way is blocked, we take a detour:
max_date = df.index.max()
min_date = df.index.min()
dup_count_reversed = df.groupby((max_date - df.index)[::-1] + min_date)['num'].count()
# Date
# 2018-04-10 21:05:00 1
# 2018-04-10 21:25:00 1
# 2018-04-10 21:30:00 1
# 2018-04-10 21:35:00 2
# Name: num, dtype: int64
This generates a new pseudo DatetimeIndex to group by:
In [288]: (max_date - df.index)[::-1] + min_date
Out[288]:
DatetimeIndex(['2018-04-10 21:05:00', '2018-04-10 21:25:00',
'2018-04-10 21:30:00', '2018-04-10 21:35:00',
'2018-04-10 21:35:00'],
dtype='datetime64[ns]', name='Date', freq=None)
These values may not be in df.index -- but that's okay. The only thing we need is that the values are monotonically increasing and that the differences between the datetimes correspond to the differences in df.index when reversed.
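As a quick sanity check of my own (not part of the original answer), we can verify both properties of the pseudo index directly:

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2018-04-10 21:05:00', '2018-04-10 21:05:00',
                        '2018-04-10 21:10:00', '2018-04-10 21:15:00',
                        '2018-04-10 21:35:00'])
pseudo = (idx.max() - idx)[::-1] + idx.min()

# monotonically increasing, so Series.rolling accepts it
assert pseudo.is_monotonic_increasing
# gaps between consecutive pseudo dates equal the original gaps, reversed
assert (np.diff(pseudo.values) == np.diff(idx.values)[::-1]).all()
```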
Now using this reversed dup_count, we can enjoy the big win (in performance) by taking the rolling sum:
result = dup_count_reversed.rolling('20T', closed='both').sum()
# Date
# 2018-04-10 21:05:00 1.0
# 2018-04-10 21:25:00 2.0
# 2018-04-10 21:30:00 2.0
# 2018-04-10 21:35:00 4.0
# Name: num, dtype: float64
result has the values we desire for nr_20_min_after, but in reversed order and with the wrong index. Here is how we can correct that:
result = pd.Series(result.values[::-1], dup_count.index)
# Date
# 2018-04-10 21:05:00 4.0
# 2018-04-10 21:10:00 2.0
# 2018-04-10 21:15:00 2.0
# 2018-04-10 21:35:00 1.0
# dtype: float64
And that's basically all there is to alt.
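For completeness, here is an alternative vectorized sketch of my own (not from the original answer) that avoids the reversal trick entirely: since Date is sorted, np.searchsorted can locate both edges of every row's window at once.

```python
import numpy as np
import pandas as pd

def alt_searchsorted(df):
    # assumes df['Date'] is sorted ascending, as in the question
    t = pd.to_datetime(df['Date']).values
    w = np.timedelta64(20, 'm')
    out = df.copy()
    # count rows in [t - 20min, t]: both endpoints inclusive
    out['nr_20_min_bef'] = (np.searchsorted(t, t, side='right')
                            - np.searchsorted(t, t - w, side='left'))
    # count rows in [t, t + 20min]
    out['nr_20_min_after'] = (np.searchsorted(t, t + w, side='right')
                              - np.searchsorted(t, t, side='left'))
    return out

df = pd.DataFrame({'Date': pd.to_datetime(
    ['2018-04-10 21:05:00', '2018-04-10 21:05:00', '2018-04-10 21:10:00',
     '2018-04-10 21:15:00', '2018-04-10 21:35:00'])})
res = alt_searchsorted(df)
print(res['nr_20_min_bef'].tolist())    # [2, 2, 3, 4, 2]
print(res['nr_20_min_after'].tolist())  # [4, 4, 2, 2, 1]
```

On the sample data this reproduces the expected counts; I have not benchmarked it against alt.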