Pandas Vectorized Date Offset Operations with Vector of Differing Offsets


Question

I am trying to do the following, but it seems that vectorized operations in this mode are not supported.

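import pandas as pd

df = pd.DataFrame([[2017, 1, 15, 1],
                   [2017, 1, 15, 2],
                   [2017, 1, 15, 3],
                   [2017, 1, 15, 4],
                   [2017, 1, 15, 5],
                   [2017, 1, 15, 6],
                   [2017, 1, 15, 7]],
                  columns=['year', 'month', 'day', 'month_offset'])
# Build the dates from the component columns
# (pd.datetime is not available in current pandas releases)
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
# One MonthEnd offset object per row, each with a different month count
df['offset'] = df.apply(lambda g: pd.offsets.MonthEnd(int(g.month_offset)), axis=1)
# Adding the column of offset objects is what triggers the PerformanceWarning
df['date_offset'] = df.date + df.offset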

This is the warning returned for the last statement in the code snippet:

C:\Python3.5.2.3\WinPython-64bit-3.5.2.3\python-3.5.2.amd64\lib\site-packages\pandas\core\ops.py:533: PerformanceWarning: Adding/subtracting array of DateOffsets to Series not vectorized
  "Series not vectorized", PerformanceWarning)

I would like this to work as a vectorized operation because of the performance benefits.

Thanks.

To end, a comparison of methods following on from @john-zwinck's answer:

import time
import pandas as pd
import numpy as np

df = pd.DataFrame([[2017, 1, 1, 1],
                   [2017, 1, 1, 2],
                   [2017, 1, 1, 3],
                   [2017, 1, 1, 4],
                   [2017, 1, 1, 5],
                   [2017, 1, 1, 6],
                   [2017, 1, 1, 7]],
                  columns=['year', 'month', 'day', 'month_offset'])

# Build the dates from the component columns
# (pd.datetime is not available in current pandas releases)
df['mydate'] = pd.to_datetime(df[['year', 'month', 'day']])

# Method 1: row-wise apply with a pandas MonthEnd offset
start_time = time.time()
df['pandas_offset'] = df.apply(
    lambda g: g.mydate + pd.offsets.MonthEnd(int(g.month_offset)), axis=1)
end_time = time.time()
print('Method 1 {} seconds'.format(end_time - start_time))

# Method 2: NumPy arithmetic at month precision, then step back one day
start_time = time.time()
df['numpy_offset'] = (
    (df.mydate.values.astype('M8[M]')
     + df.month_offset.values * np.timedelta64(1, 'M')).astype('M8[D]')
    - np.timedelta64(1, 'D'))
end_time = time.time()
print('Method 2 with numpy vectorization {} seconds'.format(end_time - start_time))

Results:

   year  month  day  month_offset     mydate    offset1      final
0  2017      1    1             1 2017-01-01 2017-01-31 2017-01-31
1  2017      1    1             2 2017-01-01 2017-02-28 2017-02-28
2  2017      1    1             3 2017-01-01 2017-03-31 2017-03-31
3  2017      1    1             4 2017-01-01 2017-04-30 2017-04-30
4  2017      1    1             5 2017-01-01 2017-05-31 2017-05-31
5  2017      1    1             6 2017-01-01 2017-06-30 2017-06-30
6  2017      1    1             7 2017-01-01 2017-07-31 2017-07-31


runfile('C:/bitbucket/test/vector_dates.py', wdir='C:/bitbucket/test')
Method 1 0.003999948501586914 seconds
Method 2 with numpy vectorization 0.0009999275207519531 seconds

Clearly, NumPy is much faster.
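Seven rows and a single `time.time()` reading give only a rough indication. A more robust comparison might look like the sketch below. It is an illustration, not part of the original post: the frame `big`, the row count, the random data, and the helper names `with_apply` and `with_numpy` are all assumptions. The dates are kept away from month ends so that both methods are expected to agree.

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 10_000  # arbitrary size chosen for illustration
rng = np.random.default_rng(0)
big = pd.DataFrame({
    # Dates between 2017-01-01 and 2017-01-20, so none is a month end
    "mydate": pd.Timestamp("2017-01-01")
    + pd.to_timedelta(rng.integers(0, 20, n_rows), unit="D"),
    "month_offset": rng.integers(1, 13, n_rows),
})


def with_apply(frame):
    # Row-wise: one MonthEnd object is built and applied per row
    return frame.apply(
        lambda g: g.mydate + pd.offsets.MonthEnd(int(g.month_offset)), axis=1)


def with_numpy(frame):
    # Array-wise: month-precision arithmetic, then step back one day
    months = frame["mydate"].to_numpy().astype("M8[M]")
    steps = frame["month_offset"].to_numpy() * np.timedelta64(1, "M")
    return (months + steps).astype("M8[D]") - np.timedelta64(1, "D")


# Time several repetitions of each method instead of a single run
print("apply:", timeit.timeit(lambda: with_apply(big), number=3))
print("numpy:", timeit.timeit(lambda: with_numpy(big), number=3))
```

The absolute timings depend on the machine and the pandas version, so no figures are quoted here.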

Recommended answer

A truly vectorized way to do this is to construct an array of numpy.timedelta64 from month_offset, add this to the array of dates (truncated to month precision), then subtract numpy.timedelta64(1, 'D') to go back to the last day of the previous month.
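The steps above can be sketched as follows. This is a minimal illustration rather than the answerer's exact code. The helper name `month_end_offset` and the sample data are made up for the example. It also assumes that no input date already falls on the last day of its month: for such a date `pd.offsets.MonthEnd(1)` moves on to the next month end, whereas this arithmetic returns the same date.

```python
import numpy as np
import pandas as pd


def month_end_offset(dates, month_offsets):
    """Hypothetical helper: month end reached after n month-end steps.

    Assumes no date in `dates` is itself the last day of its month.
    """
    # Truncate each date to month precision, e.g. 2017-01-15 -> 2017-01
    months = np.asarray(dates).astype("datetime64[M]")
    # Turn the integer offsets into month-sized timedeltas
    steps = np.asarray(month_offsets) * np.timedelta64(1, "M")
    # First day of the month n months ahead
    first_of_target = (months + steps).astype("datetime64[D]")
    # One day back lands on the last day of the previous month
    return first_of_target - np.timedelta64(1, "D")


df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-15", "2017-01-15", "2016-01-15"]),
    "month_offset": [1, 2, 2],
})
df["date_offset"] = month_end_offset(df["date"], df["month_offset"])
print(df)
```

For these three rows the `date_offset` values are 2017-01-31, 2017-02-28 and 2016-02-29, which match what `date + pd.offsets.MonthEnd(n)` gives for mid-month dates.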

Solutions using apply(lambda) are likely to be much slower. As the warning says, some pandas date offset operations are not vectorized, so it is better to avoid them if your data are large. NumPy facilities such as busday_offset() and timedelta64 operate on whole arrays at once.
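As a small illustration of the `busday_offset()` facility mentioned above, the sketch below shifts each date by a different number of business days in one call. The sample dates and offsets are invented for the example, and the default Monday-to-Friday week mask with no holidays is assumed.

```python
import numpy as np

# 2017-01-02 is a Monday and 2017-01-06 is a Friday
dates = np.array(["2017-01-02", "2017-01-02", "2017-01-06"],
                 dtype="datetime64[D]")
# A different business-day offset for each element
offsets = np.array([1, 5, 1])

# One vectorized call; weekends are skipped automatically
shifted = np.busday_offset(dates, offsets)
print(shifted)  # ['2017-01-03' '2017-01-09' '2017-01-09']
```

Both arguments broadcast like ordinary NumPy arrays, so a per-row offset column needs no Python-level loop.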
