Slow performance of pandas Timestamp vs datetime


Problem description


I seem to be running into unexpectedly slow performance of arithmetic operations on pandas.Timestamp objects vs regular Python datetime() objects.
Here is a benchmark that demonstrates:

import datetime
import pandas
import numpy

# using datetime:
def test1():
    d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
    d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
    delta = datetime.timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using pandas:
def test2():
    d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
    d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
    delta = pandas.Timedelta(minutes=30)

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1

# using numpy
def test3():
    d1 = numpy.datetime64('2015-03-20 10:00:00')
    d2 = numpy.datetime64('2015-03-20 10:00:15')
    delta = numpy.timedelta64(30, 'm')

    count = 0
    for i in range(500000):
        if d2 - d1 > delta:
            count += 1


time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()

print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))

And the corresponding results on my machine (Python 3.3, pandas 0.15.2):

DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389

Is this expected?
Are there ways to eliminate the performance problem other than switching code to Python's default datetime implementation as much as possible?
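One workaround worth sketching here (my addition, not part of the original post): when `Timestamp`/`Timedelta` objects feed a tight scalar loop, convert them once up front to the C-implemented stdlib types with `Timestamp.to_pydatetime()` and `Timedelta.to_pytimedelta()`, so only cheap `datetime` arithmetic runs 500,000 times:

```python
import pandas

# Convert the pandas objects to plain datetime/timedelta values once,
# before entering the hot loop.
d1 = pandas.Timestamp('2015-03-20 10:00:00').to_pydatetime()
d2 = pandas.Timestamp('2015-03-20 10:00:15').to_pydatetime()
delta = pandas.Timedelta(minutes=30).to_pytimedelta()

count = 0
for i in range(500000):
    # The comparison now runs on stdlib datetime/timedelta objects.
    if d2 - d1 > delta:
        count += 1
```

This only helps scalar loops; for bulk data, vectorized operations (as in the answer below) are the better fix.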

Solution

I don't know your use case, so I'm just going to create a simple example comparing datetime/list vs pandas.datetime/dataframe.

tldr: for a small dataset, just use datetime and a list. For a larger dataset, use pandas.datetime and a dataframe.

import datetime
import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)

ts_pandas = pd.Series( pd.date_range(d1, periods=1000, freq='H'))
ts_list   = ts_pandas.tolist()

delta_pandas = ts_pandas - d2
delta_list   = [ t - d2 for t in ts_list ]

Before timing, let's check that we get the same answers:

for i in range(5):  print(delta_pandas[i], delta_list[i])

-1 days +19:00:00 -1 days +19:00:00
-1 days +20:00:00 -1 days +20:00:00
-1 days +21:00:00 -1 days +21:00:00
-1 days +22:00:00 -1 days +22:00:00
-1 days +23:00:00 -1 days +23:00:00

Looks good for a size of 1000; now let's time things for sizes ranging from 10 to 100,000:

for sz in [10,100,1000,100000]:

    ts_pandas = pd.Series( pd.date_range(d1, periods=sz, freq='H'))
    ts_list   = ts_pandas.tolist()

    %timeit [ t - d2 for t in ts_list ]
    %timeit ts_pandas - d2

1000 loops, best of 3: 247 µs per loop  # size = 10
1000 loops, best of 3: 601 µs per loop

100 loops, best of 3: 2.55 ms per loop  # size = 100
1000 loops, best of 3: 682 µs per loop

10 loops, best of 3: 23.6 ms per loop   # size = 1,000
1000 loops, best of 3: 616 µs per loop

1 loops, best of 3: 2.41 s per loop     # size = 100,000
100 loops, best of 3: 3.32 ms per loop

Hopefully these are the results you'd expect. The speed of the list-based calculation is linear in the size. The speed of the pandas-based calculation is essentially constant for sizes from 10 to 1,000, because fixed pandas overhead dominates; it should become roughly linear at some point. I didn't try to pin down exactly where, but even 10,000 was about the same speed, so the crossover is somewhere between 10,000 and 100,000.
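The `%timeit` magic above only works inside IPython. As a plain-Python sketch of the same size sweep, this uses the stdlib `timeit` module (I use the lowercase `'h'` hourly alias, which current pandas prefers over the `'H'` spelling used in the answer):

```python
import datetime
import timeit

import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)

for sz in [10, 100, 1000, 100000]:
    ts_pandas = pd.Series(pd.date_range(d1, periods=sz, freq='h'))
    ts_list = ts_pandas.tolist()

    # Total time over 10 runs of each approach, reported as ms per run.
    t_list = timeit.timeit(lambda: [t - d2 for t in ts_list], number=10)
    t_pandas = timeit.timeit(lambda: ts_pandas - d2, number=10)
    print('size=%-7d list: %8.3f ms   pandas: %8.3f ms'
          % (sz, 100 * t_list, 100 * t_pandas))
```

The printed numbers will differ from the answer's, but the shape should match: the list column grows linearly with size while the pandas column stays nearly flat until the vectorized work outweighs the fixed overhead.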
