Slow performance of pandas timestamp vs datetime
Problem description
I seem to be running into unexpectedly slow performance of arithmetic operations on pandas.Timestamp vs python regular datetime() objects.
Here is a benchmark that demonstrates:
import datetime
import pandas
import numpy
# using datetime:
def test1():
d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
delta = datetime.timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using pandas:
def test2():
d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
delta = pandas.Timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using numpy
def test3():
d1 = numpy.datetime64('2015-03-20 10:00:00')
d2 = numpy.datetime64('2015-03-20 10:00:15')
delta = numpy.timedelta64(30, 'm')
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()
print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))
And corresponding results on my machine (Python 3.3, pandas 0.15.2):
DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389
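The per-operation gap can also be isolated with timeit. The sketch below substitutes pandas.Timestamp for pandas.datetime (which is only a re-export of datetime.datetime), so that the pandas type is actually the one being exercised; absolute numbers will vary by machine:

```python
import timeit

# Time a single subtract-and-compare, 500,000 times, per implementation.
setup_dt = ("import datetime; "
            "d1 = datetime.datetime(2015, 3, 20, 10, 0, 0); "
            "d2 = datetime.datetime(2015, 3, 20, 10, 0, 15); "
            "delta = datetime.timedelta(minutes=30)")
setup_pd = ("import pandas; "
            "d1 = pandas.Timestamp('2015-03-20 10:00:00'); "
            "d2 = pandas.Timestamp('2015-03-20 10:00:15'); "
            "delta = pandas.Timedelta(minutes=30)")

t_dt = timeit.timeit("d2 - d1 > delta", setup=setup_dt, number=500000)
t_pd = timeit.timeit("d2 - d1 > delta", setup=setup_pd, number=500000)
print('datetime:', t_dt, 'pandas:', t_pd)
```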
Is this expected?
Are there ways to eliminate the performance problem other than switching code to Python's default datetime implementation as much as possible?
Answer:

I don't know your use case, so I'm just going to create a simple example comparing datetime/list vs pandas.datetime/dataframe.
tldr: for a small dataset, just use datetime and a list. For a larger dataset, use pandas.datetime and a dataframe.
import datetime
import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)
# 1,000 hourly timestamps starting at d1, as a pandas Series and a plain list
ts_pandas = pd.Series(pd.date_range(d1, periods=1000, freq='H'))
ts_list = ts_pandas.tolist()
# vectorized subtraction vs an element-wise Python loop
delta_pandas = ts_pandas - d2
delta_list = [t - d2 for t in ts_list]
Before timing, let's check that we get the same answers:
for i in range(5): print(delta_pandas[i], delta_list[i])
-1 days +19:00:00 -1 days +19:00:00
-1 days +20:00:00 -1 days +20:00:00
-1 days +21:00:00 -1 days +21:00:00
-1 days +22:00:00 -1 days +22:00:00
-1 days +23:00:00 -1 days +23:00:00
Looks good for size of 1000, let's time things for sizes ranging from 10 to 100,000:
for sz in [10,100,1000,100000]:
ts_pandas = pd.Series( pd.date_range(d1, periods=sz, freq='H'))
ts_list = ts_pandas.tolist()
%timeit [ t - d2 for t in ts_list ]
%timeit ts_pandas - d2
1000 loops, best of 3: 247 µs per loop # size = 10
1000 loops, best of 3: 601 µs per loop
100 loops, best of 3: 2.55 ms per loop # size = 100
1000 loops, best of 3: 682 µs per loop
10 loops, best of 3: 23.6 ms per loop # size = 1,000
1000 loops, best of 3: 616 µs per loop
1 loops, best of 3: 2.41 s per loop # size = 100,000
100 loops, best of 3: 3.32 ms per loop
Hopefully these are the results you'd expect. The speed of the list-based calculation is linear in size. The speed of the pandas-based calculation is basically constant for sizes from 10 to 1,000 (due to pandas overhead), and it should become roughly linear at some point. I didn't try to pin down exactly where, but even at 10,000 it was about the same speed, so the crossover is somewhere between 10,000 and 100,000.
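Applied back to the original counting benchmark, the same idea replaces the 500,000-iteration Python loop with a single vectorized comparison. This is only a sketch, reusing hourly example timestamps as the data to count over:

```python
import datetime
import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)
delta = pd.Timedelta(minutes=30)

# 100,000 hourly timestamps; subtract and compare in one vectorized pass
ts_pandas = pd.Series(pd.date_range(d1, periods=100000, freq='H'))
# count = how many timestamps fall more than 30 minutes after d2
count = int(((ts_pandas - d2) > delta).sum())
print(count)
```

For scalar-heavy code that genuinely cannot be vectorized, converting Timestamps back to plain datetimes with Timestamp.to_pydatetime() before the loop recovers ordinary datetime arithmetic speed.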