Slow performance of pandas timestamp vs datetime
Problem description
I seem to be running into unexpectedly slow performance of arithmetic operations on pandas.Timestamp vs python regular datetime() objects.
Here is a benchmark that demonstrates:
import datetime
import pandas
import numpy
# using datetime:
def test1():
d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 10, 0, 15)
delta = datetime.timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using pandas:
def test2():
d1 = pandas.datetime(2015, 3, 20, 10, 0, 0)
d2 = pandas.datetime(2015, 3, 20, 10, 0, 15)
delta = pandas.Timedelta(minutes=30)
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
# using numpy
def test3():
d1 = numpy.datetime64('2015-03-20 10:00:00')
d2 = numpy.datetime64('2015-03-20 10:00:15')
delta = numpy.timedelta64(30, 'm')
count = 0
for i in range(500000):
if d2 - d1 > delta:
count += 1
time1 = datetime.datetime.now()
test1()
time2 = datetime.datetime.now()
test2()
time3 = datetime.datetime.now()
test3()
time4 = datetime.datetime.now()
print('DELTA test1: ' + str(time2-time1))
print('DELTA test2: ' + str(time3-time2))
print('DELTA test3: ' + str(time4-time3))
And corresponding results on my machine (Python 3.3, pandas 0.15.2):
DELTA test1: 0:00:00.131698
DELTA test2: 0:00:10.034970
DELTA test3: 0:00:05.233389
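The per-operation gap can also be isolated with timeit. The sketch below substitutes pandas.Timestamp for pandas.datetime (which is only a re-export of datetime.datetime), so that the pandas type is actually the one being exercised; absolute numbers will vary by machine:

```python
import timeit

# Time a single subtract-and-compare, 500,000 times, per implementation.
setup_dt = ("import datetime; "
            "d1 = datetime.datetime(2015, 3, 20, 10, 0, 0); "
            "d2 = datetime.datetime(2015, 3, 20, 10, 0, 15); "
            "delta = datetime.timedelta(minutes=30)")
setup_pd = ("import pandas; "
            "d1 = pandas.Timestamp('2015-03-20 10:00:00'); "
            "d2 = pandas.Timestamp('2015-03-20 10:00:15'); "
            "delta = pandas.Timedelta(minutes=30)")

t_dt = timeit.timeit("d2 - d1 > delta", setup=setup_dt, number=500000)
t_pd = timeit.timeit("d2 - d1 > delta", setup=setup_pd, number=500000)
print('datetime:', t_dt, 'pandas:', t_pd)
```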
Is this expected?
Are there ways to eliminate the performance problem other than switching code to Python's default datetime implementation as much as possible?
Answer:

I don't know your use case, so I'm just going to create a simple example comparing datetime/list vs pandas.datetime/dataframe.
tldr: for a small dataset, just use datetime and a list. For a larger dataset, use pandas.datetime and a dataframe.
import datetime
import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)
# 1,000 hourly timestamps starting at d1, as a pandas Series and a plain list
ts_pandas = pd.Series(pd.date_range(d1, periods=1000, freq='H'))
ts_list = ts_pandas.tolist()
# vectorized subtraction vs an element-wise Python loop
delta_pandas = ts_pandas - d2
delta_list = [t - d2 for t in ts_list]
Before timing, let's check that we get the same answers:
for i in range(5): print(delta_pandas[i], delta_list[i])
-1 days +19:00:00 -1 days +19:00:00
-1 days +20:00:00 -1 days +20:00:00
-1 days +21:00:00 -1 days +21:00:00
-1 days +22:00:00 -1 days +22:00:00
-1 days +23:00:00 -1 days +23:00:00
Looks good for size of 1000, let's time things for sizes ranging from 10 to 100,000:
for sz in [10,100,1000,100000]:
ts_pandas = pd.Series( pd.date_range(d1, periods=sz, freq='H'))
ts_list = ts_pandas.tolist()
%timeit [ t - d2 for t in ts_list ]
%timeit ts_pandas - d2
1000 loops, best of 3: 247 µs per loop # size = 10
1000 loops, best of 3: 601 µs per loop
100 loops, best of 3: 2.55 ms per loop # size = 100
1000 loops, best of 3: 682 µs per loop
10 loops, best of 3: 23.6 ms per loop # size = 1,000
1000 loops, best of 3: 616 µs per loop
1 loops, best of 3: 2.41 s per loop # size = 100,000
100 loops, best of 3: 3.32 ms per loop
Hopefully these are the results you'd expect. The speed of the list-based calculation is linear in size. The speed of the pandas-based calculation is basically constant for sizes from 10 to 1,000 (due to pandas overhead), and it should become roughly linear at some point. I didn't try to pin down exactly where, but even at 10,000 it was about the same speed, so the crossover is somewhere between 10,000 and 100,000.
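Applied back to the original counting benchmark, the same idea replaces the 500,000-iteration Python loop with a single vectorized comparison. This is only a sketch, reusing hourly example timestamps as the data to count over:

```python
import datetime
import pandas as pd

d1 = datetime.datetime(2015, 3, 20, 10, 0, 0)
d2 = datetime.datetime(2015, 3, 20, 15, 0, 0)
delta = pd.Timedelta(minutes=30)

# 100,000 hourly timestamps; subtract and compare in one vectorized pass
ts_pandas = pd.Series(pd.date_range(d1, periods=100000, freq='H'))
# count = how many timestamps fall more than 30 minutes after d2
count = int(((ts_pandas - d2) > delta).sum())
print(count)
```

For scalar-heavy code that genuinely cannot be vectorized, converting Timestamps back to plain datetimes with Timestamp.to_pydatetime() before the loop recovers ordinary datetime arithmetic speed.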