Slow pd.to_datetime()


Question

I am reading two types of CSV files that are very similar. They are about the same length, 20,000 lines each. Each line represents parameters recorded every second, so the first column is the timestamp.

  • In the first file, the pattern is: 2018-09-24 15:38
  • In the second file, the pattern is: 2018-09-24 03:38:06 PM

In both cases, the command is the same:

import pandas as pd

data = pd.read_csv(file)  # file: path to one of the two CSV files
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

I checked the execution time of both lines:

  • pd.read_csv is efficient in both cases
  • The second line of the code takes about 3 to 4 seconds to execute

The only difference is the date pattern. I would not have suspected that. Do you know why? Do you know how to fix this?

Recommended answer

pandas.to_datetime is extremely slow (in certain instances) when it needs to parse the dates automatically. Since it seems like you know the formats, you should explicitly pass them to the format parameter, which will greatly improve the speed.

Here is an example:

import pandas as pd
df1 = pd.DataFrame({'Timestamp': ['2018-09-24 15:38:06']*10**5})
df2 = pd.DataFrame({'Timestamp': ['2018-09-24 03:38:06 PM']*10**5})

%timeit pd.to_datetime(df1.Timestamp)
#21 ms ± 50.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(df2.Timestamp)
#14.3 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's roughly 700x slower. Now specify the format explicitly:

%timeit pd.to_datetime(df2.Timestamp, format='%Y-%m-%d %I:%M:%S %p')
#384 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

pandas is still parsing the second date format more slowly, but it's not nearly as bad as it was before.
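
To tie this back to the read_csv workflow from the question, here is a minimal sketch that runs as a plain script, without IPython's %timeit. The in-memory CSV and its "value" column are made up for illustration and stand in for the second file; only the Timestamp column and its 12-hour pattern come from the question. The elapsed time will vary with machine and pandas version.

```python
import io
import time

import pandas as pd

# Hypothetical stand-in for the second CSV file from the question:
# one 12-hour AM/PM timestamp per row plus a made-up "value" column.
n_rows = 20_000
csv_text = "Timestamp,value\n" + "".join(
    f"2018-09-24 03:38:06 PM,{i}\n" for i in range(n_rows)
)

data = pd.read_csv(io.StringIO(csv_text))

# Pass the known format explicitly so pandas does not have to infer it.
start = time.perf_counter()
data["Timestamp"] = pd.to_datetime(
    data["Timestamp"], format="%Y-%m-%d %I:%M:%S %p"
)
elapsed = time.perf_counter() - start

print(data["Timestamp"].dtype)    # a datetime64 dtype
print(data["Timestamp"].iloc[0])  # 2018-09-24 15:38:06
print(f"parsed {n_rows} rows in {elapsed:.3f} s")
```

Note that %I (12-hour clock) together with %p (AM/PM) is what makes "03:38:06 PM" come out as 15:38:06; using %H here would not interpret the PM marker as intended.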
