Why is pandas.to_datetime slow for non-standard time formats such as '2014/12/31'?


Question

I have a .csv file in the following format:

timestmp, p
2014/12/31 00:31:01:9200, 0.7
2014/12/31 00:31:12:1700, 1.9
...

When I read it via pd.read_csv and convert the time strings to datetime using pd.to_datetime, the performance drops dramatically. Here is a minimal example:

import re
import pandas as pd

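d = '2014-12-12 01:02:03.0030'  # date written with dashes
c = re.sub('-', '/', d)         # same timestamp written with slashes

# %timeit is an IPython magic, so run these lines in IPython or Jupyter
%timeit pd.to_datetime(d)
%timeit pd.to_datetime(c)
%timeit pd.to_datetime(c, format="%Y/%m/%d %H:%M:%S.%f")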

The timings are:

10000 loops, best of 3: 62.4 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 82.9 µs per loop

So, how can I improve the performance of pd.to_datetime when reading dates from a csv file?

Recommended answer

This is because pandas falls back to dateutil.parser.parse to parse the strings when they are in a non-default format or when no format string is supplied. That parser is much more flexible, but also slower.
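The cost of flexible parsing can be seen outside pandas as well. The following is a minimal sketch, not a reproduction of what pandas does internally: it compares dateutil.parser.parse with datetime.strptime from the standard library on a single slash-separated string. The variable names and the repeat count of 2000 are arbitrary choices for illustration, and the timings will vary by machine and library version.

```python
import timeit
from datetime import datetime

from dateutil import parser

c = "2014/12/24 01:02:03"
fmt = "%Y/%m/%d %H:%M:%S"

# Both approaches should yield the same datetime for this input.
parsed_flexible = parser.parse(c)            # has to work out the layout itself
parsed_explicit = datetime.strptime(c, fmt)  # is told the layout up front

# Time 2000 repetitions of each approach (the count is an arbitrary choice).
t_flexible = timeit.timeit(lambda: parser.parse(c), number=2000)
t_explicit = timeit.timeit(lambda: datetime.strptime(c, fmt), number=2000)

print(f"dateutil.parser.parse: {t_flexible:.4f} s")
print(f"datetime.strptime:     {t_explicit:.4f} s")
```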

As you have shown above, you can improve the performance by supplying a format string to to_datetime. Another option is to use infer_datetime_format=True.

Apparently, infer_datetime_format cannot infer the format when there are microseconds. With an example without them, you can see a large speed-up:

In [28]: d = '2014-12-24 01:02:03'

In [29]: c = re.sub('-', '/', d)

In [30]: s_c = pd.Series([c]*10000)

In [31]: %timeit pd.to_datetime(s_c)
1 loops, best of 3: 1.14 s per loop

In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True)
10 loops, best of 3: 105 ms per loop

In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S")
10 loops, best of 3: 99.5 ms per loop
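Applied to the file layout from the question, the explicit-format approach might look like the sketch below. It makes several assumptions: the file content is replaced by an in-memory string holding the two sample rows, skipinitialspace=True is used to drop the space after each comma, and the trailing digits after the last colon are treated as a fractional second via %f. Whether that reading of the last field is correct depends on how the data was produced.

```python
import io

import pandas as pd

# In-memory stand-in for the .csv file, using the two sample rows from the question.
csv_text = (
    "timestmp, p\n"
    "2014/12/31 00:31:01:9200, 0.7\n"
    "2014/12/31 00:31:12:1700, 1.9\n"
)

# skipinitialspace=True removes the blank that follows each comma.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Note the ":" before %f, matching the sample rows rather than the "." used
# in the minimal example. Assumption: the last field is a fractional second.
df["timestmp"] = pd.to_datetime(df["timestmp"], format="%Y/%m/%d %H:%M:%S:%f")

print(df.dtypes)
print(df)
```

Passing the format once for the whole column avoids having the layout guessed for each string.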
