Convert list of datestrings to datetime very slow with Python strptime


Problem description

I have data files containing lists of strings that represent ISO-formatted dates. Currently, I am reading them in using:

mydates = [datetime.datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in timedata]

This looks quite straightforward, but it is ridiculously slow on large lists of ~25000 dates: about 0.34 seconds per converted list. Since I have thousands of such lists, I am looking for a faster way, but I have not found one yet. The dateutil parser performs even worse...

Recommended answer

Indexing/slicing seems to be faster than the regex used by @NPE:

In [47]: def with_indexing(dstr):                              
   ....:     return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
   ....:                               dstr[11:13], dstr[14:16], dstr[17:]])) 

In [48]: p = re.compile('[-T:]')

In [49]: def with_regex(dt_str):
   ....:     return datetime.datetime(*map(int, p.split(dt_str)))

In [50]: %timeit with_regex(dstr)
100000 loops, best of 3: 3.84 us per loop

In [51]: %timeit with_indexing(dstr)
100000 loops, best of 3: 2.98 us per loop
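
The comparison above can be reproduced outside IPython with a small script. This is only a sketch: `make_sample` and `benchmark` are hypothetical helpers, not part of the original answer, and the sample data is generated rather than read from real files. It also includes `datetime.datetime.fromisoformat` (available from Python 3.7) as one more candidate that the original answer does not mention. Timings depend on the machine and Python version, so none are quoted here.

```python
import datetime
import re
import timeit

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"
_SEPARATORS = re.compile("[-T:]")


def with_strptime(dstr):
    # Baseline from the question: general-purpose format parsing.
    return datetime.datetime.strptime(dstr, ISO_FORMAT)


def with_regex(dstr):
    # Split on '-', 'T' and ':' and pass the six integer fields to datetime.
    return datetime.datetime(*map(int, _SEPARATORS.split(dstr)))


def with_indexing(dstr):
    # Slice fixed positions of 'YYYY-MM-DDTHH:MM:SS'.
    return datetime.datetime(*map(int, [dstr[:4], dstr[5:7], dstr[8:10],
                                        dstr[11:13], dstr[14:16], dstr[17:]]))


def with_fromisoformat(dstr):
    # Assumption: Python 3.7+; not part of the original answer.
    return datetime.datetime.fromisoformat(dstr)


def make_sample(n=25000):
    # Hypothetical test data: n hourly timestamps from 2010-01-01T00:30:00.
    start = datetime.datetime(2010, 1, 1, 0, 30)
    return [(start + datetime.timedelta(hours=i)).strftime(ISO_FORMAT)
            for i in range(n)]


def benchmark(parsers, sample, repeat=3):
    # Best wall-clock time (seconds) to convert the whole list, per parser.
    results = {}
    for parser in parsers:
        timings = timeit.repeat(lambda: [parser(s) for s in sample],
                                number=1, repeat=repeat)
        results[parser.__name__] = min(timings)
    return results


if __name__ == "__main__":
    sample = make_sample()
    parsers = [with_strptime, with_regex, with_indexing, with_fromisoformat]
    for name, seconds in benchmark(parsers, sample).items():
        print(f"{name}: {seconds:.3f} s for {len(sample)} dates")
```

All four functions assume well-formed input in exactly this layout; unlike `strptime`, the slicing version does not check the separators, so malformed strings may raise a less helpful error or go unnoticed.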

I think that if you use a file parser like numpy.genfromtxt together with its converters argument and a fast string-parsing function, you can read and parse a whole file in less than half a second.

I used the following function to create an example file with about 25000 rows, with ISO date strings as the index and 10 data columns:

import numpy as np
import pandas as pd

def create_data():
    # create dates
    dates = pd.date_range('2010-01-01T00:30', '2013-01-04T23:30', freq='H')
    # convert to iso
    iso_dates = dates.map(lambda x: x.strftime('%Y-%m-%dT%H:%M:%S'))
    # create data
    data = pd.DataFrame(np.random.random((iso_dates.size, 10)) * 100,
                        index=iso_dates)
    # write to file
    data.to_csv('dates.csv', header=False)

Then I used the following code to parse the file:

In [54]: %timeit a = np.genfromtxt('dates.csv', delimiter=',',
                                   converters={0:with_regex})
1 loops, best of 3: 430 ms per loop

In [55]: %timeit a = np.genfromtxt('dates.csv', delimiter=',',
                                   converters={0:with_indexing})
1 loops, best of 3: 391 ms per loop
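
To make the role of the converters argument concrete, here is a minimal standard-library sketch of the same idea: a per-column function is applied while the file is read, so the first column arrives as datetime objects and the rest as floats. It swaps in the `csv` module for numpy.genfromtxt, and `parse_iso`, `read_rows` and the two-row sample are hypothetical names and data chosen for illustration only.

```python
import csv
import datetime
import io


def parse_iso(dstr):
    """Slice-based parser for 'YYYY-MM-DDTHH:MM:SS' strings."""
    return datetime.datetime(
        int(dstr[:4]), int(dstr[5:7]), int(dstr[8:10]),
        int(dstr[11:13]), int(dstr[14:16]), int(dstr[17:19]),
    )


def read_rows(handle, converters):
    """Yield rows, applying converters[i] to column i and float to the rest."""
    for row in csv.reader(handle):
        yield [converters.get(i, float)(cell) for i, cell in enumerate(row)]


# Hypothetical two-row sample in the same layout as dates.csv
# (ISO timestamp first, numeric columns after it).
sample = io.StringIO(
    "2010-01-01T00:30:00,12.5,99.0\n"
    "2010-01-01T01:30:00,3.25,47.5\n"
)
rows = list(read_rows(sample, converters={0: parse_iso}))
print(rows[0][0], rows[0][1:])
```

For a real file, an open file handle would replace the `io.StringIO` object. This version is not expected to match the speed of the numpy or pandas parsers; it only shows where the date conversion plugs in.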

pandas (which is built on numpy) has a C-based file parser that is even faster:

In [56]: %timeit df = pd.read_csv('dates.csv', header=None, index_col=0, 
                                  parse_dates=True, date_parser=with_indexing)
10 loops, best of 3: 167 ms per loop
