How can I parse multiple (unknown) date formats in Python?

Question

I have a bunch of Excel documents that I am extracting dates from. I am trying to convert these to a standard format so I can put them in a database. Is there a function I can throw these strings at and get a standard format back? Here is a small sample of my data:

The good thing is that I know the order is always month/day.

10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
 03-07-2009
09/01/2010

I'd like to get them all into MM/DD/YYYY format. Is there a way I can do this without trying each pattern against the string?

Recommended answer

import re

ss = '''10/02/09
07/22/09
09-08-2008
9/9/2008
11/4/2010
03-07-2009
09/01/2010'''


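regx = re.compile('[-/]')  # the separator is either '-' or '/'
for xd in ss.splitlines():
    m, d, y = regx.split(xd)
    # pad month and day to two digits; prefix two-digit years with '20'
    print(xd, '   ', '/'.join((m.zfill(2), d.zfill(2), '20' + y.zfill(2) if len(y) == 2 else y)))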

Result

10/02/09     10/02/2009
07/22/09     07/22/2009
09-08-2008     09/08/2008
9/9/2008     09/09/2008
11/4/2010     11/04/2010
03-07-2009     03/07/2009
09/01/2010     09/01/2010
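
The same idea can be wrapped in a small reusable function. The following is only a sketch for Python 3: the name `normalize_date` is hypothetical, it assumes the input is always month/day/year and that two-digit years belong to the 2000s (as in the sample data), and it adds a `datetime` sanity check that the original answer does not perform.

```python
import re
from datetime import datetime

_SEP = re.compile(r'[-/]')  # the two separators seen in the sample data


def normalize_date(text):
    """Return text as MM/DD/YYYY (hypothetical helper, month-first input assumed)."""
    m, d, y = _SEP.split(text.strip())
    if len(y) == 2:
        y = '20' + y  # assumption: two-digit years are in the 2000s
    result = '/'.join((m.zfill(2), d.zfill(2), y))
    # strptime raises ValueError for impossible dates such as 13/45/2009
    datetime.strptime(result, '%m/%d/%Y')
    return result


if __name__ == '__main__':
    for raw in ('10/02/09', '9/9/2008', ' 03-07-2009'):
        print(repr(raw), '->', normalize_date(raw))
```

A string that does not split into exactly three parts also raises `ValueError` (from the tuple unpacking), so the caller can catch a single exception type for any malformed input.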

Edit 1

And Edit 2: taking into account the information on '{0:0>2}'.format(day) from JBernardo, I added a 4th solution, which appears to be the fastest.

import re
from time import perf_counter
from datetime import datetime

iterat = 100

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

reobj = re.compile(
r"""\s*  # optional whitespace
(\d+)    # Month
[-/]     # separator
(\d+)    # Day
[-/]     # separator
(?:20)?  # century (optional)
(\d+)    # years (YY)
\s*      # optional whitespace""",
re.VERBOSE)

# 1) rewrite with the regex, then round-trip through datetime
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.sub(r"\1/\2/20\3", date) for date in dates)
    fdates1 = [datetime.strftime(datetime.strptime(date, "%m/%d/%Y"), "%m/%d/%Y")
               for date in ndates]
print("Tim's method   ", perf_counter() - te, 'seconds')


regx = re.compile('[-/]')

# 2) capture with Tim's regex, pad with zfill
te = perf_counter()
for i in range(iterat):
    ndates = (reobj.match(date).groups() for date in dates)
    fdates2 = ['%s/%s/20%s' % tuple(x.zfill(2) for x in tu) for tu in ndates]
print("mixing solution", perf_counter() - te, 'seconds')

# 3) split on the separator, pad with zfill
te = perf_counter()
for i in range(iterat):
    ndates = (regx.split(date.strip()) for date in dates)
    fdates3 = ['/'.join((m.zfill(2), d.zfill(2), ('20' + y.zfill(2) if len(y) == 2 else y)))
               for m, d, y in ndates]
print("eyquem's method", perf_counter() - te, 'seconds')

# 4) capture with Tim's regex, pad with str.format
te = perf_counter()
for i in range(iterat):
    fdates4 = ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups()) for date in dates]
print("Tim + format   ", perf_counter() - te, 'seconds')


# all four methods must produce the same list
print(fdates1 == fdates2 == fdates3 == fdates4)

Result

number of iteration's turns : 100
Tim's method    0.295053700959 seconds
mixing solution 0.0459111423379 seconds
eyquem's method 0.0192239516475 seconds
Tim + format    0.0153756971906 seconds 
True
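
Hand-rolled timing loops are sensitive to whatever else the machine is doing. As a hedged alternative, the standard-library `timeit` module can repeat a measurement and report the best run. The sketch below compares only two of the four approaches; the function names `split_method` and `format_method` are made up for illustration, and the absolute numbers will differ from the figures quoted above.

```python
import re
import timeit

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']

regx = re.compile(r'[-/]')
reobj = re.compile(r'\s*(\d+)[-/](\d+)[-/](?:20)?(\d+)\s*')


def split_method():
    # split on the separator, then pad each piece
    parts = (regx.split(date.strip()) for date in dates)
    return ['/'.join((m.zfill(2), d.zfill(2), '20' + y if len(y) == 2 else y))
            for m, d, y in parts]


def format_method():
    # capture the pieces with the regex, pad with str.format
    return ['{:0>2}/{:0>2}/20{}'.format(*reobj.match(date).groups())
            for date in dates]


if __name__ == '__main__':
    for func in (split_method, format_method):
        # repeat() returns one total per run; the minimum is the least noisy
        best = min(timeit.repeat(func, number=1000, repeat=5))
        print(func.__name__, best, 'seconds per 1000 calls')
```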

The mixing solution is interesting because it combines the speed of my solution with the ability of Tim Pietzcker's regex to detect dates in a string.
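
To illustrate what detecting dates in a string could look like in practice, here is a minimal sketch that pulls month/day/year dates out of free text. The pattern is a tightened variant of Tim Pietzcker's regex (one or two digits for month and day, exactly two digits captured for the year), and both the pattern and the function name `find_dates` are assumptions made for this example, not part of the original answers.

```python
import re

# Assumed variant of Tim's regex, bounded so it can be used with finditer
DATE_RE = re.compile(r"""
    (?<!\d)      # not preceded by another digit
    (\d{1,2})    # month
    [-/]         # separator
    (\d{1,2})    # day
    [-/]         # separator
    (?:20)?      # century (optional)
    (\d{2})      # year (YY)
    (?!\d)       # not followed by another digit
""", re.VERBOSE)


def find_dates(text):
    """Return every date found in text, formatted as MM/DD/YYYY."""
    return ['{:0>2}/{:0>2}/20{}'.format(*m.groups())
            for m in DATE_RE.finditer(text)]


if __name__ == '__main__':
    print(find_dates('Shipped 9/9/2008, invoiced 10/02/09, paid 03-07-2009.'))
    # prints ['09/09/2008', '10/02/2009', '03/07/2009']
```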

That is even more true of the solution combining Tim's regex with {:0>2} formatting. I can't combine {:0>2} with my own method, because regx.split(date.strip()) yields a year with either 2 or 4 digits.
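
The obstacle is that the format template needs a year of known width. One possible workaround, offered as a sketch rather than as the author's method, is to keep only the last two digits of whatever year the split produces and let the template supply the `20` prefix. This assumes, like the rest of the answer, that every date falls in the 2000s; the function name `split_and_format` is hypothetical.

```python
import re

regx = re.compile(r'[-/]')

dates = ['10/02/09', '07/22/09', '09-08-2008', '9/9/2008', '11/4/2010',
         ' 03-07-2009', '09/01/2010']


def split_and_format(date):
    m, d, y = regx.split(date.strip())
    # y is '09' or '2009'; y[-2:] is '09' in both cases
    return '{:0>2}/{:0>2}/20{:0>2}'.format(m, d, y[-2:])


if __name__ == '__main__':
    print([split_and_format(date) for date in dates])
```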
