How to determine appropriate strftime format from a date string?


Question

The dateutil parser does a great job of correctly guessing the date and time from a wide variety of sources.

We are processing files in which each file uses only one date/time format, but the format varies between files. Profiling shows that a lot of time is spent in dateutil.parser.parse. Since the format only needs to be determined once per file, something that does not guess the format on every call could speed things up.

I don't actually know the formats in advance, so I still need to infer the format. Something like:

from MysteryPackage import date_string_to_format_string
from datetime import datetime

# e.g. mystring = '1 Jan 2016'
myformat = None

...

# somewhere in a loop reading from a file or connection:
if myformat is None:
    myformat = date_string_to_format_string(mystring)

# do the usual checks to see if that worked, then:
mydatetime = datetime.strptime(mystring, myformat)

Is there such a function?

Recommended answer

This is a tricky one. My approach makes use of regular expressions and the (?(DEFINE)...) syntax, which is only supported by the newer regex module.

Essentially, DEFINE lets us define subroutines before matching them, so first of all we define all the bricks needed for our date-guessing function:

bricks = re.compile(r"""
    (?(DEFINE)
        (?P<year_def>[12]\d{3})
        (?P<year_short_def>\d{2})
        (?P<month_def>January|February|March|April|May|June|
        July|August|September|October|November|December)
        (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
        (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
        (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
        (?P<hms_def>T?\d{2}:\d{2}:\d{2})
        (?P<hm_def>T?\d{2}:\d{2})
        (?P<ms_def>\d{5,6})
        (?P<delim_def>([-/., ]+|(?<=\d|^)T))
    )
    # actually match them
    (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
    (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
    """, re.VERBOSE)

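The (?(DEFINE)...) construct requires the third-party regex module. For readers restricted to the standard library, the same "classify one part" step can be approximated with a plain list of patterns and fullmatch. This is a swapped-in technique, not the approach used in the answer, and the names BRICKS and classify are hypothetical:

```python
import re

# (name, pattern) pairs, tried in order. The names mirror the answer's groups.
BRICKS = [
    ("hms", r"T?\d{2}:\d{2}:\d{2}"),
    ("year", r"[12]\d{3}"),
    ("month", r"January|February|March|April|May|June|"
              r"July|August|September|October|November|December"),
    ("month_short", r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"),
    ("day", r"0[1-9]|[12][0-9]|3[01]|[1-9]"),
    ("weekday", r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"),
    ("weekday_short", r"Mon|Tue|Wed|Thu|Fri|Sat|Sun"),
    ("hm", r"T?\d{2}:\d{2}"),
    ("delim", r"[-/., ]+|T"),
    ("ms", r"\d{5,6}"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in BRICKS]


def classify(part):
    """Return the name of the first brick that matches the whole part, or None."""
    for name, pattern in COMPILED:
        if pattern.fullmatch(part):
            return name
    return None


for part in ["2006", "Tuesday", "08:42:51", "Aug", ", ", "xyz"]:
    print(repr(part), "->", classify(part))
```

Because fullmatch anchors both ends, the explicit ^ and $ from the original pattern are not needed here.
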
After this, we need to think of possible delimiters:

# delim
delim = re.compile(r'([-/., ]+|(?<=\d)T)')
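
To see why this pattern is wrapped in a capturing group, it may help to look at what split() returns: with a capturing group, the delimiters themselves are kept in the result, which is what later lets them be copied into the format string. A small illustration using the standard library re module (the pattern itself needs nothing from regex); the helper name split_parts is hypothetical:

```python
import re

# Same delimiter pattern as above; the outer group makes split() keep the delimiters.
DELIM = re.compile(r'([-/., ]+|(?<=\d)T)')


def split_parts(datestring):
    """Split a date string into value parts and the delimiters between them."""
    return DELIM.split(datestring)


print(split_parts('2006-11-02'))
print(split_parts('Aug 9, 1995'))
print(split_parts('2017-06-07T22:02:05.001811'))
```

The lookbehind (?<=\d) means a T only counts as a delimiter when it follows a digit, so the T in "Thu" or "Tuesday" is left alone.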

The format mapping:

# formats
formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

The function GuessFormat() splits the string into parts with the help of the delimiters, tries to match each part and outputs the corresponding strftime() codes:

def GuessFormat(datestring):

    # define the bricks
    bricks = re.compile(r"""
            (?(DEFINE)
                (?P<year_def>[12]\d{3})
                (?P<year_short_def>\d{2})
                (?P<month_def>January|February|March|April|May|June|
                July|August|September|October|November|December)
                (?P<month_short_def>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
                (?P<day_def>(?:0[1-9]|[1-9]|[12][0-9]|3[01]))
                (?P<weekday_def>(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day)
                (?P<weekday_short_def>Mon|Tue|Wed|Thu|Fri|Sat|Sun)
                (?P<hms_def>T?\d{2}:\d{2}:\d{2})
                (?P<hm_def>T?\d{2}:\d{2})
                (?P<ms_def>\d{5,6})
                (?P<delim_def>([-/., ]+|(?<=\d|^)T))
            )
            # actually match them
            (?P<hms>^(?&hms_def)$)|(?P<year>^(?&year_def)$)|(?P<month>^(?&month_def)$)|(?P<month_short>^(?&month_short_def)$)|(?P<day>^(?&day_def)$)|
            (?P<weekday>^(?&weekday_def)$)|(?P<weekday_short>^(?&weekday_short_def)$)|(?P<hm>^(?&hm_def)$)|(?P<delim>^(?&delim_def)$)|(?P<ms>^(?&ms_def)$)
            """, re.VERBOSE)

    # delim
    delim = re.compile(r'([-/., ]+|(?<=\d)T)')

    # formats
    formats = {'ms': '%f', 'year': '%Y', 'month': '%B', 'month_dec': '%m', 'day': '%d', 'weekday': '%A', 'hms': '%H:%M:%S', 'weekday_short': '%a', 'month_short': '%b', 'hm': '%H:%M', 'delim': ''}

    parts = delim.split(datestring)
    out = []
    for index, part in enumerate(parts):
        try:
            brick = dict(filter(lambda x: x[1] is not None, bricks.match(part).groupdict().items()))
            key = next(iter(brick))

            # ambiguities
            if key == 'day' and index == 2:
                key = 'month_dec'

            item = part if key == 'delim' else formats[key]
            out.append(item)
        except AttributeError:
            out.append(part)

    return "".join(out)
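
To connect this back to the question, here is a minimal sketch of the "infer once per file, then reuse" pattern. It assumes every line of the input holds exactly one date string in the same format. The name parse_file_dates is hypothetical, and the guessing function is passed in as a parameter so that GuessFormat() or any other inference routine can be plugged in:

```python
from datetime import datetime


def parse_file_dates(lines, guess_format):
    """Yield datetimes, calling guess_format() only for the first non-empty line."""
    fmt = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if fmt is None:
            fmt = guess_format(text)  # the expensive guess happens once
        yield datetime.strptime(text, fmt)


# Stand-in for GuessFormat() so that the sketch runs on its own.
def fake_guess(datestring):
    return "%Y-%m-%d"


dates = list(parse_file_dates(["2006-11-02\n", "2006-11-03\n"], fake_guess))
print(dates)
```

In real use, fake_guess would be replaced by GuessFormat, and a ValueError from strptime would indicate that the guessed format does not fit a later line.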

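A final test:

import regex as re  # third-party module (pip install regex); the stdlib re lacks (?(DEFINE)...)
from datetime import datetime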

datestrings = [datetime.now().isoformat(), '2006-11-02', 'Thursday, 10 August 2006 08:42:51', 'August 9, 1995', 'Aug 9, 1995', 'Thu, 01 Jan 1970 00:00:00', '21/11/06 16:30', 
'06 Jun 2017 20:33:10']

# test
for dt in datestrings:
    print("Date: {}, Format: {}".format(dt, GuessFormat(dt)))

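This yields the following. Note that the index == 2 rule in GuessFormat() relabels any day-like number in that position as %m, so several of these guesses (for example %B %m, %Y) cannot be used as they stand: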

Date: 2017-06-07T22:02:05.001811, Format: %Y-%m-%dT%H:%M:%S.%f
Date: 2006-11-02, Format: %Y-%m-%d
Date: Thursday, 10 August 2006 08:42:51, Format: %A, %m %B %Y %H:%M:%S
Date: August 9, 1995, Format: %B %m, %Y
Date: Aug 9, 1995, Format: %b %m, %Y
Date: Thu, 01 Jan 1970 00:00:00, Format: %a, %m %b %Y %H:%M:%S
Date: 21/11/06 16:30, Format: %d/%m/%d %H:%M
Date: 06 Jun 2017 20:33:10, Format: %d %b %Y %H:%M:%S
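
Since the guess is heuristic, it seems prudent to check a format before trusting it for a whole file. The sketch below is one possible sanity check, not part of the original answer: it rejects formats that repeat a directive or specify the month more than once (both occur in the listing above), and then confirms that strptime accepts the string. The name validate_format is hypothetical, and the month-name checks assume the default English locale:

```python
import re
from datetime import datetime

MONTH_DIRECTIVES = {"m", "B", "b"}


def validate_format(datestring, fmt):
    """Return True if fmt looks self-consistent and strptime can parse datestring."""
    # Ignore literal percent signs, then collect the directive letters.
    directives = re.findall(r"%([A-Za-z])", fmt.replace("%%", ""))
    # A directive used twice (e.g. '%d/%m/%d') cannot be right.
    if len(directives) != len(set(directives)):
        return False
    # The month should be given by at most one of %m, %B, %b.
    if sum(d in MONTH_DIRECTIVES for d in directives) > 1:
        return False
    try:
        datetime.strptime(datestring, fmt)
    except (ValueError, re.error):
        return False
    return True


print(validate_format('2006-11-02', '%Y-%m-%d'))
print(validate_format('21/11/06 16:30', '%d/%m/%d %H:%M'))
print(validate_format('August 9, 1995', '%B %m, %Y'))
```

A check like this cannot catch every wrong guess (a swapped day and month can still parse), so it is a filter rather than a proof of correctness.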
