Extraction of some date formats failed when using Dateutil in Python


Problem description

I went through multiple links before posting this question, so please read through. The two answers below solved 90% of my problem:

parse multiple dates using dateutil

How to parse multiple dates from a block of text in Python (or another language)

Problem: I need to parse multiple dates in multiple formats in Python

Solution from the links above: this works for most inputs, but there are still certain formats that I am not able to parse.

Formats which still can't be parsed are:

  1. text ='I want to visit from May 16-May 18'

  2. text ='I want to visit from May 16-18'

  3. text ='I want to visit from May 6 May 18'

I have also tried regex, but since dates can come in any format, I ruled that option out because the code was getting very complex. Please suggest modifications to the code presented in the links so that the above 3 formats can also be handled.

Solution

This kind of problem will always need tweaking as new edge cases appear, but the following approach is fairly robust:

from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re


def get_date_part(x):
    # Month names (full or abbreviated) are returned unchanged.
    if x.lower() in month_list:
        return x

    # Numbers, optionally followed by an ordinal suffix (1st, 2nd, 3rd, 4th).
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviated form.
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')


tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
    ]

cur_year = 2017

# Lower-cased full and abbreviated month names (index 0 of each is an empty string).
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table that turns every punctuation character into a space.
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    # Sorting puts month names first and numbers last; groupby then yields at most two groups.
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = [month_full(v) for v in values]
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

        # Only true once both the month group and the number group have been seen.
        if days and months:
            if years:
                dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
            else:
                dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in zip_longest(months, days, fillvalue=months[0])]
                years = [cur_year]

            # Fix for jumps in year
            dates = []
            start_date = datetime(years[0], 1, 1)
            next_year = years[0] + 1

            for d in dates_raw:
                if d < start_date:
                    d = d.replace(year=next_year)
                    next_year += 1
                start_date = d
                dates.append(d)

            print("{}  ->  {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))

This converts the test strings as follows:

I want to visit from May 16-May 18  ->  16/05/2017, 18/05/2017
I want to visit from May 16-18  ->  16/05/2017, 18/05/2017
I want to visit from May 6 May 18  ->  06/05/2017, 18/05/2017
May 6,7,8,9,10  ->  06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June  ->  08/05/2017, 10/06/2017
July 10/20/30  ->  10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please  ->  01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January  ->  02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan  ->  15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017  ->  01/11/2017
27th Oct 2010 until 1st jan  ->  27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012  ->  27/10/2010, 01/01/2012
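
The year jumps in the last few results (for example 2017, 2018, 2019 for '15 march, 10 feb, 5 jan') come from the "Fix for jumps in year" loop. Below is a minimal Python 3 sketch of that step in isolation. The helper name roll_years is hypothetical and not part of the original answer, and the input dates are assumed to have already been placed in the starting year:

```python
from datetime import datetime


def roll_years(dates_raw, first_year):
    """Move any date that precedes its predecessor into a later year."""
    dates = []
    start_date = datetime(first_year, 1, 1)
    next_year = first_year + 1

    for d in dates_raw:
        if d < start_date:
            # Out of order: push it into the next unused year.
            d = d.replace(year=next_year)
            next_year += 1
        start_date = d
        dates.append(d)

    return dates


# '15 march, 10 feb, 5 jan' with every date initially placed in 2017
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
rolled = roll_years(raw, 2017)
print([d.strftime("%d/%m/%Y") for d in rolled])
# ['15/03/2017', '10/02/2018', '05/01/2019']
```

Because the year counter advances on every out-of-order date, a run of descending dates is spread over consecutive years rather than all landing in the same following year.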

This works as follows:

  1. First create a list of valid month names, i.e. both full and abbreviated.

  2. Make a translation table to make it easy to quickly remove any punctuation from the text.

  3. Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.

  4. Sort the list based on whether or not each part is a digit. This groups months at the front and digits at the end.

  5. Pair up the months and days, padding the shorter list with a fill value. Normalise each month name to its abbreviated form, e.g. August to Aug, and convert each pair into a datetime object.

  6. If a date falls before the previous one, move it forward to the next year (the year counter advances each time this happens).
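
Steps 1 to 4 can be tried on their own to see what the intermediate lists look like. The following is a rough Python 3 sketch, not the answer's exact code: the helper names tokenize and split_parts are made up for illustration, and it assumes an English locale so that the calendar module returns English month names.

```python
import calendar
import re
import string
from itertools import groupby

# Step 1: full and abbreviated month names, lower-cased (empty entries skipped).
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m]

# Step 2: translation table mapping every punctuation character to a space.
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def tokenize(text):
    """Step 3: keep only month names and numbers (ordinal suffixes dropped)."""
    parts = []
    for word in text.translate(remove_punc).split():
        if word.lower() in month_list:
            parts.append(word)
            continue
        number = re.match(r'(\d+)(\b|st|nd|rd|th)', word, re.I)
        if number:
            parts.append(number.group(1))
    return parts


def split_parts(parts):
    """Step 4: sort non-digits first, then group into months and numbers."""
    ordered = sorted(parts, key=lambda p: p.isdigit())
    groups = {is_month: list(g) for is_month, g in groupby(ordered, lambda p: not p.isdigit())}
    return groups.get(True, []), groups.get(False, [])


parts = tokenize('27th Oct 2010 until 1st jan')
months, numbers = split_parts(parts)
print(parts)    # ['27', 'Oct', '2010', '1', 'jan']
print(months)   # ['Oct', 'jan']
print(numbers)  # ['27', '2010', '1']
```

The sort is stable, so months and numbers keep their original relative order within each group, which is what lets them be paired up positionally afterwards.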
