Python - finding date in a string

Question

I want to be able to read a string and return the first date that appears in it. Is there a ready-made module that I can use? I tried to write regexes for all possible date formats, but the result is quite long. Is there a better way to do it?

Recommended answer

You can run a date parser on all subtexts of your text and pick the first date. Of course, such a solution would either catch things that are not dates or miss things that are, or most likely both.
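To make "all subtexts" concrete, here is a minimal sketch of one way to enumerate them: split the text into alternating word and whitespace tokens, then join every contiguous run of tokens. The helper names `tokenize` and `subtexts` are hypothetical and exist only for this illustration; the splitting regex is the same one used in the answer's code.

```python
import re

def tokenize(text):
    # Split into alternating runs of non-whitespace and whitespace,
    # dropping the empty strings re.split leaves between adjacent matches.
    return [t for t in re.split(r'(\S+|\W+)', text) if t]

def subtexts(text, max_tokens=3):
    # Hypothetical helper: yield every contiguous run of up to
    # max_tokens tokens, joined back into a string.
    tokens = tokenize(text)
    for start in range(len(tokens)):
        last = min(start + max_tokens, len(tokens))
        for end in range(start + 1, last + 1):
            yield ''.join(tokens[start:end])

print(tokenize("born on 12 May"))
# ['born', ' ', 'on', ' ', '12', ' ', 'May']
```

Each candidate string produced this way could then be handed to a date parser. The number of candidates grows quickly with the window size, which is one reason the approach catches so many false positives.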

Let me provide an example that uses dateutil.parser to catch anything that looks like a date:

import dateutil.parser
from itertools import chain
import re

# Add more strings that confuse the parser to this list
UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                          dateutil.parser.parserinfo.PERTAIN,
                          ['a']))

def _get_date(tokens):
    # Try the longest prefix of tokens first, then shrink it one token at a time.
    for end in range(len(tokens), 0, -1):
        region = tokens[:end]
        # Skip prefixes made up only of whitespace and filler words.
        if all(token.isspace() or token in UNINTERESTING
               for token in region):
            continue
        text = ''.join(region)
        try:
            date = dateutil.parser.parse(text)
            return end, date
        except (ValueError, OverflowError):
            pass

def find_dates(text, max_tokens=50, allow_overlapping=False):
    # Split into runs of non-whitespace and whitespace; drop empty strings.
    tokens = list(filter(None, re.split(r'(\S+|\W+)', text)))
    skip_dates_ending_before = 0
    for start in range(len(tokens)):
        region = tokens[start:start + max_tokens]
        result = _get_date(region)
        if result is not None:
            # `end` is counted from `start`, not from the beginning of the text.
            end, date = result
            if allow_overlapping or end > skip_dates_ending_before:
                skip_dates_ending_before = end
                yield date

test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a 
child during the Daleks' abduction and invasion of Earth in 2009. 
On 1st July 2058, Bowie Base One became the first Human colony on Mars. It 
was commanded by Captain Adelaide Brooke, and initially seemed to prove that 
it was possible for Humans to live long term on Mars."""

print("With no overlapping:")
for date in find_dates(test, allow_overlapping=False):
    print(date)


print("With overlapping:")
for date in find_dates(test, allow_overlapping=True):
    print(date)

The result from the code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that are nowhere to be seen in the text, and if it is not allowed, you miss the important date in the text.

With no overlapping:
1999-05-12 00:00:00
2009-07-01 20:58:00
With overlapping:
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-12 00:00:00
1999-05-03 00:00:00
1999-05-03 00:00:00
1999-07-03 00:00:00
1999-07-03 00:00:00
2009-07-01 20:58:00
2009-07-01 20:58:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-01 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00
2058-07-03 00:00:00

Essentially, if overlapping is allowed:

  1. "12 May 1999" is parsed to 1999-05-12 00:00:00

  2. "May 1999" is parsed to 1999-05-03 00:00:00 (because today is the 3rd day of the month)

If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00 and no attempt is made to parse the date after the period.
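The day-of-month filling seen in the overlapping output comes from how dateutil completes partial dates: fields missing from the string are taken from a default datetime, which is the current date at midnight unless you pass one. A small sketch, assuming a fixed reference date (`REFERENCE` is a name chosen for this illustration) so the result does not depend on the day the code runs:

```python
from datetime import datetime
from dateutil import parser

# Fixed reference date (an assumption for this sketch); fields missing
# from the parsed string are copied from it.
REFERENCE = datetime(2000, 1, 1)

full = parser.parse("12 May 1999", default=REFERENCE)
partial = parser.parse("May 1999", default=REFERENCE)

print(full)     # 1999-05-12 00:00:00
print(partial)  # 1999-05-01 00:00:00
```

Passing an explicit `default` makes partial matches reproducible, but it does not tell you which fields were actually present in the text, so it does not by itself remove the spurious matches.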
