How to find dates in a sentence using NLP, RegEx in Python

Question

Can anyone suggest a way of finding and parsing dates in Python, in any format (for example "Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06")?

I came across this question, but it is in Perl: "Extract inconsistently formatted date from string (date parsing, NLP)".

Any suggestions would be helpful.

Recommended answer

This finds all the dates in your example sentence:

import re

# Sample text; an assumption for illustration, substitute your own input.
subject = "Seen Aug06 and Aug2006, then August 2 2008 and 19th August 2006."

for match in re.finditer(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b                    # and end at a word boundary.""",
    subject):
    # match.start() is the start index, match.end() the exclusive end index,
    # and match.group() the matched text.
    print(match.start(), match.end(), match.group())

It's definitely not perfect. It is liable to miss some dates, especially if they are not in English ("21. Mai 2006" would fail, as would "4ème décembre 1999"), and it will match nonsense like "August Augst Aug". But since nearly everything is optional in your examples, there is not much more you can do at the regex level.

The next step would be to feed all the matches into a parser and see whether it can parse them into a sensible date.
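One way to sketch that validation step with only the standard library is to try each candidate against a short list of `datetime.strptime` formats. The names `CANDIDATE_FORMATS`, `ORDINAL_RE` and `try_parse` are hypothetical, the format list is an assumption that covers only the examples from the question, and reading "01-08-06" as day-month-year is a guess, not something the question specifies.

```python
import re
from datetime import datetime

# Assumed formats covering the question's examples; extend for your data.
CANDIDATE_FORMATS = (
    "%d %B %Y",   # 19 August 2006 (after the ordinal suffix is removed)
    "%B %d %Y",   # August 2 2008
    "%b%Y",       # Aug2006
    "%b%y",       # Aug06
    "%d-%m-%y",   # 01-08-06, assumed to be day-month-year
    "%m-%y",      # 08-06
)

# Removes st/nd/rd/th only when it directly follows a digit.
ORDINAL_RE = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b", re.IGNORECASE)


def try_parse(text):
    """Return a datetime if any assumed format fits, otherwise None."""
    cleaned = ORDINAL_RE.sub("", text.strip())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None


for candidate in ("19th August 2006", "Aug06", "August Augst Aug"):
    print(repr(candidate), "->", try_parse(candidate))
```

Candidates for which `try_parse` returns `None` can then be discarded as regex false positives. Note that `%B` and `%b` follow the current locale, so this sketch assumes English month names.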

The regex can't interpret context correctly. Imagine a (silly) text like "You'll find it in box 21. August 3rd will be the shipping date." It will match "21. August 3rd", which of course can't be parsed as a date.
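The effect is easy to reproduce. The sketch below writes the answer's pattern in compact form with the usual `\b`, `\d` and `\s` escapes; `DATE_RE` is a hypothetical name, and the `strip()` call is there because the optional trailing separator can pull a space into the match.

```python
import re

# The answer's pattern in compact form; DATE_RE is a hypothetical name.
DATE_RE = re.compile(
    r"\b(?:(?:\d+(?:\.|st|nd|rd|th)*"
    r"|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)"
    r"[\s./-]*){3}\b",
    re.IGNORECASE,
)

text = "You'll find it in box 21. August 3rd will be the shipping date."

# Strip each match, since a trailing separator may be included.
candidates = [m.group().strip() for m in DATE_RE.finditer(text)]
print(candidates)
```

The box number from the first sentence and the date from the second end up in a single candidate, which is why the matches still need to be checked by a parser.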
