How to find dates in a sentence using NLP, RegEx in Python

Question

Can anyone suggest a way of finding and parsing dates in Python (in any format, such as "Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06")?

I came across this question, but it is in Perl: "Extract inconsistently formatted date from string (date parsing, NLP)".

Any suggestion would be helpful.

Recommended answer

This finds all the dates in your example sentence:

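import re

# Sample input built from the formats in the question; replace with your own text.
subject = 'Dates: "Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06"'

for match in re.finditer(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b                    # and end at a word boundary.""",
    subject):
    # match.start() is the start index, match.end() the (exclusive) end index,
    # and match.group() the matched text (it may include trailing separators).
    print(match.start(), match.end(), repr(match.group()))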

It's definitely not perfect. It is liable to miss some dates (especially if they are not in English: "21. Mai 2006" would fail, as would "4ème décembre 1999") and to match nonsense like "August Augst Aug". But since nearly everything is optional in your examples, there is not much you can do at the regex level.
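
Both failure modes are easy to reproduce. The sketch below wraps the same pattern in a helper (the name find_candidates is made up for illustration) and trims the trailing separators the pattern can swallow. Note that the German date is not found as a whole; only the year is picked up, because the pattern can split any run of three or more digits into three "numbers".

```python
import re

# Same pattern as in the answer, compiled once for reuse.
DATE_CANDIDATE = re.compile(r"""(?ix)
    \b
    (?:
      (?:
        \d+ (?:\.|st|nd|rd|th)*
        |
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
      )
      [\s./-]*
    ){3}
    \b
""")

def find_candidates(text):
    """Return the matched substrings with trailing separators removed."""
    return [m.group().rstrip(" \t\n./-") for m in DATE_CANDIDATE.finditer(text)]

print(find_candidates("August Augst Aug"))  # nonsense, but it matches in full
print(find_candidates("21. Mai 2006"))      # only the year survives
```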

The next step would be to feed all the matches into a parser and see if it can parse them into a sensible date.
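
One possible way to do that validation with only the standard library is to try a list of datetime.strptime formats against each candidate. This is a rough sketch, not the answer's own method: the format list, the helper names strip_ordinals and try_parse, and the decision to read "08-06" as month-year are all assumptions made for illustration, and %b/%B assume an English locale.

```python
import re
from datetime import datetime

# Hypothetical format list covering the examples in the question.
# Order matters: the first format that fits wins.
CANDIDATE_FORMATS = [
    "%b%y",      # Aug06
    "%b%Y",      # Aug2006
    "%B %d %Y",  # August 2 2008
    "%d %B %Y",  # 19th August 2006 (after removing "th")
    "%m-%y",     # 08-06, assumed to mean month-year
    "%d-%m-%y",  # 01-08-06, assumed to mean day-month-year
]

def strip_ordinals(text):
    """Turn '19th' into '19' so strptime can read the day number."""
    return re.sub(r"(?i)(\d+)(?:st|nd|rd|th)\b", r"\1", text)

def try_parse(candidate):
    """Return a datetime for the first matching format, or None."""
    cleaned = strip_ordinals(candidate.strip())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None

for text in ["Aug06", "19th August 2006", "01-08-06", "August Augst Aug"]:
    print(repr(text), "->", try_parse(text))
```

Candidates that come back as None can simply be dropped. Formats without a day or a year fall back to strptime's defaults (day 1, year 1900), so keep that in mind before trusting those fields.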

The regex can't interpret context correctly. Imagine a (stupid) text like "You'll find it in box 21. August 3rd will be the shipping date." It will match "21. August 3rd", which of course can't be parsed as a sensible date.
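
The false positive can be reproduced in a few lines. The snippet below is only an illustration; the names DATE_CANDIDATE and TEXT are made up here, and the pattern is the one from the answer in compact form.

```python
import re

DATE_CANDIDATE = re.compile(r"""(?ix)
    \b
    (?:
      (?: \d+ (?:\.|st|nd|rd|th)*
        | (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
      )
      [\s./-]*
    ){3}
    \b
""")

TEXT = "You'll find it in box 21. August 3rd will be the shipping date."

# The match runs across the sentence boundary after "21." because a dot
# followed by whitespace is also a valid date separator for the pattern.
matches = [m.group().strip() for m in DATE_CANDIDATE.finditer(TEXT)]
print(matches)
```

The match also carries a trailing space before "will", which is why the sketch strips each result.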
