Best way to identify and extract dates from text in Python?

Question

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.

For example, I have a large list of strings (that usually take the form of English sentences or statements) that take a variety of forms:

Central design committee session Tuesday 10/22 6:30 pm

Th 9/19 LAB: Serial encoding (Section 2.2)

There will be another one on December 15th for those who are unable to make it today.

Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm

He will be flying in Sept. 15th.

While these dates are in-line with natural text, none of them are written in natural-language form themselves (e.g., there's no "The meeting will be two weeks from tomorrow"); every date is explicit.

As someone who doesn't have too much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.

Because of this, is there any good way to extract the date and the extraneous text

input:  Th 9/19 LAB: Serial encoding (Section 2.2)
output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']

or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?

Recommended answer

I was also looking for a solution to this and couldn't find any, so a friend and I built a tool to do this. I thought I would come back and share in case others found it helpful.

datefinder -- find and extract dates inside text

Here is an example:

import datefinder

string_with_dates = '''
    Central design committee session Tuesday 10/22 6:30 pm
    Th 9/19 LAB: Serial encoding (Section 2.2)
    There will be another one on December 15th for those who are unable to make it today.
    Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
    He will be flying in Sept. 15th.
    We expect to deliver this between late 2021 and early 2022.
'''

# find_dates returns a generator that yields one datetime per date it finds
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
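
If you also want the ['Th 9/19', 'LAB: ...'] style split from the question, here is a rough standard-library sketch. To be clear, this is not how datefinder works internally; it is a hand-rolled regex that only covers the explicit formats shown in the question (optional weekday, then M/D or a month name plus day, then an optional time). The names WEEKDAY, MONTH, DATE_RE and split_date are made up for this illustration.

```python
import re

# Hypothetical, deliberately narrow patterns: they only cover the explicit
# formats from the question and will miss (or over-match) plenty of real text.
WEEKDAY = (r"(?:Mon(?:day)?|Tue(?:s(?:day)?)?|Wed(?:nesday)?|"
           r"Th(?:u(?:r(?:s(?:day)?)?)?)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)\.?")
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?")
NUMERIC = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?"            # 9/19 or 9/19/2013
NAMED = MONTH + r"\s+\d{1,2}(?:st|nd|rd|th)?\b"       # December 15th, Sept. 15th
TIME = r"(?:\s+\d{1,2}:\d{2}(?:\s*[ap]m\b)?)?"        # optional 6:30 pm / 11:59pm

DATE_RE = re.compile(
    r"\b(?:" + WEEKDAY + r"\s+)?(?:" + NUMERIC + "|" + NAMED + ")" + TIME,
    re.IGNORECASE,
)


def split_date(line):
    """Return (date_text, remaining_text); date_text is None if nothing matched."""
    match = DATE_RE.search(line)
    if match is None:
        return None, line.strip()
    before, after = line[:match.start()], line[match.end():]
    # Stitch the text around the match back together and collapse whitespace.
    rest = " ".join((before + " " + after).split())
    return match.group(0), rest


if __name__ == "__main__":
    for text in [
        "Th 9/19 LAB: Serial encoding (Section 2.2)",
        "Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm",
    ]:
        print(split_date(text))
```

Only the first date on each line is pulled out, and the extracted fragment is still a string, so you would hand it to something like dateutil.parser or parsedatetime afterwards to get an actual datetime.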
