Remove recognized date from string

Question

As input I have several strings containing dates in different formats, like

  • "Peter drinks tea at 16:45"
  • "My birthday is on 08-07-1990"
  • "On Sat 11 July I'll be back home"

I use dateutil.parser.parse to recognize the dates in the strings.
In the next step I want to remove the dates from the strings. The result should be

  • "Peter drinks tea at"
  • "My birthday is on"
  • "I'll be back home"

Is there a simple way to achieve this?

Recommended answer

You can use the fuzzy_with_tokens option of dateutil.parser.parse:

from dateutil.parser import parse

dtstrs = [
    "Peter drinks tea at 16:45",
    "My birthday is on 08-07-1990",
    "On Sat 11 July I'll be back home",
    ]

out = [
    parse(dtstr, fuzzy_with_tokens=True)
    for dtstr in dtstrs
]

Result:

[(datetime.datetime(2018, 7, 17, 16, 45), ('Peter drinks tea at ',)),
 (datetime.datetime(1990, 8, 7, 0, 0), ('My birthday is on ',)),
 (datetime.datetime(2018, 7, 11, 0, 0), ('On ', ' ', " I'll be back home"))]

When fuzzy_with_tokens is true, the parser returns a tuple of a datetime and a tuple of ignored tokens (with the used tokens removed). You can join them back into a string like this:

>>> ['<missing>'.join(x[1]) for x in out]
['Peter drinks tea at ',
 'My birthday is on ',
 "On <missing> <missing> I'll be back home"]
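To get closer to the output asked for in the question, the ignored tokens can be joined without a placeholder and the leftover whitespace collapsed. The following is a minimal sketch, not part of the original answer: the helper name remove_date is hypothetical, and falling back to the unchanged input when no date is found is an assumption about the desired behavior.

```python
from dateutil.parser import parse


def remove_date(text):
    """Return text with the tokens dateutil used for the date removed.

    Hypothetical helper for illustration only.
    """
    try:
        # With fuzzy_with_tokens=True, parse returns (datetime, ignored_tokens).
        _, tokens = parse(text, fuzzy_with_tokens=True)
    except (ValueError, OverflowError):
        # Assumption: if no date can be parsed, keep the string as it is.
        return text
    # Join the ignored tokens and collapse the whitespace left by the gaps.
    return " ".join("".join(tokens).split())


if __name__ == "__main__":
    for s in [
        "Peter drinks tea at 16:45",
        "My birthday is on 08-07-1990",
        "On Sat 11 July I'll be back home",
    ]:
        print(repr(remove_date(s)))
```

Note that a leading word such as "On" is kept, because the parser treats it as an ignored token rather than as part of the date.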

I'll note that the fuzzy parsing logic is not amazingly reliable, because it's very difficult to pick out only valid components from a string and use them. If you change the person drinking tea to someone named April, for example:

>>> dt, tokens = parse("April drinks tea at 16:45", fuzzy_with_tokens=True)
>>> print(dt)
2018-04-17 16:45:00
>>> print('<missing>'.join(tokens))
 drinks tea at 

So I would urge some caution with this approach (though I can't really recommend a better one; this is just a hard problem).
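If the date formats are known in advance, one more conservative option is to remove only text that matches explicit patterns, so that a name like April is left alone. This is a swapped-in technique, not something the answer above proposes: the patterns and the name strip_known_dates are hypothetical, and the sketch covers only the HH:MM and DD-MM-YYYY shapes from the question.

```python
import re

# Hypothetical patterns: only two of the formats from the question are covered.
KNOWN_DATE_PATTERNS = [
    r"\b\d{1,2}:\d{2}\b",      # times such as 16:45
    r"\b\d{2}-\d{2}-\d{4}\b",  # dates such as 08-07-1990
]
_DATE_RE = re.compile("|".join(KNOWN_DATE_PATTERNS))


def strip_known_dates(text):
    """Remove substrings matching the known patterns and tidy whitespace."""
    without_dates = _DATE_RE.sub(" ", text)
    return " ".join(without_dates.split())


if __name__ == "__main__":
    print(repr(strip_known_dates("April drinks tea at 16:45")))
    print(repr(strip_known_dates("My birthday is on 08-07-1990")))
```

The trade-off is the reverse of fuzzy parsing: nothing is removed by accident, but any format without a pattern (for example "Sat 11 July") is silently left in place.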
