How to parse multiple dates from a block of text in Python (or another language)

Question

I have a string that has several date values in it, and I want to parse them all out. The string is natural language, so the best thing I've found so far is dateutil.

Unfortunately, if a string has multiple date values in it, dateutil throws an error:

>>> s = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"
>>> parse(s, fuzzy=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 697, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/usr/lib/pymodules/python2.7/dateutil/parser.py", line 303, in parse
    raise ValueError, "unknown string format"
ValueError: unknown string format

Any thoughts on how to parse all dates from a long string? Ideally, a list would be created, but I can handle that myself if I need to.

I'm using Python, but at this point other languages are probably OK if they get the job done.

PS - I guess I could recursively split the input file in the middle and try, try again until it works, but it's a hell of a hack.

Recommended answer

Looking at it, the least hacky way would be to modify the dateutil parser to have a fuzzy-multiple option.

parser._parse takes your string, tokenizes it with _timelex, and then compares the tokens with data defined in parserinfo.

Here, if a token doesn't match anything in parserinfo, the parse will fail unless fuzzy is True.

What I suggest is that you allow non-matches while you don't have any processed time tokens. Then, when you hit a non-match, process the parsed data at that point and start looking for time tokens again.

It shouldn't take too much effort.

Update

While you're waiting for your patch to get rolled in...

This is a little hacky and uses non-public functions in the library, but it doesn't require modifying the library and is not trial-and-error. You might get false positives if you have any lone tokens that can be turned into floats, so you might need to filter the results some more.

from dateutil.parser import _timelex, parser

a = "I like peas on 2011-04-23, and I also like them on easter and my birthday, the 29th of July, 1928"

p = parser()
info = p.info

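def timetoken(token):
  # Anything that converts to a float is treated as a possible date component.
  try:
    float(token)
    return True
  except ValueError:
    pass
  # Otherwise, check the token against the parserinfo lookups.
  # Note: a lookup that returns index 0 (e.g. info.weekday("Monday")) is falsy
  # and is therefore not counted as a match here.
  return any(f(token) for f in (info.jump,info.weekday,info.month,info.hms,info.ampm,info.pertain,info.utczone,info.tzoffset))

def timesplit(input_string):
  # Collect consecutive time tokens into a batch; a non-time token ends the batch.
  batch = []
  for token in _timelex(input_string):
    if timetoken(token):
      if info.jump(token):
        # Jump tokens (separators, "on", "and", ...) are skipped without ending the batch.
        continue
      batch.append(token)
    else:
      if batch:
        yield " ".join(batch)
        batch = []
  # Emit whatever is left once the input is exhausted.
  if batch:
    yield " ".join(batch)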

for item in timesplit(a):
  print "Found:", item
  print "Parsed:", p.parse(item)

Yields:

Found: 2011 04 23
Parsed: 2011-04-23 00:00:00
Found: 29 July 1928
Parsed: 1928-07-29 00:00:00
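
The false positives mentioned above come from lone numeric tokens. As a rough sketch of the extra filtering, one possible heuristic is to drop any candidate that consists of a single float-like token before handing it to the parser. The helper name `is_lone_number` and the sample candidate list are assumptions made for illustration; they are not part of dateutil or of the original answer.

```python
def is_lone_number(candidate):
    """Return True if the candidate is a single token that converts to float.

    Assumed heuristic: a lone number such as "3" is more likely a quantity
    than a date, so it is treated as a false positive.
    """
    tokens = candidate.split()
    if len(tokens) != 1:
        return False
    try:
        float(tokens[0])
    except ValueError:
        return False
    return True


# Hypothetical candidates, shaped like the strings timesplit() yields.
candidates = ["2011 04 23", "3", "29 July 1928", "42.5"]
kept = [c for c in candidates if not is_lone_number(c)]
print(kept)  # ['2011 04 23', '29 July 1928']
```

This is deliberately conservative: it only removes single-number candidates, so a stricter rule (for example, requiring a recognizable year or month) may still be needed for noisy text.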


Update for Dieter

Dateutil 2.1 appears to be written for compatibility with Python 3 and uses a "compatibility" library called six. Something isn't right with it, and it's not treating str objects as text.

This solution works with dateutil 2.1 if you pass strings as unicode or as file-like objects:

from cStringIO import StringIO
for item in timesplit(StringIO(a)):
  print "Found:", item
  print "Parsed:", p.parse(StringIO(item))

If you want to set options on the parserinfo, instantiate a parserinfo and pass it to the parser object. E.g.:

from dateutil.parser import _timelex, parser, parserinfo
info = parserinfo(dayfirst=True)
p = parser(info)
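
To see what such an option changes, here is a small sketch comparing a default parser with one built from parserinfo(dayfirst=True) on an ambiguous date. The sample string and variable names are illustrative assumptions, and the example assumes dateutil is installed.

```python
from dateutil.parser import parser, parserinfo

# An ambiguous date: it could be January 2 or February 1.
ambiguous = "01/02/2003"

default_parser = parser()
dayfirst_parser = parser(parserinfo(dayfirst=True))

month_first = default_parser.parse(ambiguous)
day_first = dayfirst_parser.parse(ambiguous)

print(month_first.date())  # 2003-01-02 (month read first by default)
print(day_first.date())    # 2003-02-01 (day read first)
```

The same parser object can then be used in place of p in the loop over timesplit(a), so every extracted candidate is interpreted with the day-first convention.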
