Extract inconsistently formatted date from string (date parsing, NLP)


Problem description

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".

In short, the dates are normally incomplete, sometimes not there, are inconsistently formatted and are embedded in a string with other information, e.g. "Report Aug06.xls".

Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)

Solution

Date::Parse is definitely going to be part of your answer: it is the bit that takes a randomly formatted date-like string and makes an actual usable date out of it.

The other part of your problem, the rest of the characters in your filenames, is unusual enough that you're unlikely to find that someone else has packaged up a module for you.

Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.
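One way to find such "date section" candidates, as a rough sketch only: pull out every run of digit groups and month names joined by separators, and keep the runs that contain at least one digit. The helper name `date_candidates`, the separator set and the month-name list are assumptions made for illustration, not something taken from a module.

```perl
use strict;
use warnings;

# English month names and their usual abbreviations (assumed list).
my $month = qr/jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?
              |aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?/xi;

# A token is a digit group (optionally "11th"-style) or a month name.
my $token = qr/\d+(?:st|nd|rd|th)?|$month/i;

# Hypothetical helper: return the substrings of a filename that could hold a date.
sub date_candidates {
    my ($name) = @_;
    (my $stem = $name) =~ s/\.[A-Za-z]\w*$//;    # drop a trailing extension, if any
    my @found;
    while ($stem =~ /((?:$token)(?:[-_.\s]*(?:$token))*)/g) {
        my $candidate = $1;
        # A month name with no digits nearby (e.g. "mar" in "Summary") is noise.
        push @found, $candidate if $candidate =~ /\d/;
    }
    return @found;
}

for my $file ('Report Aug06.xls', 'Report 11th September 2006.xls',
              'End-of-month Report01-08-06.xls', 'Report2006') {
    print "$file => ", join(' | ', date_candidates($file)), "\n";
}
```

Each candidate can then be handed to a parser or shown to the person doing the checking. Unrelated numbers such as "20202010" still come through as candidates at this stage, so a later step has to decide whether they look like dates.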

Here's a nasty brute-force example using Date::Parse. A smarter approach would use a list of regexes to try to identify the date bits, but I'm happy to burn CPU cycles so I don't have to think quite so hard!

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files = ("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls",
             "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - longest likely date string is something like '11th September 2006' - 19 chars
# shortest is "2006" - 4 chars.
# brute force all strings from 19 down to 4 chars long at the end of the filename (less extension)
# report the longest thing that Date::Parse recognises as a date

foreach my $file (@files) {
  # chop the extension if there is one (only the part after the last dot)
  (my $stem = $file) =~ s/\.[^.]*$//;
  for my $len (-19 .. -4) {
    # a negative offset counts from the end of the string
    my $string = substr($stem, $len);
    my $time = str2time($string);
    # str2time returns undef when it cannot parse the string
    next unless defined $time;
    print "$string is a date: $time = ", scalar(localtime($time)), "\n";
    last;
  }
}
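For comparison, the regex-list approach mentioned above could look something like the following. This is only a sketch using core Perl: the pattern list, the day-month-year reading of numeric dates, the two-digit-year pivot at 70 and the helper names `guess_date` and `full_year` are all assumptions made for illustration, and the patterns would need tuning against the real filenames.

```perl
use strict;
use warnings;

my %MONTH_NUM;
@MONTH_NUM{qw(jan feb mar apr may jun jul aug sep oct nov dec)} = 1 .. 12;

my $MONTH = qr/(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?
               |aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)/xi;
my $SEP  = qr/[-_.\s]/;
my $YEAR = qr/(\d{4}|\d{2})(?!\d)/;    # 4- or 2-digit year, not followed by a digit

# Most specific pattern first. Each callback maps captures to (year, month, day).
my @PATTERNS = (
    # 11th September 2006
    [ qr/(?<!\d)(\d{1,2})(?:st|nd|rd|th)?$SEP*$MONTH$SEP*$YEAR/i,
      sub { my ($d, $m, $y) = @_; ($y, $MONTH_NUM{ lc substr($m, 0, 3) }, $d) } ],
    # Aug06, Aug2006, August 2006
    [ qr/$MONTH$SEP*$YEAR/,
      sub { my ($m, $y) = @_; ($y, $MONTH_NUM{ lc substr($m, 0, 3) }) } ],
    # 01-08-06, read as day-month-year (assumption)
    [ qr/(?<!\d)(\d{2})$SEP(\d{2})$SEP$YEAR/,
      sub { my ($d, $m, $y) = @_; ($y, $m, $d) } ],
    # 011004, read as ddmmyy (assumption)
    [ qr/(?<!\d)(\d{2})(\d{2})(\d{2})(?!\d)/,
      sub { my ($d, $m, $y) = @_; ($y, $m, $d) } ],
    # 08-06, read as month-year (assumption)
    [ qr/(?<!\d)(\d{2})$SEP$YEAR/,
      sub { my ($m, $y) = @_; ($y, $m) } ],
    # 2006 on its own
    [ qr/(?<!\d)((?:19|20)\d{2})(?!\d)/,
      sub { ($_[0]) } ],
);

# Expand a two-digit year; the pivot at 70 is an arbitrary assumption.
sub full_year {
    my ($y) = @_;
    return $y + 0 if length($y) == 4;
    return $y < 70 ? 2000 + $y : 1900 + $y;
}

# Return 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD', or undef when nothing plausible matches.
sub guess_date {
    my ($name) = @_;
    (my $stem = $name) =~ s/\.[A-Za-z]\w*$//;    # drop a trailing extension, if any
    for my $pattern (@PATTERNS) {
        my ($re, $to_ymd) = @$pattern;
        my @captures = $stem =~ $re;
        next unless @captures;
        my ($y, $m, $d) = $to_ymd->(@captures);
        next if defined $m && ($m < 1 || $m > 12);    # reject impossible months
        next if defined $d && ($d < 1 || $d > 31);    # reject impossible days
        my $out = sprintf '%04d', full_year($y);
        $out .= sprintf '-%02d', $m if defined $m;
        $out .= sprintf '-%02d', $d if defined $d;
        return $out;
    }
    return;
}

for my $file ('Report Aug06.xls', 'Annual Report-08-06', '011004', '20202010') {
    my $guess = guess_date($file);
    printf "%-22s %s\n", $file, defined $guess ? $guess : '(no guess)';
}
```

Because each numeric pattern refuses to match inside a longer run of digits, a name like "20202010" falls through every pattern and gets no guess, which is probably the right outcome for something a human will review anyway. The trade-off against the brute-force version is that every new format needs its own pattern.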
