Extract inconsistently formatted date from string (date parsing, NLP)


Question

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that, some filenames have unrelated numbers that look somewhat like dates, e.g. "20202010".

In short, the dates are normally incomplete, sometimes missing altogether, inconsistently formatted, and embedded in a string with other information, e.g. "Report Aug06.xls".

Are there any Perl modules available which will do a decent job of guessing the date from such a string? It doesn't have to be 100% correct, as it will be verified by a human manually, but I'm trying to make things as easy as possible for that person and there are thousands of entries to check :)

Recommended answer

Date::Parse is definitely going to be part of your answer - the bit that takes a randomly formatted date-like string and makes an actual usable date out of it.

The other part of your problem - the rest of the characters in your filenames - is unusual enough that you're unlikely to find someone else has packaged up a module for you.

Without seeing more of your sample data, it's really only possible to guess, but I'd start by identifying possible or likely "date section" candidates.

Here's a nasty brute-force example using Date::Parse (a smarter approach would use a list of regexes to try to identify the date bits - I'm happy to burn CPU cycles to avoid thinking quite so hard though!)

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse;

my @files=("Report Aug06.xls", "ReportAug2006", "Report 11th September 2006.xls", 
           "Annual Report-08-06", "End-of-month Report01-08-06.xls", "Report2006");

# assumption - longest likely date string is something like '11th September 2006' - 19 chars
# shortest is "2006" - 4 chars.
# brute force all strings from 19 down to 4 chars long at the end of the filename (less extension)
# and report the longest thing that Date::Parse recognises as a date

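foreach my $file (@files){
  # chop the extension if there is one (work on a copy so @files is left untouched)
  (my $name = $file) =~ s/\.[^.]*$//;
  for my $len (-19..-4){
    next if -$len > length $name;   # candidate would be longer than the name itself
    my $string = substr($name, $len);
    my $time = str2time($string);
    next unless defined $time;      # not recognised as a date, try a shorter tail
    print "$string is a date: $time = ", scalar(localtime($time)), "\n";
    last;                           # longest match found, stop here
    }
  }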
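
For comparison, the "list of regexes" approach mentioned above could look roughly like the sketch below. It is not part of the original answer and makes several assumptions that would need checking against the real filenames: numeric dates are read as day-month-year (or month-year), two-digit years below 50 are taken as 20xx, and the names `guess_date`, `full_year`, `@PATTERNS` and `%MONTH` are all made up for illustration. It uses only core Perl and returns a partial ISO-style string ("2006", "2006-08" or "2006-08-01") rather than an epoch time, so incomplete dates stay visibly incomplete for the person verifying them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month abbreviations mapped to month numbers (hypothetical helper table).
my %MONTH = (jan => 1, feb => 2, mar => 3, apr => 4,  may => 5,  jun => 6,
             jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12);

# Short or full English month name, captured as one group.
my $MON = qr/(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?
             |aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)/ix;

# Assumption: two-digit years 00-49 mean 20xx, 50-99 mean 19xx.
sub full_year {
    my ($y) = @_;
    return $y if $y > 99;
    return $y < 50 ? 2000 + $y : 1900 + $y;
}

# Patterns from most to least specific. Each builder gets the captured
# groups and returns (year, month, day), with undef for missing parts.
my @PATTERNS = (
    # "11th September 2006"
    [ qr/(?<!\d)(\d{1,2})(?:st|nd|rd|th)?\s*$MON\s*(\d{4}|\d{2})(?!\d)/i,
      sub { (full_year($_[2]), $MONTH{ lc substr($_[1], 0, 3) }, $_[0] + 0) } ],
    # "Aug06", "Aug2006", "August 2006"
    [ qr/$MON\s*(\d{4}|\d{2})(?!\d)/i,
      sub { (full_year($_[1]), $MONTH{ lc substr($_[0], 0, 3) }, undef) } ],
    # "01-08-06", assumed to be day-month-year
    [ qr/(?<!\d)(\d{2})-(\d{2})-(\d{2})(?!\d)/,
      sub { (full_year($_[2]), $_[1] + 0, $_[0] + 0) } ],
    # "08-06", assumed to be month-year
    [ qr/(?<!\d)(\d{2})-(\d{2})(?!\d)/,
      sub { (full_year($_[1]), $_[0] + 0, undef) } ],
    # bare four-digit year such as "2006"; longer digit runs like
    # "20202010" are rejected by the lookarounds
    [ qr/(?<!\d)((?:19|20)\d{2})(?!\d)/,
      sub { ($_[0], undef, undef) } ],
);

# Return the first plausible guess as a string, or nothing if no pattern fits.
sub guess_date {
    my ($name) = @_;
    for my $p (@PATTERNS) {
        my ($re, $build) = @$p;
        my @caps = $name =~ $re or next;
        my ($y, $m, $d) = $build->(@caps);
        next if defined $m && ($m < 1 || $m > 12);   # implausible month
        next if defined $d && ($d < 1 || $d > 31);   # implausible day
        my @parts = (sprintf '%04d', $y);
        push @parts, sprintf('%02d', $m) if defined $m;
        push @parts, sprintf('%02d', $d) if defined $d;
        return join '-', @parts;
    }
    return;
}

# Only print the demo when run as a script, not when loaded from a test.
unless (caller) {
    for my $file ("Report Aug06.xls", "Report 11th September 2006.xls",
                  "End-of-month Report01-08-06.xls", "20202010") {
        my $guess = guess_date($file);
        printf "%-35s => %s\n", $file, defined $guess ? $guess : 'no date found';
    }
}
```

The trade-off against the brute-force loop is the usual one: the regex list only recognises formats someone has thought to write down (an all-digit form like "011004" is deliberately left unmatched here because it is ambiguous), but it is much less likely to mistake an unrelated number for a date.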
