How do I extract a date, say, "January 16, 2016", from a large chunk of text with regex?
Question
Hi,
I am OCRing some bills from my scanner with a well-known OCR library. It's very good, and returns all the text it finds as a big string filled with the OCR'd text.
Near the top of the bill, there is a line that says
January 16, 2016
I have tried splitting the output into lines, but the date is on a different line for each bill. It is always in the same <long month> <day number>, <four-digit year> format.
What is a Regex I can use to munch on the text, and pick out the date in that format?
What I have tried:
I've searched Google over and over, but I am probably not using the right search terms. Any tips would help!
Recommended answer
First thing, grab a copy of Expresso.
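Your regex will probably wind up something like
/(?:January|February|....|December)\s*\d{1,2}\s*,\s*\d{4}/
The month names are wrapped in a group so that the day and year parts apply to every month, not just to December.
I've thrown in the \s*s so it will be tolerant of variable amounts of whitespace (from my experience of OCR).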
Feel free to munch this up however you like.
edit: oops! removed spurious []
edit2: corrected spelling of Expresso
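For what it's worth, here is a minimal Python sketch of the same idea, assuming the OCR output is already in a plain string. The names MONTHS, DATE_RE, find_dates and the sample text are made up for illustration and are not part of any OCR library. It spells out all twelve month names, groups them so the day and year apply to every month, and turns each hit into a date object so impossible dates get dropped.

```python
import re
from datetime import date

# Full English month names, in calendar order (index + 1 == month number).
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Month name, optional whitespace, 1-2 digit day, comma, 4-digit year.
# The month alternation is grouped so the rest of the pattern applies to all months.
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s*(\d{1,2})\s*,\s*(\d{4})\b"
)


def find_dates(text):
    """Return every '<long month> <day>, <year>' date found in text."""
    found = []
    for month, day, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS.index(month) + 1, int(day)))
        except ValueError:
            # Skip matches that are not real dates, e.g. "February 31, 2016".
            continue
    return found


# Hypothetical OCR output with uneven whitespace around the date.
sample = "ACME Utilities\nAccount 12345\nJanuary  16 , 2016\nAmount due: $42.00"
print(find_dates(sample))  # prints [datetime.date(2016, 1, 16)]
```

The pattern is case-sensitive as written. If your OCR sometimes returns month names in all caps, you would need to relax that and normalise the month before looking it up.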