How to extract relevant information from a receipt


Question

I am trying to extract information from a range of different receipts using a combination of OpenCV, Tesseract and Keras. The end result of the project is that I should be able to take a picture of a receipt with a phone and, from that picture, get the store name, payment type (card or cash), amount paid and change tendered.

So far I have done a few different preprocessing steps on a series of sample receipts using OpenCV, such as removing the background, denoising and converting to a binary image, and am left with an image such as the following:

I am then using Tesseract to perform OCR on the receipt and write the results out to a text file. I have managed to get the OCR to perform at an acceptable level, so I can currently take a picture of a receipt, run my program on it, and get a text file containing all the text on the receipt.

My problem is that I don't want all of the text on the receipt, just certain information such as the parameters listed above. I am unsure how to go about training a model that will extract the data I need.

Am I correct in thinking that I should use Keras to segment and classify different sections of the image, and then write to file the text from the sections my model has classified as containing relevant data? Or is there a better solution for what I need to do?

Sorry if this is a stupid question; this is my first OpenCV/machine learning project and I'm pretty far out of my depth. Any constructive criticism would be much appreciated.

Answer

My answer isn't as fancy as what's in fashion right now, but I think it works in your case, especially if this is for a product (not for research and publication purposes).

I would implement the paper Text/Graphics Separation Revisited. I have already implemented it in both MATLAB and C++, and from your description I guarantee it won't take you long. In summary:

  1. Get all connected components with their stats. You're especially interested in the bounding box of each character.

  2. The paper obtains thresholds from histograms of the properties of your connected components, which makes it reasonably robust. Using these thresholds (which work surprisingly well) on the geometric properties of your connected components, discard anything that's not a character.

  3. For your characters, get the centroid of each bounding box and group the close centroids by your own criteria (height, vertical position, Euclidean distance, etc.). Use the resulting centroid clusters to create rectangular text regions.

  4. Associate text regions of the same height and vertical position.

  5. Run OCR on your text regions and look for keywords like "Cash". I honestly think you can get away with dictionaries kept in plain text files, and from having done computer vision for mobile I know your resources are limited (by privacy concerns, too).

I honestly don't think a neural net will be much better than some kind of keyword matching (e.g. using Levenshtein distance or something similar to add a bit of robustness), because you would need to manually collect and label those words anyway to build your training dataset, so... why not just write them down instead?
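A minimal sketch of that kind of fuzzy keyword matching; the keyword list and the edit-distance cutoff are illustrative choices:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Dictionary of receipt keywords; in practice this could live in a text file
KEYWORDS = {"CASH", "CARD", "TOTAL", "CHANGE"}

def match_keyword(token: str, max_dist: int = 1):
    """Return the keyword within max_dist edits of token, or None."""
    token = token.upper()
    for kw in KEYWORDS:
        if levenshtein(token, kw) <= max_dist:
            return kw
    return None

print(match_keyword("CASM"))   # OCR misread one letter → CASH
```

Allowing one or two edits absorbs the occasional character Tesseract gets wrong without any training data.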

That's basically it. You end up with something very fast (especially if you want to run on a phone and can't send images to a server) and it just works. No machine learning needed, so no dataset needed either.

But if this is for school... sorry if that was rude. Use TensorFlow with 10,000 manually labelled receipt images and a natural-language-processing approach, and your professor will be delighted.
