Preprocessing images for Tesseract OCR with OpenCV


Question

I'm trying to develop an app that uses Tesseract to recognize text from documents photographed with a phone's camera. I'm using OpenCV to preprocess the image for better recognition, applying a Gaussian blur and a threshold method for binarization, but the result is pretty bad.

What other filters can I use to make the image more readable for Tesseract?

Recommended answer

I described some tips for preparing images for Tesseract here: Using tesseract to recognize license plates

In your example, there are several things going on...

You need to get the text to be black and the rest of the image white (not the reverse). That's what character recognition is tuned for. Grayscale is OK, as long as the background is mostly full white and the text mostly full black; the edges of the text may be gray (antialiased), and that may help recognition (but not necessarily - you'll have to experiment).
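The polarity requirement can be checked automatically. The following is a minimal sketch, assuming an 8-bit grayscale image held in a NumPy array and assuming text covers less than half of the page; the function name `ensure_dark_text_on_light` and the cut-off of 128 are illustrative choices, not part of Tesseract or OpenCV.

```python
import numpy as np

def ensure_dark_text_on_light(gray):
    """Return `gray` with dark text on a light background.

    Assumption: text covers less than half of the page, so the median
    grey level belongs to the background.
    """
    if np.median(gray) < 128:   # background is dark -> polarity is reversed
        return 255 - gray       # invert the 8-bit image
    return gray

# Synthetic example: one white stroke on a black page.
page = np.zeros((40, 120), dtype=np.uint8)
page[10:30, 20:24] = 255
fixed = ensure_dark_text_on_light(page)
```

An image that already has the right polarity passes through unchanged, so the check is safe to run on every input.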

One of the issues you're seeing is that in some parts of the image the text is really "thin" (and gaps in the letters show up after thresholding), while in other parts it is really "thick" (and letters start merging). Tesseract won't like that :) It happens because the input image is not evenly lit, so a single threshold doesn't work everywhere. The solution is to do "locally adaptive thresholding", where a different threshold is calculated for each neighborhood of the image. There are many ways of doing that, but check out for example:
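To make the idea concrete, here is a minimal sketch of one simple variant (not the reference the answer pointed to): each pixel is compared against the mean of its own neighborhood minus a small offset. It is written in plain NumPy so the arithmetic is visible; the names `local_mean` and `adaptive_threshold` and the values `block=31` and `offset=10` are assumptions chosen for illustration. Library routines such as `cv2.adaptiveThreshold` or `skimage.filters.threshold_local` cover the same ground.

```python
import numpy as np

def local_mean(gray, block):
    """Mean of the block x block window centred on each pixel (edges replicated)."""
    if block % 2 == 0:
        raise ValueError("block must be odd")
    pad = block // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Summed-area table with a leading row and column of zeros.
    sat = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    total = (sat[block:block + h, block:block + w]
             - sat[:h, block:block + w]
             - sat[block:block + h, :w]
             + sat[:h, :w])
    return total / (block * block)

def adaptive_threshold(gray, block=31, offset=10):
    """White (255) where a pixel is brighter than its local mean minus offset."""
    mean = local_mean(gray, block)
    return np.where(gray > mean - offset, 255, 0).astype(np.uint8)

# Synthetic page: brightness ramps from 100 to 250, with darker strokes on top.
h, w = 64, 256
background = np.tile(np.linspace(100, 250, w), (h, 1))
columns = np.arange(w) % 16
text_mask = np.zeros((h, w), dtype=bool)
text_mask[16:48, (columns == 8) | (columns == 9)] = True
gray = np.where(text_mask, background - 60, background).astype(np.uint8)

binary = adaptive_threshold(gray)
```

On this synthetic page a single global cut at 128 would turn the dim left-hand background black and miss the strokes on the bright right-hand side, which is the thin/thick symptom described above. The block size should be larger than a stroke but smaller than the scale over which the lighting changes, and needs tuning per image.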

Another problem you have is that the lines aren't straight. In my experience Tesseract can handle a very limited degree of non-straight lines (a few percent of perspective distortion, tilt or skew), but it doesn't really work with wavy lines. If you can, make sure that the source images have straight lines :) Unfortunately, there is no simple off-the-shelf answer for this; you'd have to look into the research literature and implement one of the state-of-the-art algorithms yourself (and open-source it if possible - there is a real need for an open-source solution to this). A Google Scholar search for "curved line OCR extraction" will get you started, for example:
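Wavy lines are beyond a short example, but the simpler case of a uniform tilt can be measured with a projection-profile heuristic. The sketch below is an illustration under stated assumptions, not one of the research algorithms mentioned above: it takes a boolean mask of text pixels, tries candidate angles, and keeps the one for which text pixels pile up in the fewest rows. The function name `estimate_skew_degrees`, the search range of 5 degrees, and the step of 0.25 degrees are all hypothetical choices.

```python
import numpy as np

def estimate_skew_degrees(text_mask, max_angle=5.0, step=0.25):
    """Estimate a uniform skew angle from a boolean mask of text pixels.

    Every text pixel is projected onto the vertical axis along each
    candidate angle. When the candidate matches the text lines, pixels
    concentrate in few rows and the sum of squared row counts peaks.
    Positive angles mean lines that run downwards to the right.
    """
    ys, xs = np.nonzero(text_mask)
    if ys.size == 0:
        return 0.0
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rows = np.round(ys - xs * np.tan(np.radians(angle))).astype(int)
        profile = np.bincount(rows - rows.min()).astype(np.float64)
        score = float(np.sum(profile ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

# Synthetic example: three 3-pixel-thick lines tilted by 2 degrees.
h, w = 120, 300
mask = np.zeros((h, w), dtype=bool)
xs = np.arange(w)
drift = np.round(xs * np.tan(np.radians(2.0))).astype(int)
for base in (20, 50, 80):
    for t in range(3):
        mask[base + drift + t, xs] = True

angle = estimate_skew_degrees(mask)  # should land close to 2.0
```

The estimate only tells you how far to rotate the image before OCR; it does nothing for perspective distortion or curved baselines, which remain the hard part.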

Lastly: I think you would do much better to work with the Python ecosystem (ndimage, skimage) than with OpenCV in C++. The OpenCV Python wrappers are OK for simple stuff, but for what you're trying to do they won't do the job; you will need to grab many pieces that aren't in OpenCV (of course you can mix and match). Implementing something like curved line detection in C++ will take an order of magnitude longer than in Python (this is true even if you don't know Python).

Good luck!
