Preprocessing image for Tesseract OCR with OpenCV

Question

I'm trying to develop an app that uses Tesseract to recognize text from documents photographed with a phone's camera. I'm using OpenCV to preprocess the image for better recognition, applying a Gaussian blur and a threshold method for binarization, but the result is pretty bad.

Here is the image I'm using for tests:

(preprocessed image here)

What other filters can I use to make the image more readable for Tesseract?

Recommended answer

I described some tips for preparing images for Tesseract here: Using tesseract to recognize license plates

In your example, there are several things going on...

You need to get the text to be black and the rest of the image white (not the reverse). That's what character recognition is tuned for. Grayscale is OK, as long as the background is mostly full white and the text mostly full black; the edges of the text may be gray (antialiased), and that may help recognition (but not necessarily; you'll have to experiment).
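As a rough illustration of the polarity point, here is a minimal sketch that flips an image when it looks like light text on a dark background. The function name `ensure_dark_text_on_light` and the median-based heuristic are assumptions made for this example, not something Tesseract or OpenCV provides; it assumes an 8-bit grayscale image in which text covers less area than the background.

```python
import numpy as np

def ensure_dark_text_on_light(gray: np.ndarray) -> np.ndarray:
    """Return an 8-bit grayscale image with a light background and dark text.

    Heuristic (an assumption of this sketch): text covers less area than
    the background, so the median pixel value approximates the background.
    """
    if gray.dtype != np.uint8 or gray.ndim != 2:
        raise ValueError("expected a 2-D uint8 grayscale image")
    if np.median(gray) < 128:
        # Background is dark, so flip intensities (0 <-> 255).
        return 255 - gray
    return gray

# Synthetic example: a light "stroke" on a dark background.
page = np.zeros((20, 20), dtype=np.uint8)
page[8:12, 5:15] = 255
fixed = ensure_dark_text_on_light(page)  # background is now white, stroke black
```

On real photos the median test can be fooled (for example by large dark margins), so treat it as a starting point and check the result visually.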

One of the issues you're seeing is that in some parts of the image, the text is really "thin" (and gaps in the letters show up after thresholding), while in other parts it is really "thick" (and letters start merging). Tesseract won't like that :) It happens because the input image is not evenly lit, so a single threshold doesn't work everywhere. The solution is to do "locally adaptive thresholding", where a different threshold is calculated for each neighborhood of the image. There are many ways of doing that, but check out for example:

  • Adaptive gaussian thresholding in OpenCV with cv2.adaptiveThreshold(...,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,...)
  • Local Otsu's method
  • Local adaptive histogram equalization

Another problem you have is that the lines aren't straight. In my experience Tesseract can handle a very limited degree of non-straight lines (a few percent of perspective distortion, tilt or skew), but it doesn't really work with wavy lines. If you can, make sure that the source images have straight lines :) Unfortunately, there is no simple off-the-shelf answer for this; you'd have to look into the research literature and implement one of the state-of-the-art algorithms yourself (and open-source it if possible; there is a real need for an open source solution to this). A Google Scholar search for "curved line OCR extraction" will get you started, for example:

  • Text line Segmentation of Curved Document Images

Lastly: I think you would do much better to work with the Python ecosystem (ndimage, skimage) than with OpenCV in C++. The OpenCV Python wrappers are OK for simple stuff, but for what you're trying to do they won't do the job; you will need to grab many pieces that aren't in OpenCV (of course you can mix and match). Implementing something like curved line detection in C++ will take an order of magnitude longer than in Python (this is true even if you don't know Python).

Good luck!
