Image cleaning before OCR application

Question

I have been experimenting with PyTesser for the past couple of hours, and it is a really nice tool. A couple of things I noticed about its accuracy:


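  1. Files with icons, images, and text - 5-10% accurate

  2. Files with only text (images and icons erased) - 50-60% accurate

  3. Files with stretching (and this is the best part) - stretching the files in 2) along the x or y axis increased the accuracy by 10-20%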
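The post does not say how these accuracy percentages were measured. For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of one possible metric, assuming you have a hand-typed ground-truth transcript for each image. The function name `char_accuracy` is hypothetical, and the character-level similarity from Python's standard `difflib` is an assumption, not the poster's method.

```python
from difflib import SequenceMatcher


def char_accuracy(expected: str, recognized: str) -> float:
    """Return a 0-100 similarity score between ground truth and OCR output.

    Assumption: accuracy is approximated by difflib's similarity ratio,
    which may differ from however the original figures were obtained.
    """
    # Collapse runs of whitespace so line-wrapping differences
    # are not counted as recognition errors.
    exp = " ".join(expected.split())
    rec = " ".join(recognized.split())
    if not exp and not rec:
        return 100.0
    # autojunk=False keeps the heuristic from discarding frequent characters
    # in long texts.
    return 100.0 * SequenceMatcher(None, exp, rec, autojunk=False).ratio()


if __name__ == "__main__":
    print(char_accuracy("Image cleaning before OCR", "lmage c1eaning before 0CR"))
```

Scoring each variant of a file (original, icons erased, stretched) against the same transcript would make the three cases above directly comparable.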

So apparently PyTesser does not account for font dimensions or image stretching. Although there is a lot of theory to read about image processing and OCR, are there any standard image cleanup procedures (apart from erasing icons and images) that should be applied before running PyTesser or other libraries, irrespective of the language?

............

...........

Wow, this post is quite old now. I started my research on OCR again these last couple of days. This time I chucked PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use).
2) After increasing the resolution, the accuracy went up by 80-90%.
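The exact shell commands are not given in the post. As a rough sketch of the two steps above, the following Python wraps ImageMagick's `-resize` option and the `tesseract` command line. The helper names, the file names, and the 300% scale factor are assumptions chosen for illustration, and `-resize` is only one way to "increase the resolution"; the poster may have used a different option.

```python
import shutil
import subprocess


def build_upscale_command(src, dst, scale_percent=300, executable="convert"):
    """Build an ImageMagick command that enlarges src by scale_percent.

    The 300% default is an illustrative assumption, not a value from the post.
    """
    if scale_percent <= 0:
        raise ValueError("scale_percent must be positive")
    return [executable, src, "-resize", f"{scale_percent}%", dst]


def build_ocr_command(image, output_base):
    """Build a Tesseract command; Tesseract writes its text to output_base + '.txt'."""
    return ["tesseract", image, output_base]


def upscale_and_ocr(src, upscaled, output_base, scale_percent=300):
    """Upscale src with ImageMagick, run Tesseract on it, return the text file path."""
    # ImageMagick 7 installs "magick"; older versions install "convert".
    executable = shutil.which("magick") or shutil.which("convert")
    if executable is None or shutil.which("tesseract") is None:
        raise RuntimeError("ImageMagick and Tesseract must be installed and on PATH")
    subprocess.run(
        build_upscale_command(src, upscaled, scale_percent, executable), check=True
    )
    subprocess.run(build_ocr_command(upscaled, output_base), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Only prints the commands, so it runs even without the tools installed.
    print(build_upscale_command("scan.png", "scan_large.png"))
    print(build_ocr_command("scan_large.png", "scan_text"))
```

Keeping the command construction separate from the `subprocess` calls makes it easy to inspect or log exactly what will be run before trying different scale factors.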

So the Tesseract engine is, in my experience, the best open source OCR engine available. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. The text layout and formatting in the image also make a big difference. It works great on images that contain just text. Hope this helps.

Recommended answer

Not sure whether your intent is commercial use or not, but this works wonders if you're performing OCR on a bunch of similar images.

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

[Image: original]

[Image: result after pre-processing with the given arguments]
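The arguments actually used for the image above are not preserved here. As a hedged sketch of how the linked textcleaner script might be driven from Python, the helper below builds a command line of the form `textcleaner [options] infile outfile`. The helper name, the `./textcleaner` script path, and the file names are hypothetical, and the example options are taken from my recollection of the script's own documentation (grayscale, stretch enhancement, filter size, offset), not from this answer, so check them against the script's usage page before relying on them.

```python
import subprocess


def build_textcleaner_command(src, dst, options=(), script="./textcleaner"):
    """Build a command line for the textcleaner script: script [options] src dst."""
    return [script, *options, src, dst]


# Illustrative options only; NOT the arguments used in the answer.
EXAMPLE_OPTIONS = ["-g", "-e", "stretch", "-f", "25", "-o", "10"]


def clean_image(src, dst, options=EXAMPLE_OPTIONS, script="./textcleaner"):
    """Run textcleaner on src; requires bash, ImageMagick, and the downloaded script."""
    subprocess.run(build_textcleaner_command(src, dst, options, script), check=True)
    return dst


if __name__ == "__main__":
    # Only prints the command, so it runs without the script being present.
    print(build_textcleaner_command("page.jpg", "page_clean.png", EXAMPLE_OPTIONS))
```

For a batch of similar images, the same option list can be reused across every file, which is where a preprocessing script like this tends to pay off.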
