OCR on antialiased text


Problem description

I have to OCR a table from a PDF document. I wrote a simple Python + OpenCV script to extract the individual cells, but then a new problem arose: the text is antialiased and of poor quality, so Tesseract's recognition rate is very low. I tried preprocessing the images with adaptive thresholding, but the results weren't much better. I also tried the trial version of ABBYY FineReader, and it does give good output, but I don't want to use non-free software. I wonder whether some preprocessing would solve the issue, or whether it is necessary to write or learn another OCR system.

http://oi60.tinypic.com/ztzsrq.jpg http://i57.tinypic.com/xmpcm9.png

Solution

If you look closely at your antialiased text samples, you'll notice that the edges contain a lot of red and blue:

This suggests that the antialiasing is taking place inside your computer, which has used subpixel rendering to optimise the results for your LCD monitor.
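One rough way to check for this programmatically is to measure how many text pixels are tinted rather than neutral gray. The sketch below is only an illustration: the function names, the `min_spread` cutoff of 40 and the assumption of a pure white background are all made up for the example, and the image is represented as plain nested lists so the snippet has no dependencies.

```python
def channel_spread(pixel):
    """Difference between the largest and smallest of the R, G, B values."""
    return max(pixel) - min(pixel)


def fringe_fraction(image, min_spread=40, background=(255, 255, 255)):
    """Fraction of non-background pixels whose channels differ noticeably.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    `min_spread` and `background` are illustrative assumptions.
    """
    ink = [px for row in image for px in row if px != background]
    if not ink:
        return 0.0
    colored = sum(1 for px in ink if channel_spread(px) >= min_spread)
    return colored / len(ink)


# Grayscale antialiasing: the edge pixel is a neutral gray.
gray_edge = [[(255, 255, 255), (128, 128, 128), (0, 0, 0)]]

# Subpixel antialiasing: the edge pixels are tinted red or blue.
subpixel_edge = [[(255, 255, 255), (255, 120, 40), (0, 0, 0), (60, 130, 255)]]

print(fringe_fraction(gray_edge))
print(fringe_fraction(subpixel_edge))
```

A value close to zero points to ordinary grayscale antialiasing, while a large fraction of tinted edge pixels is consistent with subpixel rendering.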

If so, it should be quite easy to extract the text at a higher resolution. For example, you can use ImageMagick to extract images from PDF files at 300 dpi by using a command line like the following:

convert -density 300 source.pdf output.png
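Since the cells are already being extracted from a Python script, the same command could be driven from there. This is a minimal sketch, assuming ImageMagick is installed and on the PATH; `build_convert_command` and `render_pdf` are hypothetical helper names, and newer ImageMagick releases may expose the tool as `magick` rather than `convert`.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, out_path, density=300):
    """Return the ImageMagick argument list for the command shown above."""
    return ["convert", "-density", str(density), pdf_path, out_path]


def render_pdf(pdf_path, out_path, density=300):
    """Run ImageMagick if it can be found; return True on success."""
    command = build_convert_command(pdf_path, out_path, density)
    if shutil.which(command[0]) is None:
        return False  # ImageMagick was not found on PATH
    return subprocess.run(command).returncode == 0
```

Calling `render_pdf("source.pdf", "output.png")` would then produce the same output as the command line, and the density can be raised further if the text is still too small for Tesseract.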

You could even try loading the PDF in your favourite viewer and copying the text directly to the clipboard.


Addendum:

I tried converting your sample text back into its original pixels and applying the scaling technique mentioned in the comments. Here are the results:

Original image:

After scaling 300% and applying simple threshold:

After smart scaling and thresholding:

As you can see, some of the letters are still a bit malformed, but I think there's a better chance of reading this with Tesseract.
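The answer does not spell out how the pixels were converted back or which scaling method was used, so the following is only one possible reading of the idea, not the author's actual procedure. It assumes an RGB stripe layout, where the red, green and blue values of each pixel correspond to three horizontally adjacent subpixels. Each pixel is unpacked into three gray values, the rows are stretched by the same factor with linear interpolation to restore the aspect ratio, and a simple global threshold is applied. All function names and the cutoff of 128 are illustrative, and the image is a plain nested list to keep the sketch dependency-free.

```python
def unpack_subpixels(image):
    """Turn each (R, G, B) pixel into three horizontally adjacent gray values.

    Assumes an RGB stripe layout; a BGR panel would need the order reversed.
    """
    return [[value for px in row for value in px] for row in image]


def stretch_rows(gray, factor=3):
    """Enlarge the image vertically by `factor` using linear interpolation."""
    height = len(gray)
    out = []
    for y in range(height * factor):
        # Map the centre of the output row back into source coordinates.
        pos = min(max((y + 0.5) / factor - 0.5, 0.0), height - 1)
        lo = int(pos)
        hi = min(lo + 1, height - 1)
        t = pos - lo
        out.append([(1 - t) * a + t * b for a, b in zip(gray[lo], gray[hi])])
    return out


def threshold(gray, cutoff=128):
    """Simple global threshold: 0 for ink, 255 for background."""
    return [[0 if value < cutoff else 255 for value in row] for row in gray]


def prepare_for_ocr(image, factor=3, cutoff=128):
    """Unpack subpixels, restore the aspect ratio, then binarize."""
    return threshold(stretch_rows(unpack_subpixels(image), factor), cutoff)


# Tiny made-up example: one row with tinted edge pixels above a white row.
sample = [
    [(255, 200, 0), (0, 0, 255)],
    [(255, 255, 255), (255, 255, 255)],
]
for row in prepare_for_ocr(sample):
    print(row)
```

In a real script the interpolation and thresholding would more likely be done with OpenCV's `cv2.resize` and `cv2.threshold`; the pure Python version here is only meant to make the individual steps explicit.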
