Image processing to improve Tesseract OCR accuracy

Question

I've been using Tesseract to convert documents into text. The quality of the documents varies wildly, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that highly pixellated text (for example, text generated by fax machines) is especially difficult for Tesseract to process; presumably all those jagged edges on the characters confound the shape-recognition algorithms.

What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixellated images and seen some small improvement, but I'm hoping that there is a more specific technique that would yield better results. Say a filter that was tuned to black and white images, which would smooth out irregular edges, followed by a filter which would increase the contrast to make the characters more distinct.

Any general tips for someone who is a novice at image processing?

Recommended answer


  1. Fix the DPI if needed; 300 DPI is the minimum.
  2. Fix the text size (e.g. 12 pt should be OK).
  3. Try to fix the text lines (deskew and dewarp the text).
  4. Try to fix the illumination of the image (e.g. no dark regions in the image).
  5. Binarize and de-noise the image.
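
As a rough illustration of steps 1 and 5, here is a minimal pure-Python sketch. It is not taken from Tesseract or any particular tool. It assumes the page is already an 8-bit grayscale image stored as a list of rows, and it swaps in two common techniques as examples: Otsu's method for choosing the binarization threshold and a 3x3 median filter for de-noising. The function names (`scale_factor_for_dpi`, `otsu_threshold`, `binarize`, `median_denoise`) are made up for this example, and the choice to only upscale, never downscale, is an assumption.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Factor by which to enlarge a scan so it reaches target_dpi.

    Assumption: scans already at or above the target are left alone (1.0).
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def otsu_threshold(pixels):
    """Gray level (0-255) that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        if weight_dark == 0:
            continue  # no pixels at or below t yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is at or below t
        sum_dark += t * hist[t]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold=None):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


def median_denoise(pixels):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = sorted(
                pixels[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            )
            out_row.append(window[len(window) // 2])
        result.append(out_row)
    return result
```

For real documents an image library or ImageMagick would do this far faster; the point here is only the order of operations: rescale, clean up isolated specks, then threshold.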

There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can try TEXTCLEANER from Fred's ImageMagick Scripts.
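
To make "blur and sharpen" concrete, the following is a minimal pure-Python sketch of an unsharp mask: the image is blurred, and the difference between the original and the blurred copy is added back to exaggerate edges. This is a generic illustration, not what TEXTCLEANER does internally. It assumes an 8-bit grayscale image stored as a list of rows; `box_blur`, `unsharp_mask`, and the `amount` parameter are names invented for the example.

```python
def box_blur(pixels):
    """3x3 mean blur; border pixels average only the neighbours that exist."""
    height, width = len(pixels), len(pixels[0])
    blurred = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                pixels[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            ]
            out_row.append(round(sum(window) / len(window)))
        blurred.append(out_row)
    return blurred


def unsharp_mask(pixels, amount=1.0):
    """Sharpen by adding back (original - blurred), clamped to 0..255."""
    blurred = box_blur(pixels)
    sharpened = []
    for row, blurred_row in zip(pixels, blurred):
        out_row = []
        for value, soft in zip(row, blurred_row):
            boosted = round(value + amount * (value - soft))
            out_row.append(max(0, min(255, boosted)))
        sharpened.append(out_row)
    return sharpened
```

On a soft edge such as a row of `[100, 120, 180, 200]`, the darker side gets darker and the lighter side gets lighter, which is the contrast boost around character outlines that the question asks about. How much blur and how large an `amount` help will depend on the scan, so these are values to experiment with.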

If you are not a fan of the command line, you could try the open-source scantailor.sourceforge.net or the commercial bookrestorer.
