Why can't Pytesseract recognize plain white text on black?


Question

I have a lot of images like the one below, and I need to use pytesseract to extract the white text from them:

I use the following code, but the results are not impressive:

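import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Load the image and print the recognized text
im = Image.open('topLine.png')
print(pytesseract.image_to_string(im))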

Result:

Rouse Services | Renta Dastbonrd | Blei Rental



RJ |G | B (mmm @

So I thought the cause was the non-text content in the image. I cropped out the part of the image containing the text that matters most to me and ran the same code against it:

However, all I got was blank output. Pytesseract didn't find any text at all. What am I doing wrong?

Recommended answer

To answer your original question: I believe Tesseract's training data contains only black text on a white background, so it is not surprising that the machine learning algorithm won't pick up the inverse. As for the solution, if the black box with white text is in the same spot in every image, I would just crop it out, invert it, and paste it back in the same spot. Otherwise, you can use erode/dilate operations with a customized kernel to find these black boxes and build a mask over that part of the image. With that mask you can tell Python: here is a black box with white text, invert it. In my experience, pytesseract always needs at least some image preprocessing (if not a lot) to produce good output, but even with the most degraded images I have been able to get accuracies above 93%.
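The crop, invert, and paste-back idea for a banner at a known position could look roughly like the sketch below. It uses Pillow only for the image work. The function names, the `box` coordinates, and the brightness `threshold` of 128 are assumptions made for illustration, not part of the original answer. It also swaps in a simple mean-brightness check to decide whether a region is dark, rather than the erode/dilate mask search mentioned above, which is not shown here.

```python
from PIL import Image, ImageOps, ImageStat


def invert_dark_region(image, box, threshold=128):
    """Return a copy of `image` with `box` inverted if that region is mostly dark.

    `box` is (left, upper, right, lower) in pixels. `threshold` is an assumed
    cutoff on the 0-255 mean grayscale brightness of the region.
    """
    result = image.convert("RGB")  # new image; also drops any alpha channel
    region = result.crop(box)
    mean_brightness = ImageStat.Stat(region.convert("L")).mean[0]
    if mean_brightness < threshold:
        # White-on-black becomes black-on-white, pasted back in the same spot
        result.paste(ImageOps.invert(region), box)
    return result


def ocr_with_inverted_region(path, box):
    """Invert the dark banner at `box`, then run Tesseract on the whole image."""
    # Imported here so invert_dark_region() works without Tesseract installed
    import pytesseract

    with Image.open(path) as im:
        prepared = invert_dark_region(im, box)
    return pytesseract.image_to_string(prepared)
```

A call such as `ocr_with_inverted_region('topLine.png', (0, 0, 400, 40))` would then process the image from the question, where the box coordinates are placeholders that have to be measured on the real image. If the banner position varies between images, the fixed `box` would need to be replaced by a detection step.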
