Increase Accuracy of text recognition through pytesseract & PIL

Problem Description

So I am trying to extract text from an image. As the quality and size of the image are not good, it gives inaccurate results. I tried a few enhancements and other things with PIL, but that only worsened the image quality.

Can someone suggest some enhancements to the image to get better results? A few examples of images:

Recommended Answer

In the provided example image the text is visually of quite good quality, so the question is: how does OCR end up giving inaccurate results?

To illustrate the conclusions given further down in this answer, let's run the given image through Tesseract. Below is the result of Tesseract OCR:

"fhpgearedmomrs©gmachom"

Now let's resize the image to four times its size and apply thresholding to it. I did the resizing and thresholding manually in Gimp, but with an appropriate resizing method and threshold value it can certainly be automated with PIL, so that after the enhancement you get an image similar to the enhanced one I got:
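
A minimal sketch of how that enhancement could be automated with PIL and pytesseract follows; the 4x LANCZOS resize, the threshold value of 150 and the file name input.png are assumptions that may need tuning for other images:

# Sketch: enlarge 4x, binarize with a fixed threshold, then OCR.
# The scale factor, threshold value and file name are assumptions to tune.
from PIL import Image
import pytesseract

img = Image.open("input.png").convert("L")                         # grayscale
img = img.resize((img.width * 4, img.height * 4), Image.LANCZOS)   # enlarge 4x
img = img.point(lambda p: 255 if p > 150 else 0)                   # simple threshold
print(pytesseract.image_to_string(img))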

The improved image run through Tesseract OCR gives the following text:

"fhpgearedmotors©gmail.com"

"fhpgearedmotors©gmail.com"

This demonstrates that enlarging the image can help achieve 100% accuracy on the provided text-image example.

It may seem odd that enlarging an image helps achieve better OCR accuracy, BUT ... OCR was developed to convert scans of printed media to text and by design expects 300 dpi images of the text. This explains why some OCR programs don't resize the text themselves to improve their results and perform poorly on small fonts: they expect a higher-dpi image, which can be obtained by enlarging it.

Here is an excerpt from the Tesseract FAQ on github.com proving the statement above:

[There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".]
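
Following that rule of thumb, a simple way to choose the enlargement factor is to aim for an x-height of about 20 pixels; in the sketch below the measured x-height of 5 px is only a placeholder you would replace with a value measured on your own image:

# Heuristic from the FAQ quote: scale so the lower-case x-height
# lands around 20 px. The measured value here is only a placeholder.
measured_x_height_px = 5      # placeholder: measure this on your image
target_x_height_px = 20
scale = max(1, round(target_x_height_px / measured_x_height_px))
print(f"suggested enlargement factor: {scale}x")   # -> 4x for this example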
