Reading text from image
Any suggestions on converting these images to text? I'm using pytesseract and it's working wonderfully in most cases except this. Ideally I'd read these numbers exactly. Worst case I can just try to use PIL to determine if the number to the left of the '/' is a zero. Start from the left and find the first white pixel, then
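from PIL import Image
from pytesseract import image_to_string

# Page segmentation mode 10 treats the image as a single character.
# Tesseract 4 and later spell this flag "--psm"; older 3.x releases used "-psm".
myText = image_to_string(Image.open("tmp/test.jpg"), config='-psm 10')
# Default page segmentation, for comparison.
myText = image_to_string(Image.open("tmp/test.jpg"))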
The slash in the middle causes issues here. I've also tried using PIL's '.paste' to add lots of extra black around the image. There might be a few other PIL tricks I could try, but I'd rather not go that route unless I have to.
I tried using config='-psm 10' but my 8's were coming through as ":" sometimes, and random characters other times. And my 0's were coming through as nothing.
Related question: "pytesseract don't work with one digit image for the -psm 10"
EDIT: Additional samples:
1BJ2I]
DIS
10.I'10
20.I20
So I'm doing some voodoo conversions that seem to be working for now, but the approach looks very error-prone:
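def ConvertPPTextToReadableNumbers(text):
    # RemoveNonASCIICharacters is a helper that is not shown in this post.
    text = RemoveNonASCIICharacters(text)
    # Order matters: longer patterns are replaced before their substrings.
    text = text.replace("I]", "0")
    text = text.replace("|]", "0")
    text = text.replace("l]", "0")
    text = text.replace("B", "8")
    text = text.replace("D", "0")
    text = text.replace("S", "5")
    text = text.replace(".I'", "/")
    text = text.replace(".I", "/")
    # A bare "I" is what turns "DIS" into "0/5", matching the output listed below.
    text = text.replace("I", "/")
    text = text.replace("J", "/")
    return text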
Ultimately generates:
ConvertPPTextToReadableNumbers return text = 18/20
ConvertPPTextToReadableNumbers return text = 0/5
ConvertPPTextToReadableNumbers return text = 10/10
ConvertPPTextToReadableNumbers return text = 20/20
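Because a chain of character substitutions can silently produce a wrong but plausible string, one way to make the cleanup less fragile is to check the result against the expected "number/number" shape and reject anything else. The sketch below is only an illustration: `parse_score` and `SCORE_PATTERN` are hypothetical names, and it assumes every reading is two non-negative integers separated by a single slash, which matches the four samples above but may not hold for other images.

```python
import re

# Assumed shape of a valid reading: digits, a slash, digits (e.g. "18/20").
SCORE_PATTERN = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*$")


def parse_score(cleaned_text):
    """Return (left, right) as ints, or None if the text is not 'N/M'."""
    match = SCORE_PATTERN.match(cleaned_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Well-formed readings parse; leftovers from a missed substitution do not.
for sample in ["18/20", "0/5", "10/10", "0I5", ""]:
    print(repr(sample), "->", parse_score(sample))
```

A `None` result is a signal to retry with different OCR settings or to flag the image for manual review, rather than passing a malformed string downstream.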
Generally speaking, most OCR tools (like Tesseract) are tuned for working with high-resolution scans of printed text. They do not perform well on low-resolution or pixellated images.
Two possible approaches here are:
- If the font, background, and layout of your images are completely predictable, you don't need Tesseract at all; it's just complicating matters. Build a library of images representing each character you need to recognize, and check whether parts of the image are equal to the reference image.
- If that isn't an option, or if it seems too hard, you could upscale the pixellated image using one of the hq*x algorithms. The added detail may be sufficient to get Tesseract to reliably recognize the characters.
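The first approach (comparing parts of the image against reference images) can be sketched without any OCR at all. The example below is a minimal illustration, not a drop-in solution: it assumes the image has already been reduced to a 2D grid of 0/1 pixels (for example by thresholding with PIL), that every glyph has the same fixed width, and that glyphs sit side by side with no gaps. The tiny 3x5 glyphs and the names `GLYPHS`, `to_bitmap`, `crop`, and `read_fixed_width` are invented for the demonstration.

```python
# Reference library: each character is a small bitmap (rows of 0/1 pixels).
# These 3x5 glyphs are made up for illustration; real ones would be
# cropped from known-good screenshots.
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "/": ["001", "001", "010", "100", "100"],
}
GLYPH_WIDTH = 3
GLYPH_HEIGHT = 5


def to_bitmap(rows):
    """Convert rows of '0'/'1' characters into lists of ints."""
    return [[int(ch) for ch in row] for row in rows]


def crop(bitmap, left, width):
    """Return the vertical slice of the bitmap starting at column `left`."""
    return [row[left:left + width] for row in bitmap]


def read_fixed_width(bitmap):
    """Match each fixed-width cell against the library; '?' if unknown."""
    library = {char: to_bitmap(rows) for char, rows in GLYPHS.items()}
    result = []
    for left in range(0, len(bitmap[0]), GLYPH_WIDTH):
        cell = crop(bitmap, left, GLYPH_WIDTH)
        match = next((c for c, ref in library.items() if ref == cell), "?")
        result.append(match)
    return "".join(result)


# Build a test image for "10/10" by joining glyph rows side by side.
image = to_bitmap(
    ["".join(GLYPHS[c][r] for c in "10/10") for r in range(GLYPH_HEIGHT)]
)
print(read_fixed_width(image))
```

Exact equality only works when the pixels really are identical from image to image. Since the source file here is a JPEG, compression noise may make a tolerance-based comparison (counting mismatched pixels and accepting the closest glyph below some threshold) a safer choice.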