Empty string with Tesseract


Question

I'm trying to read different cropped images from a big file. I manage to read most of them, but some of them return an empty string when I try to read them with Tesseract.

The code is just this line:

pytesseract.image_to_string(cv2.imread("img.png"), lang="eng")

Is there anything I can try in order to read these kinds of images?

Thanks in advance.

Recommended answer

Thresholding the image before passing it to pytesseract can improve the accuracy.

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Load the image as grayscale, then binarize it with a fixed threshold of 125
img = Image.open('num.png').convert('L')
ret, img = cv2.threshold(np.array(img), 125, 255, cv2.THRESH_BINARY)

# Older versions of pytesseract need a Pillow image
# Convert back if needed
img = Image.fromarray(img.astype(np.uint8))

print(pytesseract.image_to_string(img))

This prints:

5.78 / C02
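
To make the thresholding step concrete, here is a minimal pure-Python sketch of the per-pixel rule that cv2.THRESH_BINARY applies: values strictly greater than the threshold become the maximum value, and everything else becomes 0. The function name binary_threshold and the sample pixel values are illustrative assumptions, not part of the original answer; in practice cv2.threshold applies this rule to the whole array at once.

```python
def binary_threshold(pixels, thresh=125, maxval=255):
    """Mimic cv2.THRESH_BINARY on a 2D list of grayscale values.

    Hypothetical helper for illustration only; cv2.threshold performs
    the same per-pixel rule on a NumPy array.
    """
    return [[maxval if p > thresh else 0 for p in row] for row in pixels]


# Made-up 2x4 grayscale patch: dark text (low values) on a light background
patch = [
    [200, 190, 40, 210],
    [125, 126, 30, 180],
]

binary = binary_threshold(patch)
print(binary)  # [[255, 255, 0, 255], [0, 255, 0, 255]]
```

Note that a pixel exactly equal to the threshold (125 here) maps to 0, so faint text close to the threshold value can disappear. If a crop still comes back empty, the threshold value is one of the first things worth adjusting.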

Thresholding alone on the second image returns 11.1. Another step that can help is to set the page segmentation mode to "Treat the image as a single text line" with the config --psm 7. Doing this on the second image returns 11.1 "202 ', where the quotation marks come from the partial text at the top of the image. To ignore those characters, you can also restrict which characters Tesseract searches for with a whitelist, via the config -c tessedit_char_whitelist=0123456789.%. Putting everything together:

pytesseract.image_to_string(img, config='--psm 7 -c tessedit_char_whitelist=0123456789.%')
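
If the same options are reused across many crops, the config string can be assembled in one place. The sketch below is only an illustration: build_tesseract_config and read_single_line are hypothetical names, and the defaults simply mirror the values used in the line above. The only pytesseract call it relies on is image_to_string with its config argument.

```python
def build_tesseract_config(psm=7, whitelist="0123456789.%"):
    """Build a config string for pytesseract.image_to_string.

    Hypothetical helper: the name and defaults are illustrative.
    --psm selects the page segmentation mode (7 = single text line);
    tessedit_char_whitelist limits which characters may be returned.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_single_line(img, psm=7, whitelist="0123456789.%"):
    """Run OCR on an already thresholded image with the config above."""
    import pytesseract  # imported here so the helper above works without it

    return pytesseract.image_to_string(
        img, config=build_tesseract_config(psm, whitelist)
    ).strip()


print(build_tesseract_config())
# --psm 7 -c tessedit_char_whitelist=0123456789.%
```

Calling read_single_line(img) on the thresholded image is then equivalent to the one-liner above, with surrounding whitespace stripped from the result.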

This returns 11.1 202. Clearly pytesseract is having a hard time with the percent symbol, and I'm not sure how to improve that with image processing or config changes.
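
Because a single page segmentation mode may still return an empty string for some crops, one option is to try a few modes in turn and keep the first non-empty result. This is a sketch under assumptions rather than part of the original answer: first_non_empty is a hypothetical helper, and the list of modes tried (7 = single text line, 6 = single uniform block, 8 = single word, 13 = raw line) is just one plausible ordering.

```python
def first_non_empty(img, psm_modes=(7, 6, 8, 13), ocr=None):
    """Try several page segmentation modes and return the first non-empty result.

    Hypothetical helper, not from the original answer. `ocr` is any
    callable taking (image, config=...); it defaults to
    pytesseract.image_to_string.
    Returns a (psm, text) tuple, or (None, "") if every mode comes back empty.
    """
    if ocr is None:
        import pytesseract  # imported lazily so a stand-in can be used instead

        ocr = pytesseract.image_to_string

    for psm in psm_modes:
        text = ocr(img, config=f"--psm {psm}").strip()
        if text:
            return psm, text
    return None, ""


# Stand-in OCR function for demonstration: only "--psm 8" yields text
def fake_ocr(img, config=""):
    return "11.1\n" if config == "--psm 8" else ""


print(first_non_empty(None, ocr=fake_ocr))  # (8, '11.1')
```

With real images, omit the ocr argument so that pytesseract.image_to_string is used, and pass the thresholded image as img.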
