Python - Pytesseract extracts incorrect text from image
Question
I used the code below in Python to extract text from an image:
import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on disk
src_path = "<dir path>"


def get_string(img_path):
    # Read image with OpenCV
    img = cv2.imread(img_path)
    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write image after noise removal
    cv2.imwrite(src_path + "removed_noise.png", img)
    # Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    # Write the image after apply opencv to do some ...
    cv2.imwrite(src_path + "thres.png", img)
    # Recognize text with Tesseract (note: this reads the original image, not thres.png)
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
    # Remove template file
    #os.remove(temp)
    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is,
The output received is '0001' instead of 'D001'
The output received is '3001' instead of 'B001'
What code changes are required to retrieve the right characters from the image, and to train pytesseract to return the right characters for all font types in the image (including bold characters)?
Recommended answer
@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
Still, you can improve your final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider and use if they help:
- Try disabling the dictionary check feature through Tesseract's input parameters.
- Use heuristics based on your dataset. From the sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace a leading digit in the output with the most probable letter for your dataset. For example, '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
- Tesseract also provides a character-level recognition confidence value, so use that information to replace characters with the candidate that has the highest confidence.
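The first two points could be sketched roughly as follows. This is only an illustration under several assumptions: each image holds a single line of text (hence `--psm 7`, which is the Tesseract 4+ spelling of the flag), every code is one letter followed by digits, and the look-alike table contains just the two misreads reported in the question. The names `fix_leading_char`, `read_code`, `LEADING_DIGIT_TO_LETTER` and `NO_DICT_CONFIG` are made up for this example; `load_system_dawg` and `load_freq_dawg` are the Tesseract variables that control the word-list lookups.

```python
import re

# Look-alike substitutions for the first character only. The two entries come
# from the misreads reported in the question; extend this for your own data.
LEADING_DIGIT_TO_LETTER = {"0": "D", "3": "B"}

# Assumed Tesseract options: treat the image as a single text line and skip
# the word-list (dictionary) lookups that bias results toward known words.
NO_DICT_CONFIG = "--psm 7 -c load_system_dawg=0 -c load_freq_dawg=0"


def fix_leading_char(text):
    """Replace a misread leading digit when the whole token is digits."""
    token = text.strip()
    if re.fullmatch(r"\d{2,}", token) and token[0] in LEADING_DIGIT_TO_LETTER:
        return LEADING_DIGIT_TO_LETTER[token[0]] + token[1:]
    return token


def read_code(img_path):
    """OCR one image with the dictionary disabled, then apply the heuristic."""
    # Imported here so fix_leading_char can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    raw = pytesseract.image_to_string(Image.open(img_path), config=NO_DICT_CONFIG)
    return fix_leading_char(raw)
```

Calling `read_code(src_path + "test.jpg")` would run the OCR and then the correction. The heuristic only fires when the whole token is digits, so an output that already starts with a letter is left untouched. For the third point, note that `pytesseract.image_to_data` reports confidence per word rather than per character, so character-level confidence would likely need a different interface to Tesseract.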