Python - Pytesseract extracts incorrect text from image


Problem description

I used the following Python code to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)

    #  Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...

    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove template file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

But the output is incorrect. The input file is:

The output received is '0001' instead of 'D001'.

The output received is '3001' instead of 'B001'.

What code changes are required to retrieve the correct characters from the image, and to train pytesseract to return the correct characters for all font types in the image (including bold characters)?

Recommended answer

@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.

You can still improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider, if they help:

  1. Try disabling the dictionary check feature in the Tesseract input parameters.
  2. Use heuristic information from your dataset. From the sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace a leading digit in the output with the most probable letter for your dataset. For example, '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
  3. Tesseract also provides a character-level recognition confidence value, so use that information to replace characters with the candidate that has the highest confidence value.
