How can I train my Python-based OCR with Tesseract to work with different national identity cards?


Problem description

I am building an OCR system in Python that reads ID cards and should return the exact text from the image, but the results are wrong because Tesseract misreads many characters. How can I train Tesseract so that it reads the ID card reliably and returns accurate details? Also, how do I obtain the .tiff file and get Tesseract working for my project?

Recommended answer

Steps to improve Pytesseract recognition:

  1. Clean your image arrays so that only text remains (font-generated, not handwritten). The edges of the letters should be free of distortion. Apply a threshold (try different values) and some smoothing filters. I also recommend morphological opening/closing, but that is only a bonus. This is an exaggerated example of what should go into pytesseract recognition as an array: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg

  2. Resize the image containing the text you want to recognize to a higher resolution.

  3. Pytesseract should generally recognize letters of any kind, but installing the font in which the text is written greatly increases accuracy.

How to install a new font in pytesseract:

  1. Get your desired font in TIFF format.

  2. Upload it to http://trainyourtesseract.com/ and receive the trained data by email. (This site no longer exists. You now have to find an alternative or train the font yourself.)

  3. Add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata

  4. Pass the trained font names as the lang string to the pytesseract recognition function:

  • Let's say you have 2 trained fonts: font1.traineddata and font2.traineddata

      To use both, use this command:

      txt = pytesseract.image_to_string(img, lang='font1+font2')
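A common failure at this point is a `.traineddata` file that is missing from the tessdata folder or named differently from the string passed in `lang`. The sketch below checks for the files before building the `font1+font2` string. The helper name `build_lang_string` is hypothetical, and the check assumes that each name in `lang` corresponds to a file `<name>.traineddata` in the folder you pass in.

```python
from pathlib import Path


def build_lang_string(tessdata_dir, names):
    """Join model names as 'a+b' after checking each .traineddata file exists.

    Hypothetical helper: `tessdata_dir` is the folder holding the trained
    data files, `names` is a list such as ['font1', 'font2'].
    """
    tessdata = Path(tessdata_dir)
    missing = [n for n in names
               if not (tessdata / f"{n}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(
            f"No .traineddata file in {tessdata} for: {', '.join(missing)}"
        )
    # Tesseract accepts several models joined with '+'.
    return "+".join(names)
```

With the folder from the step above, usage might look like `lang = build_lang_string(r"C:\Program Files (x86)\Tesseract-OCR\tessdata", ["font1", "font2"])`, followed by `pytesseract.image_to_string(img, lang=lang)`.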

      Here is code to test your recognition on an image from the web:

      import urllib.request

      import cv2
      import numpy as np
      import pytesseract

      # Path to the Tesseract executable (default Windows install location)
      pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
      # Note: TESSDATA_PREFIX only takes effect as an environment variable;
      # assigning it to a Python variable, as the original snippet did, does nothing.

      def url_to_image(url):
          """Download an image and decode it into an OpenCV BGR array."""
          resp = urllib.request.urlopen(url)
          image = np.asarray(bytearray(resp.read()), dtype="uint8")
          image = cv2.imdecode(image, cv2.IMREAD_COLOR)
          return image

      url = 'http://jeroen.github.io/images/testocr.png'
      img = url_to_image(url)

      # Smooth, then binarize with a fixed threshold (try different values)
      #img = cv2.GaussianBlur(img, (5, 5), 0)
      img = cv2.medianBlur(img, 5)
      retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)

      txt = pytesseract.image_to_string(img, lang='eng')
      print('recognition:', txt)

      # Value of txt reported in the original answer:
      # 'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
      
