How can I train my Python-based OCR with Tesseract to read different national identity cards?
Problem description
I am using Python to build an OCR system that reads ID cards and returns the exact text from the image, but the results are not correct: Tesseract reads many characters wrongly. How can I train Tesseract so that it reads the ID card reliably and returns the right details? Furthermore, how do I obtain the .tiff file and make Tesseract work for my project?
Recommended answer
Steps to improve Pytesseract recognition:
Clean your image arrays so that they contain only text (font-generated, not handwritten). The edges of the letters should be free of distortion. Apply a threshold (try different values) and some smoothing filters. I also recommend morphological opening/closing, but that is only a bonus. This is an exaggerated example of what should enter pytesseract recognition in the form of an array: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg
Resize the image containing the text you want to recognize to a higher resolution.
Pytesseract should generally recognize letters of any kind, but installing trained data for the font in which the text is written can greatly improve accuracy.
How to install a new font in pytesseract:
Get your desired font in TIFF format.
Upload it to http://trainyourtesseract.com/ and receive the trained data by email. (This site no longer exists. You now have to find an alternative or train the font yourself.)
Add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata
Pass the language string to the pytesseract recognition function:
Let's say you have two trained fonts: font1.traineddata and font2.traineddata
To use both, use this command:
txt = pytesseract.image_to_string(img, lang='font1+font2')
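The copy step and the language string can be sketched with the standard library alone. The helper names `install_traineddata` and `build_lang` are hypothetical, introduced here for illustration. The only assumption about Tesseract is the one the steps above already rely on: a file named `font1.traineddata` in the tessdata folder is selected with `lang='font1'`.

```python
import shutil
from pathlib import Path


def install_traineddata(src_file, tessdata_dir):
    """Copy a .traineddata file into tessdata and return its language name."""
    src = Path(src_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {dest_dir}")
    shutil.copy2(src, dest_dir / src.name)
    # The language name is the file name without the extension.
    return src.stem


def build_lang(*names):
    """Join language names with '+', as in lang='font1+font2'."""
    return "+".join(names)
```

With both files installed, `build_lang("font1", "font2")` gives the string to pass as `lang=` to `pytesseract.image_to_string`. Copying into a folder under Program Files may require administrator rights.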
Here is code to test recognition on an image from the web:
import urllib.request

import cv2
import numpy as np
import pytesseract

# Path to the Tesseract executable (adjust to your installation).
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
# Note: assigning a plain Python variable has no effect on Tesseract. If it cannot
# find its language data, set the TESSDATA_PREFIX environment variable instead.
TESSDATA_PREFIX = 'C:/Program Files (x86)/Tesseract-OCR'

def url_to_image(url):
    # Download the image and decode it into an OpenCV BGR array.
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

url = 'http://jeroen.github.io/images/testocr.png'
img = url_to_image(url)

# Smooth, then binarize. Try other filters and threshold values.
#img = cv2.GaussianBlur(img,(5,5),0)
img = cv2.medianBlur(img, 5)
retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
txt = pytesseract.image_to_string(img, lang='eng')
print('recognition:', txt)

>>> txt
'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
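The threshold of 150 used above is only one candidate, and the earlier advice is to try different values. One possible way to compare candidates, assuming you have an image whose correct text you already know, is to score each OCR result against that text. The sketch below uses `difflib` from the standard library. `pick_threshold` and its `ocr_at_threshold` callback are hypothetical names, and the similarity ratio is a rough proxy rather than a standard OCR metric.

```python
from difflib import SequenceMatcher


def similarity(expected, actual):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, expected, actual).ratio()


def pick_threshold(candidates, ocr_at_threshold, expected):
    """Return (threshold, score) for the candidate whose OCR text best matches.

    `ocr_at_threshold` is any callable that takes a threshold value and returns
    the recognized text, for example a wrapper around cv2.threshold followed by
    pytesseract.image_to_string.
    """
    scored = [(t, similarity(expected, ocr_at_threshold(t))) for t in candidates]
    # max() returns the first best entry, so ties go to the earlier candidate.
    return max(scored, key=lambda pair: pair[1])
```

The same loop works for other parameters, such as blur kernel size or scale factor, as long as the callback accepts the value being varied.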