Tesseract training for a new font


Problem description

I'm still new to Tesseract OCR, and after using it in my script I noticed a relatively high error rate on the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font. I also found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. From it I received a .traineddata file, but I am not quite sure what to do with it. Could anybody explain what I have to do with this file to make it work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.

Recommended answer

For anyone who is still going to read this: you can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or any other language, I think), pass lang="Font" as the second parameter of the image_to_string function, where "Font" is the name of your traineddata file without the .traineddata extension. It improves accuracy significantly, but it can of course still make mistakes. Alternatively, you can learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
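The two steps above (copy the file into tessdata, then pass its name as lang) could look roughly like the sketch below. It assumes the pytesseract and Pillow packages are installed; the helper names install_traineddata and ocr_with_font, the file name Font.traineddata, the image name and the tessdata path in the usage comment are all made up for illustration, so adjust them to your own setup.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into tessdata_dir and return its language name.

    Hypothetical helper, not part of Tesseract or pytesseract.
    """
    src = Path(traineddata_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dest_dir / src.name)
    # Tesseract refers to a language by the file name without the extension.
    return src.stem


def ocr_with_font(image_path, lang, tessdata_dir=None):
    """Run OCR on an image using the given traineddata language name."""
    # Imported lazily so the copy helper above works without OCR packages.
    import pytesseract
    from PIL import Image

    # Only needed when the file is not in Tesseract's default tessdata folder.
    config = f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""
    return pytesseract.image_to_string(Image.open(image_path), lang=lang, config=config)


# Example usage (paths are assumptions, change them for your machine):
# lang = install_traineddata("Font.traineddata", "/usr/share/tesseract-ocr/4.00/tessdata")
# print(ocr_with_font("scan.png", lang))
```

If Tesseract reports that it cannot load the language, the file most likely ended up in a different tessdata folder than the one your installation reads from.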
