Tesseract: using a subset of letters
Question
I'm using the tesseract-ocr package on Ubuntu Linux. I have been using it for a while, and I think that to improve OCR accuracy I only need a subset of the characters it can recognize. The characters I need are:
0123456789abcdefghijklmnopqrstuvwxyz
and only those, not even capital letters. Can anybody help me tell tesseract to match against only this subset of characters?
Thanks
Recommended answer
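import tesseract  # python-tesseract bindings
api = tesseract.TessBaseAPI()
# First argument is the data path; "eng" selects the English language data
api.Init(".","eng",tesseract.OEM_DEFAULT)
# Restrict recognition to digits and lowercase letters
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)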
So just pass your own set of characters to api.SetVariable.
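If you drive the tesseract command-line program rather than the API, the same variable can usually be set with the -c option. The following is a minimal sketch, assuming a tesseract build recent enough to accept -c var=value; the helper names build_command and run_ocr are made up for illustration.

```python
import string
import subprocess

# Digits plus lowercase letters: the subset asked for in the question.
WHITELIST = string.digits + string.ascii_lowercase


def build_command(image_path, output_base, whitelist=WHITELIST):
    """Return the argument list for a tesseract run limited to `whitelist`.

    Assumption: the installed tesseract accepts `-c var=value`.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, output_base):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(image_path, output_base), check=True)
```

Calling run_ocr("image.tif", "outputbase") would then write the recognized text to outputbase.txt, provided tesseract is on your PATH.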
Tesseract 2.03: use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
before calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: until the old and new config variables are merged, you must also pass the nobatch parameter.
Tesseract 3: a digits config file already exists, so just run a tesseract command like this:
tesseract imagename outputbase digits
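The stock digits config only covers 0-9, so for the character set in the question you would need a config file of your own. Here is a small sketch that writes one in the same one-line format as the digits file above; the file name lower_alnum and the helper write_whitelist_config are hypothetical, and the directory you pass in depends on where your tessdata/configs lives.

```python
import string
from pathlib import Path

# Digits plus lowercase letters: the subset asked for in the question.
WHITELIST = string.digits + string.ascii_lowercase


def write_whitelist_config(configs_dir, name="lower_alnum", whitelist=WHITELIST):
    """Write a one-line config file (variable name, space, value) and return its path.

    `name` is a made-up file name; pick any name that does not clash
    with an existing config in your tessdata/configs directory.
    """
    path = Path(configs_dir) / name
    path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="ascii")
    return path
```

After writing the file into your tessdata/configs directory, the command line would mirror the digits one: tesseract imagename outputbase lower_alnum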