Tesseract: use a subset of letters

Views: 108 Published: 2020/5/19 19:33:56 Tags: python linux ocr captcha tesseract

Question

I'm using the tesseract-ocr package on Ubuntu Linux. I have been using it for a while, and I think that to improve the accuracy of the OCR I only need a subset of the characters. The characters I need are:

0123456789abcdefghijklmnopqrstuvwxyz

and only those, not even capital letters. Can anybody give me a hand with telling Tesseract to match against only this subset of characters?

Thanks

Recommended answer

From the python-tesseract project page:

import tesseract

api = tesseract.TessBaseAPI()
api.Init(".", "eng", tesseract.OEM_DEFAULT)
# Restrict recognition to digits and lowercase letters.
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)

So just set your own collection of characters in the api.SetVariable call.
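
For readers who use the pytesseract wrapper rather than the python-tesseract bindings quoted above, the same variable can be passed through Tesseract's -c command-line option. The following is a minimal sketch under that assumption: pytesseract and Pillow are not part of the original answer, and the helper names whitelist_config and ocr_with_whitelist are made up for illustration.

```python
import string

# The characters the question wants to allow: digits plus lowercase letters.
WHITELIST = string.digits + string.ascii_lowercase


def whitelist_config(chars):
    """Build a Tesseract '-c' option that restricts recognition to `chars`."""
    # The option string is split on whitespace later, so reject whitespace here.
    if any(c.isspace() for c in chars):
        raise ValueError("whitelist must not contain whitespace")
    return "-c tessedit_char_whitelist=" + chars


def ocr_with_whitelist(image_path, chars=WHITELIST):
    """Run OCR on an image file, allowing only `chars`.

    Assumes the third-party packages pytesseract and Pillow are installed.
    """
    # Imported lazily so whitelist_config() works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, config=whitelist_config(chars))
```

Calling ocr_with_whitelist("image.tif") would then run Tesseract with the alphanumeric whitelist from the question; whether this improves accuracy depends on the images and the Tesseract version in use.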

From the tesseract-ocr project FAQ:

In Tesseract 2.03, use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

In Tesseract 3, a digits config file is already created, so just run a tesseract command like this:

tesseract imagename outputbase digits
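
The digits config is meant for numbers, so for the character set in the question a custom config file is one option. Below is a rough sketch, not taken from the FAQ: the file name alnum and both helper functions are hypothetical, and it assumes the file ends up in tessdata/configs next to digits so that Tesseract can find it by name.

```python
import os
import string
import tempfile

# Digits plus lowercase letters, as requested in the question.
WHITELIST = string.digits + string.ascii_lowercase


def write_whitelist_config(configs_dir, name, chars):
    """Write a config file containing one tessedit_char_whitelist line."""
    path = os.path.join(configs_dir, name)
    with open(path, "w", encoding="ascii") as fh:
        fh.write("tessedit_char_whitelist " + chars + "\n")
    return path


def tesseract_command(image, outputbase, config_name):
    """Build the argument list for a Tesseract 3 style invocation."""
    return ["tesseract", image, outputbase, config_name]


if __name__ == "__main__":
    # Demo in a throwaway directory; a real setup would target tessdata/configs.
    with tempfile.TemporaryDirectory() as tmp:
        config_path = write_whitelist_config(tmp, "alnum", WHITELIST)
        with open(config_path, encoding="ascii") as fh:
            print(fh.read(), end="")
    print(" ".join(tesseract_command("image.tif", "outputbase", "alnum")))
```

The argument list returned by tesseract_command can be handed to subprocess.run once the config file is in place. Keeping the whitelist in a file mirrors the digits approach from the FAQ and avoids repeating the character set on every call.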
