Tesseract False Space Recognition


Question

I'm using tesseract to recognize serial numbers. This works acceptably, apart from the usual problems such as confusing zero with "O", 6 with 5, or M with H. Beyond that, tesseract inserts spaces into the recognized words where the image contains none. The following image is recognized as "HI 3H".

This image yields "FBKHJ 1R1".

So tesseract added a space, although there is no space in the image. Is there a way to parameterize tesseract's spacing behavior?

Edit

Sorry, I forgot to mention that I also have serial numbers that do contain spaces, so I cannot simply delete every space from the recognized serial number.

For example, the following image, whose serial number does contain a space, is recognized as: J4 F1583BB. Although the characters themselves are misrecognized, the space is detected correctly in this image.

My current parameters for tesseract are:

tesseract::TessBaseAPI tess;
tess.Init(NULL, "eng", tesseract::OEM_TESSERACT_ONLY);
tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
tess.SetVariable("tessedit_char_whitelist",
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345789");

char* out = tess.GetUTF8Text();
string text = string(out);

Edit

As the existing answers point out, the gap between the "J" and the "I", for example, appears slightly larger than the gaps between the other characters. The font I chose is a monospace font, because I thought this would help tesseract recognize the characters. The drawback of a monospace font, where every character occupies the same width, is that the kerning (the space between the characters) varies. See the example image from the following source.

Which font type do you think would give better recognition results?

Recommended Answer

Adjusting the parameter tosp_min_sane_kn_sp may help. That is how I solved the problem.

If that doesn't help, you can try the other tosp_* parameters, or work around the spacing logic in the source file "tospace.cpp".
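
As a rough illustration of the advice above, spacing parameters can be set through the same SetVariable call the question already uses for the whitelist. The following is only a sketch under assumptions: ApplySpacingParams is a hypothetical helper, not part of Tesseract, and the value "3.0" in the usage comment is a placeholder to tune experimentally, not a recommended setting. The helper is templated so it compiles without the Tesseract headers; it assumes only a member of the form bool SetVariable(const char*, const char*), which tesseract::TessBaseAPI provides.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Tesseract): applies name/value pairs to any
// object exposing bool SetVariable(const char*, const char*), as
// tesseract::TessBaseAPI does. Returns the names that were rejected, so a
// misspelled parameter name does not go unnoticed.
template <typename Api>
std::vector<std::string> ApplySpacingParams(
    Api& api,
    const std::vector<std::pair<std::string, std::string>>& params) {
  std::vector<std::string> rejected;
  for (const auto& p : params) {
    if (!api.SetVariable(p.first.c_str(), p.second.c_str())) {
      rejected.push_back(p.first);
    }
  }
  return rejected;
}

// Intended use with the question's setup, after tess.Init(...):
//   auto rejected = ApplySpacingParams(tess, {{"tosp_min_sane_kn_sp", "3.0"}});
//   // "3.0" is an assumed starting value to tune, not a recommendation.
```

Checking the returned list is worthwhile because SetVariable reports an unknown name only through its return value. Since some serial numbers legitimately contain spaces, any chosen value should be checked against images of both kinds.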
