Tesseract混淆了两个数字 [英] Tesseract confuses two numbers

查看:87
本文介绍了Tesseract混淆了两个数字的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在编写一个应用程序来扫描图像中的数字.

I'm writing an application to scan numbers from an image.

数字使用OCR-B字体,并且可能还包含+>字符.

The numbers are using the OCR-B font and may also contain + and > characters.

这是我的原始图片:

即使将字符集限制为提到的字符,使用Tesseract的扫描也不是很好.由于找不到用于Tesseract的OCRB培训文件,因此我决定自己进行培训.

The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCRB training files for Tesseract, I decided to train it myself.

我创建了此培训图片,并从中创建了一个盒子文件.盒子文件正确,所有字母正确匹配.

I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.

然后,我执行了此处描述的所有步骤,以创建其他必要的步骤文件.

Then I did all steps described here to create the other necessary files.

使用这个经过新训练的OCR-B tessdata集,我在源图像上获得了不错的结果,但有一个小错误:所有1都误认为8,反之亦然.用于处理图像的命令是

Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1s are mistaken for 8s and vice-versa. The command used to process the image was

$ tesseract esr2c.tif ocrb-esr2c -l ocrb

,源图像的输出为

0800000001456> 8 00000195731208 8 01050008 023+ 08 0301226> 20

如果交换所有18并将其与源图像进行比较,则输出将是正确的(除了最后两个字母,我可以忽略).

If you swap all 1s and 8s and compare it to the source image, the output would be correct (except for the last two letters which I can ignore).

这怎么可能发生?在培训过程中我是否犯了一些错误?我该如何解决?

How could this happen? Did I do some mistake in the training process? How can I fix it?

推荐答案

您的包装盒文件中某处可能具有1和8的不正确值(字符).您可以使用

It's likely that somewhere in your box file has incorrect values (characters) for 1 and 8. You can verify using jTessBoxEditor program. If so, correct, regenerate the language data file, and try again.

这篇关于Tesseract混淆了两个数字的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆