Tesseract混淆了两个数字 [英] Tesseract confuses two numbers
问题描述
我正在编写一个应用程序来扫描图像中的数字.
I'm writing an application to scan numbers from an image.
数字使用OCR-B字体,并且可能还包含+
和>
字符.
The numbers are using the OCR-B font and may also contain +
and >
characters.
这是我的原始图片:
即使将字符集限制为提到的字符,使用Tesseract的扫描也不是很好.由于找不到用于Tesseract的OCRB培训文件,因此我决定自己进行培训.
The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCRB training files for Tesseract, I decided to train it myself.
我创建了此培训图片,并从中创建了一个盒子文件.盒子文件正确,所有字母正确匹配.
I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.
然后,我执行了此处描述的所有步骤,以创建其他必要的步骤文件.
Then I did all steps described here to create the other necessary files.
使用这个经过新训练的OCR-B tessdata集,我在源图像上获得了不错的结果,但有一个小错误:所有1
都误认为8
,反之亦然.用于处理图像的命令是
Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1
s are mistaken for 8
s and vice-versa. The command used to process the image was
$ tesseract esr2c.tif ocrb-esr2c -l ocrb
,源图像的输出为
0800000001456> 8 00000195731208 8 01050008 023+ 08 0301226> 20
如果交换所有1
和8
并将其与源图像进行比较,则输出将是正确的(除了最后两个字母,我可以忽略).
If you swap all 1
s and 8
s and compare it to the source image, the output would be correct (except for the last two letters which I can ignore).
这怎么可能发生?在培训过程中我是否犯了一些错误?我该如何解决?
How could this happen? Did I do some mistake in the training process? How can I fix it?
推荐答案
您的包装盒文件中某处可能具有1和8的不正确值(字符).您可以使用
It's likely that somewhere in your box file has incorrect values (characters) for 1 and 8. You can verify using jTessBoxEditor program. If so, correct, regenerate the language data file, and try again.
这篇关于Tesseract混淆了两个数字的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!