Adding New Fonts to Tesseract 3
Question
I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm having some problems.
Here's what I've done so far:
Create training document
convert eng.myfont.exp0.pdf eng.myfont.exp0.tif
Train Tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox
This created my eng.myfont.exp0.box file.
I opened the file with moshpytt and made sure the boxes were detected correctly.
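Alongside the visual check in moshpytt, a quick structural check of the box file can catch malformed lines before training. The sketch below assumes the Tesseract 3 box format of six space-separated fields per line (character, left, bottom, right, top, page). `check_box_file` is a hypothetical helper name, not something that ships with Tesseract.

```shell
#!/bin/sh
# check_box_file: hypothetical helper, not part of Tesseract.
# Assumes each box line is "<char> <left> <bottom> <right> <top> <page>".
check_box_file() {
    awk '
        {
            # A well-formed line has 6 fields, and fields 2-6 are
            # non-negative integers (coordinates and page number).
            ok = (NF == 6)
            for (i = 2; ok && i <= 6; i++)
                if ($i !~ /^[0-9]+$/) ok = 0
            if (ok) good++
            else printf "line %d looks malformed: %s\n", NR, $0
        }
        END { printf "%d well-formed boxes\n", good + 0 }
    ' "$1"
}
```

It could be run as `check_box_file eng.myfont.exp0.box`. It only checks the line layout, not whether each character label is actually right, so it does not replace reviewing the boxes by eye.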
Feed the box file back into tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr
I get this result:
Tesseract Open Source OCR Engine v3.03 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 146
Found 146 good blobs.
TRAINING ... Font name = myfont.exp0
Generated training data for 6 words
The files eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt were generated.
I'm using:
- tesseract 3.03
- leptonica-1.70
- libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
- Ubuntu 14.04.1 LTS
Try to detect the character set used in the box file (this is where I get stuck)
unicharset_extractor *.box
Result:
unicharset_extractor: command not found
I also tried unicharset_extractor eng.myfont.exp0.box
with the same result.
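"command not found" means the shell cannot locate the program on PATH at all, so the arguments passed to it make no difference. A minimal sketch for checking this up front follows. `missing_tools` is a hypothetical helper, and which training programs a given package ships is an assumption to verify on your own system.

```shell
#!/bin/sh
# missing_tools: hypothetical helper; prints each named command
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command resolves.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools unicharset_extractor mftraining cntraining combine_tessdata` would list whichever of those Tesseract 3 training programs are absent. Empty output means all of them were found.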
Recommended answer
The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04. So either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.