Tesseract: Advantage to Multi-Page Training File vs. Multiple Separate Files?

Question

This SO answer suggests that training Tesseract with .tif files has an advantage over .png files, because a .tif file can hold multiple pages and thus provide a larger training sample. Yet this SO question discusses procedures for training with multiple images at once. Moreover, the man page for mftraining, for example, suggests that it can accept multiple training files.

Is there any reason then not to train with multiple separate image files?

Recommended answer

Using multiple images to train Tesseract on a single font seems to work just fine. Below is a sketch of the workflow I employ:

# Convert the .pdf pages to .png images
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png

# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox

## correct boxes with jTessBoxEditor or another box editor ##

# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr

tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng1 nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng1 nobatch box.train.stderr

# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box 

# font_properties line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
echo "MyNewFont 0 0 0 0 0" >> font_properties

# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr 
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr

## rename files
mv inttemp  eng1.inttemp
mv normproto  eng1.normproto
mv pffmtable  eng1.pffmtable
mv shapetable  eng1.shapetable

combine_tessdata eng1. ## create .traineddata file.
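
The two-page workflow above repeats each command once per page, so it can be generalized with a loop. What follows is only a minimal sketch under stated assumptions, not part of the original answer: the variable names (LANG_CODE, FONT, RUN_TRAINING, PAGES) and the helper functions (page_base, page_files, train_pages) are hypothetical, and the tesseract, unicharset_extractor, mftraining and cntraining invocations are copied unchanged from the commands above. It assumes the corrected .box files already sit next to the .png images.

```shell
#!/bin/sh
# Sketch: run the training steps above for pages 1..N instead of exactly two.
# All variable and function names here are illustrative, not part of Tesseract.
set -eu

LANG_CODE="eng1"
FONT="MyNewFont"

# Print the base name for page N, e.g. eng1.MyNewFont.exp1
page_base() {
    printf '%s.%s.exp%d\n' "$LANG_CODE" "$FONT" "$1"
}

# Print the file names for pages 1..N with the given suffix on one line,
# separated by spaces (safe here because the names contain no whitespace).
page_files() {
    count="$1"; suffix="$2"; i=1; out=""
    while [ "$i" -le "$count" ]; do
        out="$out${out:+ }$(page_base "$i")$suffix"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Run the box.train, unicharset, mftraining and cntraining steps for N pages.
train_pages() {
    count="$1"; i=1
    while [ "$i" -le "$count" ]; do
        base="$(page_base "$i")"
        tesseract "$base.png" "$base.box" -l "$LANG_CODE" nobatch box.train.stderr
        i=$((i + 1))
    done
    # Intentionally unquoted below: each file name must be its own argument.
    unicharset_extractor $(page_files "$count" ".box")
    mftraining -F font_properties -U unicharset -O "$LANG_CODE.unicharset" \
        $(page_files "$count" ".box.tr")
    cntraining $(page_files "$count" ".box.tr")
}

# Opt-in guard (hypothetical): only train when RUN_TRAINING=1 is set.
if [ "${RUN_TRAINING:-0}" = "1" ]; then
    train_pages "${PAGES:-2}"
fi
```

Saved as, say, train.sh, this could be run with `RUN_TRAINING=1 PAGES=2 sh train.sh` to reproduce the two-page case; the rename and combine_tessdata steps above would still follow.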
