How do I train Tesseract 4 with image data instead of a font file?

Question

I'm trying to train Tesseract 4 with images instead of fonts.

The docs explain only the font-based approach, not how to train from images.

I know how this works in earlier versions of Tesseract, but I can't figure out how to use box/tiff files to train the LSTM engine in Tesseract 4.

I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

Recommended answer

Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.

You'll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. It serves as the starting point for your training. Training an accurate model from scratch takes hundreds of thousands of samples, so starting from an existing model lets you fine-tune with far less data (tens to hundreds of samples can be enough).

Add your training samples to the directory ./tesstrain/data/my-custom-model-ground-truth in the tesstrain repo. The directory name is the model name followed by -ground-truth.

Your training samples should be image/text file pairs that share the same name but different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar and you should have a text file named 001.gt.txt that has the text foobar.

Each pair must contain a single line of text: the image shows one line, and the .gt.txt file holds that same line.
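
Before training, it can help to confirm that every image has a matching one-line transcription. The function below is a minimal sketch of such a check; the name check_ground_truth is made up for illustration, and it assumes the images are .png files as in the example above.

```shell
#!/bin/sh
# check_ground_truth DIR (hypothetical helper, not part of tesstrain):
# report every DIR/*.png that lacks a matching .gt.txt file, or whose
# .gt.txt does not contain exactly one non-empty line.
check_ground_truth() {
    dir=$1
    status=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue           # no .png files: the glob stayed literal
        gt="${img%.png}.gt.txt"
        if [ ! -f "$gt" ]; then
            echo "missing transcription: $gt"
            status=1
            continue
        fi
        # grep -c . counts non-empty lines; "|| true" because grep exits 1 on zero matches
        lines=$(grep -c . "$gt" || true)
        if [ "$lines" -ne 1 ]; then
            echo "expected 1 line, found $lines: $gt"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_ground_truth ./tesstrain/data/my-custom-model-ground-truth prints one line per problem and returns a non-zero status if any pair is incomplete.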

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
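
The copy step can be wrapped in a small function that refuses obviously wrong inputs. This is only a sketch: install_model is a hypothetical name, and both the model file path and the tessdata directory are passed in by the caller, since they vary by machine.

```shell
#!/bin/sh
# install_model FILE DEST_DIR (hypothetical helper):
# copy a .traineddata file into a tessdata directory and print the
# language name to pass to "tesseract -l".
install_model() {
    model_file=$1
    dest_dir=$2
    case "$model_file" in
        *.traineddata) ;;
        *) echo "not a .traineddata file: $model_file" >&2; return 1 ;;
    esac
    [ -f "$model_file" ] || { echo "no such file: $model_file" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "no such directory: $dest_dir" >&2; return 1; }
    cp "$model_file" "$dest_dir"/ || return 1
    # The language name is the file name without the .traineddata extension
    basename "$model_file" .traineddata
}
```

After copying, running tesseract --list-langs should show the new model among the available languages. Writing to a system directory such as /usr/local/share/tessdata/ may require elevated permissions.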

Then, you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -
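
To get a rough sense of how the fine-tuned model performs, you could run it over a directory of image/text pairs and count exact matches. The sketch below makes several assumptions: count_exact_matches is a made-up name, the samples are .png/.gt.txt pairs as described above, and exact string equality is a deliberately strict measure. Ideally the pairs should be samples that were not used for training.

```shell
#!/bin/sh
# count_exact_matches DIR MODEL (hypothetical helper):
# OCR each DIR/*.png with MODEL and compare the result with its .gt.txt.
count_exact_matches() {
    dir=$1
    model=$2
    total=0
    correct=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue
        gt="${img%.png}.gt.txt"
        [ -f "$gt" ] || continue
        expected=$(cat "$gt")
        # --psm 7 treats the image as a single text line;
        # tr removes the form-feed page separator Tesseract appends to text output
        actual=$(tesseract "$img" - -l "$model" --psm 7 2>/dev/null | tr -d '\f')
        total=$((total + 1))
        if [ "$actual" = "$expected" ]; then
            correct=$((correct + 1))
        else
            echo "mismatch: $img"
        fi
    done
    echo "$correct/$total exact matches"
}
```

A call such as count_exact_matches ./held-out-samples my-custom-model (the directory name is illustrative) lists each mismatching image and ends with a summary line.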
