How do I train Tesseract 4 with image data instead of a font file?

Views: 364 | Published: 2020/5/4 6:21:38 | Tags: ocr tesseract lstm training-data

Question

I'm trying to train Tesseract 4 with images instead of fonts.

The docs only explain the font-based approach, not training from images.

I know how this works with earlier versions of Tesseract, but I don't understand how to use box/tiff files to train the LSTM engine in Tesseract 4.

I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

Recommended answer

In addition to the tesstrain repo used in the steps below, you'll need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. It serves as the starting point for your training. Training an accurate model from scratch takes hundreds of thousands of samples, so starting from a good existing model lets you fine-tune with far less data (tens to hundreds of samples can be enough).

Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth

Your training samples should be image/text file pairs that share the same base name but have different extensions. For example, an image file named 001.png that is a picture of the text foobar should be paired with a text file named 001.gt.txt that contains the text foobar.

Each sample must be a single line of text: one line in the image and the same single line in its .gt.txt file.
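The pairing and single-line rules above can be checked before starting a training run. The following is a minimal sketch, assuming PNG images and the .gt.txt naming from the example; check_ground_truth is a hypothetical helper name and is not part of tesstrain.

```shell
#!/usr/bin/env bash
# Sketch: report .png samples that lack a .gt.txt transcription, and
# transcriptions that do not contain exactly one non-empty line.
# check_ground_truth is a hypothetical helper, not a tesstrain command.
check_ground_truth() {
  local dir="$1"
  local status=0
  local img gt lines
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue            # directory has no .png files
    gt="${img%.png}.gt.txt"              # 001.png -> 001.gt.txt
    if [ ! -f "$gt" ]; then
      echo "missing transcription: $gt"
      status=1
      continue
    fi
    # grep -c . counts lines holding at least one character
    lines=$(grep -c . "$gt" || true)
    if [ "$lines" -ne 1 ]; then
      echo "expected 1 line of text, found $lines: $gt"
      status=1
    fi
  done
  return "$status"
}
```

For example, after sourcing the function, `check_ground_truth ./tesstrain/data/my-custom-model-ground-truth && echo "all samples look fine"` prints the confirmation only when every image has a one-line transcription.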

In the tesstrain repo, run this command:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best

Once training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory where Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.

Then you can run tesseract and use that model as a language.

tesseract -l my-custom-model foo.png -
