How to create a customized dataset for Google TensorFlow Attention OCR?

Views: 153 | Published: 2020/5/19 19:38:31 | Tags: python tensorflow ocr


Question

I am able to create a TFRecord file according to this question, but I don't know whether I should write all images into a single TFRecord file or create multiple TFRecord files. I also don't quite understand the config file for datasets. What content should be in the "charset_filename" file? Should it be a collection of all possible characters in the dataset? When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?

Recommended answer

Whether I should write all images into a single TFRecord file or create multiple TFRecord files

It depends on the size of the training data, and it affects the parallel prefetching that fills the input queues. I'd recommend ~1000 samples per shard, where a shard is a TFRecord file with a num-of-total suffix, e.g. /path/to/my/dataset-00000-of-00512.
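As a rough illustration of that sharding scheme, the sketch below splits a list of samples into chunks of about 1000 and builds the matching num-of-total file names. The helper names `split_into_shards` and `shard_paths` are hypothetical and not part of the Attention OCR code; actually writing each shard (for example with `tf.io.TFRecordWriter`) is left as a comment so the snippet stays dependency-free.

```python
import math


def split_into_shards(samples, samples_per_shard=1000):
    """Split samples into consecutive chunks of at most samples_per_shard items."""
    num_shards = max(1, math.ceil(len(samples) / samples_per_shard))
    return [
        samples[i * samples_per_shard:(i + 1) * samples_per_shard]
        for i in range(num_shards)
    ]


def shard_paths(base_path, num_shards):
    """Build file names such as base-00000-of-00512, one per shard."""
    return [f"{base_path}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


# Placeholder data: 2500 dummy samples stand in for (image, label) pairs.
samples = list(range(2500))
shards = split_into_shards(samples)
paths = shard_paths("/path/to/my/dataset", len(shards))

for path, shard in zip(paths, shards):
    # A real pipeline would open a TFRecord writer on `path` here
    # and serialize every sample in `shard` into it.
    print(path, len(shard))
```

With 2500 samples this yields three shards, the last one holding the remaining 500 samples.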

What content should be in the "charset_filename" file?

It is a text file that defines the mapping between integer ids and the corresponding characters. Each line has the format <id><TAB><character>. One of the rows in the file should define an id for the <nul> character, a special character that the model outputs once it has reached the end of the sequence in order to pad the output to a fixed length.

For example, here is an excerpt from the FSNS dataset's charset file:

0    
133 <nul>
1   l
2   ’
3   é
4   t

Note that the <SPACE> character has id=0.
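To make the file format concrete, here is a minimal sketch of a parser for such a file. It assumes tab-separated lines exactly as described above; the function name `read_charset` and the in-memory sample are illustrative only and are not claimed to match the reader used by the Attention OCR code. The one subtle point is that lines must not be stripped of spaces, because the space character is itself a valid entry.

```python
import io


def read_charset(lines):
    """Parse '<id><TAB><character>' lines into an {id: character} dict."""
    charset = {}
    for line in lines:
        line = line.rstrip("\n")  # drop only the newline; a space is a valid character
        if not line:
            continue
        code, char = line.split("\t", 1)
        charset[int(code)] = char
    return charset


# In-memory stand-in for the excerpt above (real files would be opened with
# open(path, encoding="utf-8")).
sample = "0\t \n133\t<nul>\n1\tl\n2\t’\n3\té\n4\tt\n"
charset = read_charset(io.StringIO(sample))

print(charset[0] == " ")  # the space character has id 0
print(charset[133])       # the padding token
```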

Should it be a collection of all possible characters in the dataset?

Yes. This file should define id-to-character mappings for all characters in the dataset.
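One way to guarantee that coverage is to derive the charset from the label strings themselves. The sketch below is an assumption about how this could be done, not the procedure used to build the FSNS file: it collects every distinct character, assigns ids in sorted order, and appends a `<nul>` entry with the next free id. The helper names and the two example labels are hypothetical.

```python
def build_charset(labels, null_token="<nul>"):
    """Map ids to every distinct character in labels, plus a padding token."""
    chars = sorted(set("".join(labels)))
    mapping = {i: ch for i, ch in enumerate(chars)}
    mapping[len(chars)] = null_token  # id choice for <nul> is arbitrary here
    return mapping


def charset_lines(mapping):
    """Render the mapping as '<id><TAB><character>' lines."""
    return [f"{i}\t{ch}" for i, ch in sorted(mapping.items())]


# Hypothetical ground-truth labels for two images.
labels = ["rue de la paix", "avenue foch"]
mapping = build_charset(labels)
lines = charset_lines(mapping)

# Writing the file is then a single call, for example:
#   open("charset.txt", "w", encoding="utf-8").write("\n".join(lines) + "\n")
print(len(mapping))
```

Because the space character sorts before letters, it ends up with id 0 in this example, which happens to match the FSNS convention noted earlier.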

When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?

Both. Each line in the file should be in the form <id><TAB><character>.
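The same mapping can then be inverted to turn label text into ids when generating records. The following sketch assumes labels are padded to a fixed length with the `<nul>` id, mirroring how the answer describes the model's padded output; `encode_label` and `max_len` are illustrative names, and the ids reuse the FSNS excerpt shown earlier.

```python
def encode_label(text, char_to_id, null_id, max_len):
    """Convert text to a list of ids, padded with null_id up to max_len."""
    if len(text) > max_len:
        raise ValueError(f"label {text!r} is longer than max_len={max_len}")
    ids = [char_to_id[ch] for ch in text]  # KeyError means the charset is incomplete
    return ids + [null_id] * (max_len - len(ids))


# Mapping taken from the charset excerpt above.
id_to_char = {0: " ", 1: "l", 2: "’", 3: "é", 4: "t", 133: "<nul>"}
char_to_id = {ch: i for i, ch in id_to_char.items()}
null_id = char_to_id["<nul>"]

padded = encode_label("t l", char_to_id, null_id, max_len=6)
print(padded)  # [4, 0, 1, 133, 133, 133]
```

A `KeyError` during encoding is a useful signal that the charset file is missing a character that occurs in the dataset.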
