How to create a customized dataset for Google TensorFlow attention OCR?
Question
I am able to create a TFRecord file according to this question. But I don't know whether I should write all images into a single TFRecord file or create multiple TFRecord files. Also, I don't quite understand the config file for datasets. What content should be in the "charset_filename" file? Should it be a collection of all possible characters in the dataset? When generating the TFRecord file, we converted characters to integer ids; should this file include characters or their ids?
Recommended answer
whether I should write all images into a single TFRecord file or create multiple TFRecord files
It depends on the size of the training data and affects parallel prefetching to fill the queues. I'd recommend ~1000 samples per shard (one TFRecord file per shard, with a num-of-total suffix, e.g. /path/to/my/dataset-00000-of-00512).
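As a rough sketch of that naming scheme (the helper names here are made up for illustration and are not part of the attention OCR code), the shard count and file names could be computed like this; actually writing each chunk, e.g. with tf.io.TFRecordWriter, is left out:

```python
import math

SAMPLES_PER_SHARD = 1000  # the rough target suggested above, not a hard rule


def num_shards(num_samples, per_shard=SAMPLES_PER_SHARD):
    """Number of shards needed so that no shard exceeds per_shard samples."""
    return max(1, math.ceil(num_samples / per_shard))


def shard_path(base, index, total):
    """Build a 'base-XXXXX-of-YYYYY' file name with zero-padded numbers."""
    return f"{base}-{index:05d}-of-{total:05d}"


def split_into_shards(samples, base, per_shard=SAMPLES_PER_SHARD):
    """Yield (path, chunk) pairs; each chunk would go into one TFRecord file."""
    total = num_shards(len(samples), per_shard)
    for i in range(total):
        chunk = samples[i * per_shard:(i + 1) * per_shard]
        yield shard_path(base, i, total), chunk
```

With 512,000 samples this gives 512 shards, the first of which would be named /path/to/my/dataset-00000-of-00512.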
"charset_filename"文件中应包含什么内容?
What content should be in "charset_filename" file?
It is a text file that defines the mapping between integer ids and the corresponding characters. It has the following format:
<id><TAB><character>
One of the rows in the file should define an id for the <nul> character, a special character that the model outputs after it reaches the end of the sequence, to pad the output to a fixed length.
For example, here is an excerpt from the FSNS dataset's charset file:
0
133 <nul>
1 l
2 ’
3 é
4 t
Note that the <SPACE> character has id=0 (the first row of the excerpt, where the character after the tab is an invisible space).
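To make the format concrete, here is a minimal sketch of a reader for such a file. It is not the parser used by the attention OCR code, just an illustration that assumes exactly one tab per row; parse_charset is a hypothetical helper name. Note that it strips only the line ending, so the space character for id 0 survives:

```python
def parse_charset(lines):
    """Parse '<id><TAB><character>' rows into an {id: character} dict."""
    charset = {}
    for line in lines:
        line = line.rstrip("\r\n")  # keep trailing spaces: id 0 maps to " "
        if not line:
            continue
        id_str, sep, char = line.partition("\t")
        if not sep:
            raise ValueError(f"expected <id><TAB><character>, got {line!r}")
        charset[int(id_str)] = char
    return charset


# The rows from the FSNS excerpt, written with explicit tabs.
sample = ["0\t \n", "133\t<nul>\n", "1\tl\n", "2\t’\n", "3\té\n", "4\tt\n"]
charset = parse_charset(sample)
```

Because parse_charset accepts any iterable of lines, a file opened with open(path, encoding="utf-8") can be passed in directly.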
Should it be a collection of all possible characters in the dataset?
Yes. This file should define id-to-character mappings for all characters in the dataset.
When generating the TFRecord file, we converted characters to integer ids; should this file include characters or their ids?
Both. Each line in the file should be in the form <id><TAB><character>.
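To tie this back to TFRecord generation, here is a hedged sketch of how the same mapping might be used to turn a label string into a fixed-length list of ids. The function name, the max_len value and the tiny mapping are assumptions for illustration only; the ids are taken from the FSNS excerpt above, including 133 for <nul>:

```python
def encode_text(text, char_to_id, null_id, max_len):
    """Map each character to its id and pad with null_id up to max_len."""
    if len(text) > max_len:
        raise ValueError(f"text is longer than max_len={max_len}: {text!r}")
    # A KeyError here means the charset file is missing a character
    # that occurs in the dataset.
    ids = [char_to_id[c] for c in text]
    return ids + [null_id] * (max_len - len(ids))


# Inverse of the id-to-character mapping from the excerpt (without <nul>).
char_to_id = {" ": 0, "l": 1, "’": 2, "é": 3, "t": 4}
padded = encode_text("té l", char_to_id, null_id=133, max_len=6)
# padded == [4, 3, 0, 1, 133, 133]
```

A character that is missing from the mapping raises a KeyError, which is one practical reason the charset file has to cover every character in the dataset.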