How to create a customized dataset for Google TensorFlow attention OCR?

Question

I am able to create a TFRecord file according to this question, but I don't know whether I should write all images into a single TFRecord file or create multiple TFRecord files. Also, I don't quite understand the config file for datasets. What content should be in the "charset_filename" file? Should it be a collection of all possible characters in the dataset? When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?

Answer

whether I should write all images into a single TFRecord file or create multiple TFRecord files

It depends on the size of the training data and affects the parallel prefetching used to fill the input queues. I'd recommend ~1000 samples per shard (a TFRecord file whose name ends in a num-of-total suffix, e.g. /path/to/my/dataset-00000-of-00512).
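As a rough illustration of the sharding advice, the sketch below only computes shard file names and the sample range each shard would hold; it does not write any TFRecord data. The function name `shard_paths` and its parameters are hypothetical, and the 5-digit zero padding simply mirrors the example path above.

```python
import math


def shard_paths(prefix, num_samples, samples_per_shard=1000):
    """Return (path, start, end) triples, one per shard.

    `start` and `end` are indices into the sample list (end is exclusive).
    The name pattern <prefix>-NNNNN-of-NNNNN follows the example path in
    the answer; the helper itself is an illustrative assumption.
    """
    num_shards = max(1, math.ceil(num_samples / samples_per_shard))
    shards = []
    for i in range(num_shards):
        start = i * samples_per_shard
        end = min(start + samples_per_shard, num_samples)
        path = f"{prefix}-{i:05d}-of-{num_shards:05d}"
        shards.append((path, start, end))
    return shards


# 2500 samples at 1000 per shard need three shards; the last one is partial.
for path, start, end in shard_paths("/path/to/my/dataset", 2500):
    print(path, start, end)
```

Each triple can then drive whatever writer you already use to serialize samples[start:end] into the file at `path`.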

What content should be in the "charset_filename" file?

It is a text file which defines the mapping between integer ids and the corresponding characters. It has the following format: <id><TAB><character>. One of the rows in the file should define an id for the <nul> character, a special character the model outputs once it has reached the end of a sequence, to pad the output to a fixed length.

For example, here is an excerpt from the FSNS dataset's charset file:

0    
133 <nul>
1   l
2   ’
3   é
4   t

Note that the <SPACE> character has id=0, which is why nothing visible follows the id in the first row of the excerpt.
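To make the file format concrete, here is a minimal sketch of a parser for <id><TAB><character> lines. The function name `read_charset` and the in-memory sample are illustrative assumptions, not the library's own loader; the sample rows copy the excerpt above with real tab separators.

```python
def read_charset(lines):
    """Parse '<id><TAB><character>' lines into a {id: character} dict."""
    charset = {}
    for line in lines:
        # Strip only the line ending: a trailing space may be the character itself.
        line = line.rstrip("\r\n")
        if not line:
            continue
        id_str, char = line.split("\t", 1)
        charset[int(id_str)] = char
    return charset


# Rows taken from the FSNS excerpt, written with explicit tabs.
sample = ["0\t \n", "133\t<nul>\n", "1\tl\n", "2\t’\n", "3\té\n", "4\tt\n"]
charset = read_charset(sample)
print(repr(charset[0]), charset[133])
```

For a real file, the same function can be fed an open file object, e.g. `read_charset(open(path, encoding="utf-8"))`, where `path` is wherever your charset file lives.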

Should it be a collection of all possible characters in the dataset?

Yes. This file should define id-to-character mappings for all characters in the dataset.
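One way to guarantee that every character is covered is to derive the charset from the label texts themselves. The sketch below is an assumption about how you might do that: the helper name `build_charset_lines`, the sorted id order, and giving <nul> the last id are all illustrative choices, not requirements stated in the answer.

```python
def build_charset_lines(texts, null_token="<nul>"):
    """Build '<id><TAB><character>' rows covering every character in `texts`.

    Ids are assigned in sorted character order, and the null token gets the
    next free id. Both choices are illustrative, not mandated by the model.
    """
    chars = sorted(set("".join(texts)))
    lines = [f"{i}\t{c}" for i, c in enumerate(chars)]
    lines.append(f"{len(chars)}\t{null_token}")
    return lines


# Hypothetical label texts standing in for the dataset's ground truth.
for row in build_charset_lines(["Rue de la Paix", "Avenue"]):
    print(repr(row))
```

Writing the rows out with "\n".join(...) and UTF-8 encoding produces a file in the format described above.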

When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?

Both. Each line in the file should be in the form <id><TAB><character>.
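The same mapping is what turns a label into integer ids when the TFRecord files are generated. The following is a minimal sketch of that step, assuming a hypothetical helper `encode_text`, a fixed maximum length chosen by you, and the <nul> id used as padding; the ids in the example come from the FSNS excerpt.

```python
def encode_text(text, char_to_id, null_id, max_len):
    """Return (padded_ids, unpadded_ids) for one label.

    Raises KeyError if the charset lacks a character, and ValueError if the
    text is longer than max_len. Padding with null_id mirrors the role of
    <nul> described in the answer; the helper itself is illustrative.
    """
    ids = [char_to_id[c] for c in text]
    if len(ids) > max_len:
        raise ValueError(f"text has {len(ids)} characters, max_len is {max_len}")
    padded = ids + [null_id] * (max_len - len(ids))
    return padded, ids


# Character-to-id mapping inverted from the excerpt's id-to-character rows.
char_to_id = {" ": 0, "l": 1, "’": 2, "é": 3, "t": 4}
padded, unpadded = encode_text("té l", char_to_id, null_id=133, max_len=6)
print(padded)    # [4, 3, 0, 1, 133, 133]
print(unpadded)  # [4, 3, 0, 1]
```

A KeyError here is a useful early warning that the charset file does not yet cover every character in the dataset.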
