Number of examples in each tfrecord
Problem description
I am running the sample.sh script in Google Cloud Shell to invoke the preprocessing step below on a set of images, following the steps of the flowers example.
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py
Preprocessing completed successfully on both the eval set and the train set, but the generated .tfrecord.gz files do not seem to match the number of images in eval/train_set.csv.
For example, the name eval-00000-of-00157.tfrecord.gz seems to say there are 158 tfrecords, while eval_set.csv has 35227 rows. Every row includes a valid image_url (all of the images are uploaded to Storage) and a valid label.
Is there a way to monitor and control the number of images per tfrecord in the preprocess.py config?
Thanks
Update: I got the count to work with the following:
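import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The shards are gzip-compressed, so the reader needs GZIP options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

# 'url/path' is a placeholder for the directory that holds the shards.
pattern = os.path.join('url/path', '*.tfrecord.gz')
print(sum(1 for f in file_io.get_matching_files(pattern)
          for example in tf.python_io.tf_record_iterator(f, options=options)))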
Recommended answer
The filename eval-00000-of-00157.tfrecord.gz means that this is the first of 157 output files: the number after "of" is the total file count, and the files are numbered 00000 through 00156. There should be 156 other similarly named files. The filename says nothing about how many records a file holds; within each file, there can be any number of records.
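To make the naming convention concrete, here is a minimal sketch that splits a sharded filename into its parts and lists the sibling files it implies. It assumes the `-SSSSS-of-NNNNN` shard template that Apache Beam file sinks use by default, where the second number is the total shard count; `parse_shard_name` and `expected_shard_names` are hypothetical helper names, not part of preprocess.py.

```python
import re

# Assumed layout: <prefix>-<index>-of-<total><suffix>, with five-digit numbers,
# matching Beam's default '-SSSSS-of-NNNNN' shard name template.
SHARD_RE = re.compile(
    r'^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})(?P<suffix>.*)$')


def parse_shard_name(filename):
    """Return (prefix, shard_index, total_shards) for a sharded file name."""
    match = SHARD_RE.match(filename)
    if match is None:
        raise ValueError('not a sharded file name: %r' % filename)
    return (match.group('prefix'),
            int(match.group('index')),
            int(match.group('total')))


def expected_shard_names(filename):
    """List every file name in the shard set that `filename` belongs to."""
    match = SHARD_RE.match(filename)
    if match is None:
        raise ValueError('not a sharded file name: %r' % filename)
    total = int(match.group('total'))
    return ['%s-%05d-of-%05d%s' % (match.group('prefix'), i, total,
                                   match.group('suffix'))
            for i in range(total)]


if __name__ == '__main__':
    print(parse_shard_name('eval-00000-of-00157.tfrecord.gz'))
    print(len(expected_shard_names('eval-00000-of-00157.tfrecord.gz')))
```

Comparing the output of `expected_shard_names` with the actual directory listing is a quick way to confirm that no output file is missing.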
If you want to count the records manually, try something like:
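import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
# The shards are gzip-compressed, so the reader needs GZIP options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)
print(sum(1 for f in file_io.get_matching_files(files)
          for _ in tf.python_io.tf_record_iterator(f, options=options)))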
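Since the question asks about the number of images per file rather than only the total, a per-file count may also be useful. The following is a rough sketch that does not need TensorFlow: it walks the TFRecord framing directly (an 8-byte little-endian length, a 4-byte length checksum, the payload, and a 4-byte payload checksum). It assumes the files have been copied to local disk, and it skips the checksums instead of verifying them. `count_tfrecords_gz` and `per_file_counts` are hypothetical names introduced here.

```python
import gzip
import struct


def count_tfrecords_gz(path):
    """Count the records in one gzip-compressed TFRecord file on local disk."""
    count = 0
    with gzip.open(path, 'rb') as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError('truncated record header in %s' % path)
            (length,) = struct.unpack('<Q', header)
            # Skip the 4-byte length CRC, the payload, and the 4-byte data CRC.
            # The checksums are NOT verified in this sketch.
            body = f.read(4 + length + 4)
            if len(body) != length + 8:
                raise ValueError('truncated record body in %s' % path)
            count += 1
    return count


def per_file_counts(paths):
    """Map each local path to its record count."""
    return {path: count_tfrecords_gz(path) for path in paths}
```

Summing the values of `per_file_counts(...)` should give the same total as the one-liner above, and the individual values show how unevenly the records are spread across files.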
Note that Dataflow makes no guarantee about how output files relate to input files, either in the number of files or in the ordering of records (across files or within a file). However, the total record counts should be the same.
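To check that the totals agree, the record count can be compared with the number of rows in the input CSV. This is a minimal sketch under a few assumptions: the CSV is a local copy, it has no header row, and blank lines should be ignored. `count_csv_rows` and `report_counts` are hypothetical helper names, and the record count is whatever one of the counting snippets above produced.

```python
import csv


def count_csv_rows(csv_path):
    """Count non-empty rows in a local CSV file (assumes no header row)."""
    with open(csv_path, newline='') as f:
        return sum(1 for row in csv.reader(f) if row)


def report_counts(csv_rows, record_count):
    """Return a one-line summary comparing input rows with output records."""
    if csv_rows == record_count:
        return 'OK: %d rows and %d records' % (csv_rows, record_count)
    return 'MISMATCH: %d rows vs %d records (difference %d)' % (
        csv_rows, record_count, csv_rows - record_count)
```

For example, `report_counts(count_csv_rows('eval_set.csv'), total_records)` would flag any gap between the input rows and the records written, where `total_records` is the number printed by the counting code.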