使用分层抽样的 tensorflow 数据集 [英] Using tensorflow dataset with stratified sampling

查看:58
本文介绍了使用分层抽样的 tensorflow 数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个 tensorflow 数据集

Given a tensorflow dataset

Train_dataset = tf.data.Dataset.from_tensor_slices((Train_Image_Filenames,Train_Image_Labels))
Train_dataset = Train_dataset.map(Parse_JPEG_Augmented)
...

我想对我的批次进行分层以处理类别不平衡问题.我找到了 tf.contrib.training.stratified_sample 并认为我可以通过以下方式使用它:

I would like to stratify my batches to deal with class imbalance. I found tf.contrib.training.stratified_sample and thought I could use it in the following way:

Train_dataset_iter = Train_dataset.make_one_shot_iterator()
Train_dataset_Image_Batch,Train_dataset_Label_Batch = Train_dataset_iter.get_next()
Train_Stratified_Images,Train_Stratified_Labels = tf.contrib.training.stratified_sample(Train_dataset_Image_Batch,Train_dataset_Label_Batch,[1/Classes]*Classes,Batch_Size)

但它给出了以下错误,我不确定这是否能让我保持 tensorflow 数据集的性能优势,因为我可能不得不通过 Train_Stratified_ImagesTrain_Stratified_Labels 通过 feed_dict ?

But it gives the following error and I'm not sure that this would allow me to keep the performance benefits of tensorflow dataset as I may have then have to pass Train_Stratified_Images and Train_Stratified_Labels via feed_dict ?

File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/training/python/training/sampling_ops.py", line 192, in stratified_sample
with ops.name_scope(name, 'stratified_sample', list(tensors) + [labels]):
File "/xxx/xxx/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 459, in __iter__
"Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.

使用具有分层批次的数据集的最佳实践"方法是什么?

What would be the "best practice" way of using dataset with stratified batches?

推荐答案

下面通过一个简单的例子来演示sample_from_datasets(感谢@Agade 的想法).

Here is below a simple example to demonstrate the usage of sample_from_datasets (thanks @Agade for the idea).

import math
import tensorflow as tf
import numpy as np


def print_dataset(name, dataset):
    elems = np.array([v.numpy() for v in dataset])
    print("Dataset {} contains {} elements :".format(name, len(elems)))
    print(elems)


def combine_datasets_balanced(dataset_smaller, size_smaller, dataset_bigger, size_bigger, batch_size):
    ds_smaller_repeated = dataset_smaller.repeat(count=int(math.ceil(size_bigger / size_smaller)))
    # we repeat the smaller dataset so that the 2 datasets are about the same size
    balanced_dataset = tf.data.experimental.sample_from_datasets([ds_smaller_repeated, dataset_bigger], weights=[0.5, 0.5])
    # each element in the resulting dataset is randomly drawn (without replacement) from dataset even with proba 0.5 or from odd with proba 0.5
    balanced_dataset = balanced_dataset.take(2 * size_bigger).batch(batch_size)
    return balanced_dataset


N, M = 3, 10
even = tf.data.Dataset.range(0, 2 * N, 2).repeat(count=int(math.ceil(M / N)))
odd = tf.data.Dataset.range(1, 2 * M, 2)
even_odd = combine_datasets_balanced(even, N, odd, M, 2)

print_dataset("even", even)
print_dataset("odd", odd)
print_dataset("even_odd_all", even_odd)

Output :

Dataset even contains 12 elements :  # 12 = 4 x N  (because of .repeat)
[0 2 4 0 2 4 0 2 4 0 2 4]
Dataset odd contains 10 elements :
[ 1  3  5  7  9 11 13 15 17 19]
Dataset even_odd contains 10 elements :  # 10 = 2 x M / 2  (2xM because of .take(2 * M) and /2 because of .batch(2))
[[ 0  2]
 [ 1  4]
 [ 0  2]
 [ 3  4]
 [ 0  2]
 [ 4  0]
 [ 5  2]
 [ 7  4]
 [ 0  9]
 [ 2 11]] 

这篇关于使用分层抽样的 tensorflow 数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆