Oversampling functionality in Tensorflow dataset API


Problem description


I would like to ask whether the current datasets API allows for implementing an oversampling algorithm. I deal with a highly imbalanced class problem. I was thinking that it would be nice to oversample specific classes during dataset parsing, i.e. online generation. I've seen the implementation of the rejection_resample function; however, it removes samples instead of duplicating them, and it slows down batch generation (when the target distribution is much different from the initial one). The thing I would like to achieve is: take an example, look at its class probability, and decide whether or not to duplicate it. Then call dataset.shuffle(...) and dataset.batch(...) and get an iterator. The best (in my opinion) approach would be to oversample the low-probability classes and subsample the most probable ones. I would like to do it online since that is more flexible.
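
For reference, this is roughly how that transformation is applied (a minimal sketch using the tf.contrib.data API of that era, with made-up toy labels); it reaches the target distribution by dropping examples, which is what I want to avoid:

import tensorflow as tf

# toy, heavily imbalanced label stream (values made up for illustration)
labels = tf.constant([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices(labels).repeat()

# rebalance towards a uniform distribution by rejecting (dropping) examples
dataset = dataset.apply(tf.contrib.data.rejection_resample(
    class_func=lambda label: label,  # maps an element to its class id
    target_dist=[0.5, 0.5],
))
# elements are now (class_id, original_element) pairs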

Solution

This problem has been solved in issue #14451. Just posting the answer here to make it more visible to other developers.

The sample code below oversamples low-frequency classes and undersamples high-frequency ones, where class_target_prob is just a uniform distribution in my case. I wanted to check some conclusions from the recent manuscript A systematic study of the class imbalance problem in convolutional neural networks.

The oversampling of specific classes is done by calling:

dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)
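
To make the mechanics concrete, here is a toy sketch (made-up data, not from the original answer) of the flat_map + repeat pattern: each element becomes a one-element dataset repeated n times, so it appears n consecutive times in the stream.

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.constant([1, 2, 3], dtype=tf.int64))
# repeat each element x exactly x times: the stream becomes 1, 2, 2, 3, 3, 3
ds = ds.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(x))

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    try:
        while True:
            print(sess.run(next_element))  # 1, 2, 2, 3, 3, 3
    except tf.errors.OutOfRangeError:
        pass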

Here is the full snippet that puts everything together:

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of given example
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    # soften the ratio; if oversampling_coef==0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef 
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1) 
    # for low probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means that there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count  # a number between 0 and 1
    residual_acceptance = tf.less_equal(
                        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )

    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)

    return repeat_count + residual_acceptance


def undersampling_filter(example):
    """
    Computes whether the given example is rejected or not.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob/class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)

    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)

    return acceptance


dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)

dataset = dataset.filter(undersampling_filter)

dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)

sess.run(tf.global_variables_initializer())

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
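
Note that the snippet assumes every example is a dict carrying its own class_prob and class_target_prob. One way to attach them from a known label histogram is sketched below (attach_probs and the toy numbers are my own illustration, not part of the original answer):

import tensorflow as tf

num_classes = 10
# empirical per-class frequencies, e.g. counted over the training set (made up here)
class_probs = tf.constant([0.5, 0.3] + [0.025] * 8, dtype=tf.float32)
target_probs = tf.fill([num_classes], 1.0 / num_classes)  # uniform target

def attach_probs(features, label):
    """Wraps a (features, label) pair into the dict format used above."""
    return {
        'features': features,
        'label': label,
        'class_prob': class_probs[label],
        'class_target_prob': target_probs[label],
    }

# dataset = dataset.map(attach_probs)  # before the flat_map/filter pipeline above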

Update #1

Here is a simple Jupyter notebook which implements the above oversampling/undersampling on a toy model.
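
As an aside for anyone on TF 2.x: make_one_shot_iterator, tf.random_uniform and sessions no longer exist there, so after switching to tf.random.uniform inside both functions the tail of the pipeline becomes roughly the following (a migration sketch, not part of the original answer):

dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)
dataset = dataset.filter(undersampling_filter)
dataset = dataset.repeat().shuffle(2048).batch(32)

for next_element in dataset:
    ...  # train step, iterating eagerly instead of via sess.run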
