Subsampling an unbalanced dataset in tensorflow


Question

Tensorflow beginner here. This is my first project, and I am working with pre-defined estimators.

I have an extremely unbalanced dataset in which positive outcomes represent roughly 0.1% of the total data, and I suspect this imbalance considerably affects the performance of my model. As a first attempt at solving the issue, since I have tons of data, I would like to throw away most of my negatives in order to create a balanced dataset. I see two ways of doing it: preprocess the data to keep only a thousandth of the negatives and save the result to a new file before passing it to tensorflow (for example with pyspark), or ask tensorflow to use only one out of every thousand negatives it finds.

I tried to implement the second idea but didn't manage to. I modified my input function to read:

import numpy as np
import tensorflow as tf  # TF 1.x

def train_input_fn(data_file="../data/train_input.csv", shuffle_size=100_000, batch_size=128):
    """Generate an input function for the Estimator."""

    dataset = tf.data.TextLineDataset(data_file)  # Extract lines from input files using the Dataset API.
    dataset = dataset.map(parse_csv, num_parallel_calls=3)  # parse_csv defined elsewhere
    dataset = dataset.shuffle(shuffle_size).repeat().batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()

    # TRY TO IMPLEMENT THE SELECTION OF NEGATIVES
    thrown = 0
    flag = np.random.randint(1000)
    while labels == 0 and flag != 0:
        features, labels = iterator.get_next()
        thrown += 1
        flag = np.random.randint(1000)
    print("I've thrown away {} negative examples before going for label {}!".format(thrown, labels))
    return features, labels

This, of course, doesn't work, because iterators don't know what's inside them: labels is a symbolic tensor rather than a concrete value, so the labels == 0 condition is never satisfied. Also, there is only one print in stdout, meaning that this function is only called once (and meaning that I still don't understand how tensorflow really works). Anyway, is there a way to implement what I want?

PS: I suspect that the previous code, even if it worked as intended, would return fewer than a thousandth of the initial negatives, because the count restarts every time a positive is found. This is a minor issue; so far I could even pick a magic number for the flag that gives me the expected result, without worrying too much about the mathematical beauty of it.
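For reference, the in-pipeline selection asked for here can be expressed without a Python loop by pushing a random keep/drop decision into `Dataset.filter`. This is a sketch with illustrative names (`keep_example`, `KEEP_PROB` are not from the question); with `KEEP_PROB = 0.001` it would keep roughly one negative in a thousand, and because the decision is independent per example it also avoids the restarting-count issue mentioned in the PS. A tiny in-memory dataset stands in for the parsed CSV:

```python
import tensorflow as tf

KEEP_PROB = 0.5  # illustration only; the question's setting would use ~0.001

def keep_example(features, label):
    """Keep every positive; keep each negative independently with prob KEEP_PROB."""
    # tf.random.uniform is tf.random_uniform in older TF 1.x releases.
    return tf.logical_or(
        tf.equal(label, 1),
        tf.less(tf.random.uniform([], 0.0, 1.0), KEEP_PROB))

# Tiny stand-in for the parsed CSV dataset: (feature, label) pairs.
features = tf.constant([0.1, 0.2, 0.3, 0.4])
labels = tf.constant([1, 0, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.filter(keep_example)  # insert before shuffle/repeat/batch
```

The predicate is re-evaluated for every element on every pass over the data, so each epoch sees a different random subset of the negatives.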

Answer

You will probably get better results by oversampling your under-represented class rather than throwing away data from your over-represented class. This way you keep the variance of the over-represented class, and you might as well use the data you have.
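A closely related option, not mentioned in the answer itself: `tf.data.experimental.sample_from_datasets` (available from TF 1.13) draws randomly from per-class datasets with given weights, and with the minority dataset repeated this oversamples it to a 50/50 mix. Small in-memory label streams stand in for the real per-class datasets:

```python
import tensorflow as tf

# Stand-ins for the per-class datasets; .repeat() lets the small positive
# class be drawn from indefinitely (i.e. oversampled).
pos = tf.data.Dataset.from_tensor_slices([1, 1, 1]).repeat()
neg = tf.data.Dataset.from_tensor_slices([0, 0, 0]).repeat()

# Each draw picks the positive or negative stream with probability 0.5.
balanced = tf.data.experimental.sample_from_datasets(
    [pos, neg], weights=[0.5, 0.5])
```

Unlike strict interleaving, this gives a randomized 50/50 mix, which can be preferable when batches should not have a fixed label pattern.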

The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
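A minimal sketch of that suggestion. In the real pipeline each class would be a TextLineDataset over a pre-split CSV file (mapped through parse_csv, with .repeat() so the smaller class cycles); here each row of a small tensor stands in for one class, keeping the example self-contained:

```python
import tensorflow as tf

# One row per class; in practice each row would be replaced by a
# TextLineDataset over that class's file, with .repeat() on the smaller one.
per_class = tf.constant([[1, 1, 1],    # "positives" (labels only, for brevity)
                         [0, 0, 0]])   # "negatives"

# cycle_length=2 keeps both per-class datasets open at once;
# block_length=1 takes one element from each in turn,
# yielding a perfectly balanced 1, 0, 1, 0, ... stream.
balanced = tf.data.Dataset.from_tensor_slices(per_class).interleave(
    lambda row: tf.data.Dataset.from_tensor_slices(row),
    cycle_length=2, block_length=1)
```

After this, .shuffle(), .repeat(), and .batch() can be applied exactly as in the original train_input_fn.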
