Subsampling an unbalanced dataset in tensorflow


Question

Tensorflow beginner here. This is my first project, and I am working with pre-defined estimators.

I have an extremely unbalanced dataset in which positive outcomes represent roughly 0.1% of the total data, and I suspect this imbalance considerably affects the performance of my model. As a first attempt at solving the issue, since I have tons of data, I would like to throw away most of my negatives in order to create a balanced dataset. I see two ways of doing it: preprocess the data to keep only a thousandth of the negatives and save the result to a new file before passing it to tensorflow (for example with pyspark), or ask tensorflow to use only one out of every thousand negatives it finds.

I tried to implement the second idea but didn't manage to. I modified my input function to read:

import numpy as np
import tensorflow as tf  # TF 1.x

def train_input_fn(data_file="../data/train_input.csv", shuffle_size=100_000, batch_size=128):
    """Generate an input function for the Estimator."""

    dataset = tf.data.TextLineDataset(data_file)  # Extract lines from input files using the Dataset API.
    dataset = dataset.map(parse_csv, num_parallel_calls=3)  # parse_csv defined elsewhere
    dataset = dataset.shuffle(shuffle_size).repeat().batch(batch_size)

    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()

    # TRY TO IMPLEMENT THE SELECTION OF NEGATIVES
    thrown = 0
    flag = np.random.randint(1000)
    while labels == 0 and flag != 0:
        features, labels = iterator.get_next()
        thrown += 1
        flag = np.random.randint(1000)
    print("I've thrown away {} negative examples before going for label {}!".format(thrown, labels))
    return features, labels

This, of course, doesn't work, because iterators don't know what's inside them: labels is a symbolic tensor rather than a concrete value, so the labels == 0 condition is never satisfied. Also, there is only one print in stdout, meaning that this function is only called once (and meaning that I still don't understand how tensorflow really works). Anyway, is there a way to implement what I want?

PS: I suspect that the previous code, even if it worked as intended, would return fewer than a thousandth of the initial negatives, because the count restarts every time a positive is found. This is a minor issue; so far I could even pick a magic number for the flag that gives me the expected result, without worrying too much about the mathematical beauty of it.
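For reference, the in-pipeline selection asked for here can be expressed without a Python loop by pushing a random keep/drop decision into `Dataset.filter`. This is a sketch with illustrative names (`keep_example`, `KEEP_PROB` are not from the question); with `KEEP_PROB = 0.001` it would keep roughly one negative in a thousand, and because the decision is independent per example it also avoids the restarting-count issue mentioned in the PS. A tiny in-memory dataset stands in for the parsed CSV:

```python
import tensorflow as tf

KEEP_PROB = 0.5  # illustration only; the question's setting would use ~0.001

def keep_example(features, label):
    """Keep every positive; keep each negative independently with prob KEEP_PROB."""
    # tf.random.uniform is tf.random_uniform in older TF 1.x releases.
    return tf.logical_or(
        tf.equal(label, 1),
        tf.less(tf.random.uniform([], 0.0, 1.0), KEEP_PROB))

# Tiny stand-in for the parsed CSV dataset: (feature, label) pairs.
features = tf.constant([0.1, 0.2, 0.3, 0.4])
labels = tf.constant([1, 0, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.filter(keep_example)  # insert before shuffle/repeat/batch
```

The predicate is re-evaluated for every element on every pass over the data, so each epoch sees a different random subset of the negatives.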

Answer

You will probably get better results by oversampling your under-represented class rather than throwing away data from your over-represented class. This way you keep the variance of the over-represented class, and you might as well use the data you have.
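A closely related option, not mentioned in the answer itself: `tf.data.experimental.sample_from_datasets` (available from TF 1.13) draws randomly from per-class datasets with given weights, and with the minority dataset repeated this oversamples it to a 50/50 mix. Small in-memory label streams stand in for the real per-class datasets:

```python
import tensorflow as tf

# Stand-ins for the per-class datasets; .repeat() lets the small positive
# class be drawn from indefinitely (i.e. oversampled).
pos = tf.data.Dataset.from_tensor_slices([1, 1, 1]).repeat()
neg = tf.data.Dataset.from_tensor_slices([0, 0, 0]).repeat()

# Each draw picks the positive or negative stream with probability 0.5.
balanced = tf.data.experimental.sample_from_datasets(
    [pos, neg], weights=[0.5, 0.5])
```

Unlike strict interleaving, this gives a randomized 50/50 mix, which can be preferable when batches should not have a fixed label pattern.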

The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
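A minimal sketch of that suggestion. In the real pipeline each class would be a TextLineDataset over a pre-split CSV file (mapped through parse_csv, with .repeat() so the smaller class cycles); here each row of a small tensor stands in for one class, keeping the example self-contained:

```python
import tensorflow as tf

# One row per class; in practice each row would be replaced by a
# TextLineDataset over that class's file, with .repeat() on the smaller one.
per_class = tf.constant([[1, 1, 1],    # "positives" (labels only, for brevity)
                         [0, 0, 0]])   # "negatives"

# cycle_length=2 keeps both per-class datasets open at once;
# block_length=1 takes one element from each in turn,
# yielding a perfectly balanced 1, 0, 1, 0, ... stream.
balanced = tf.data.Dataset.from_tensor_slices(per_class).interleave(
    lambda row: tf.data.Dataset.from_tensor_slices(row),
    cycle_length=2, block_length=1)
```

After this, .shuffle(), .repeat(), and .batch() can be applied exactly as in the original train_input_fn.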
