How to mix unbalanced Datasets to reach a desired distribution per label?


Problem description

I am running my neural network on ubuntu 16.04, with 1 GPU (GTX 1070) and 4 CPUs.

My dataset contains around 35,000 images, but the dataset is not balanced: class 0 accounts for 90%, and classes 1-4 share the other 10%. Therefore I over-sample classes 1-4 by using dataset.repeat(class_weight) [I also use a function to apply random augmentation], and then concatenate them.

The re-sampling strategy is:

1) At the very beginning, class_weight[n] will be set to a large number so that each class will have the same number of images as class 0.

2) As training goes on and the number of epochs increases, the weights drop according to the epoch number, so that the distribution becomes closer to the actual distribution (a sketch of one possible schedule is shown below).
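For illustration, here is a minimal sketch of what such a decaying schedule could look like. The helper name compute_class_weights, the linear decay and the example class counts are assumptions made for the example; the question only states that the repeat factors start large enough to balance class 0 and shrink toward 1 as the epoch number grows.

# Hypothetical sketch of an epoch-dependent over-sampling schedule.
# The linear decay and the counts below are assumptions made for the example.
def compute_class_weights(class_counts, epoch, total_epochs):
    majority = max(class_counts)
    weights = []
    for count in class_counts:
        start = majority // count  # repeat factor that balances this class against class 0
        # linearly decay from `start` down to 1 as training progresses
        decayed = start - (start - 1) * min(epoch / float(total_epochs), 1.0)
        weights.append(max(1, int(round(decayed))))
    return weights

# e.g. with 31,500 class-0 images and ~875 images per minority class:
#   epoch 0         -> [1, 36, 36, 36, 36]  (roughly balanced)
#   epoch 30 (last) -> [1, 1, 1, 1, 1]      (actual distribution)
class_weight = compute_class_weights([31500, 875, 875, 875, 875], epoch=0, total_epochs=30)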

Because my class_weight will vary epoch by epoch, I can't shuffle the whole dataset at the very beginning. Instead, I have to take in data class by class, and shuffle the whole dataset after I concatenate the over-sampled data from each class. And, in order to achieve balanced batches, I have to shuffle the whole dataset element-wise.

The following is part of my code.

def my_estimator_func():
    d0 = tf.data.TextLineDataset(train_csv_0).map(_parse_csv_train)
    d1 = tf.data.TextLineDataset(train_csv_1).map(_parse_csv_train)
    d2 = tf.data.TextLineDataset(train_csv_2).map(_parse_csv_train)
    d3 = tf.data.TextLineDataset(train_csv_3).map(_parse_csv_train)
    d4 = tf.data.TextLineDataset(train_csv_4).map(_parse_csv_train)
    d1 = d1.repeat(class_weight[1])
    d2 = d2.repeat(class_weight[2])
    d3 = d3.repeat(class_weight[3])
    d4 = d4.repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)    
    dataset = dataset.shuffle(180000) # <- This is where the issue comes from
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()  
    feature, label = iterator.get_next()
    return feature, label

def _parse_csv_train(line):
    parsed_line= tf.decode_csv(line, [[""], []])
    filename = parsed_line[0]
    label = parsed_line[1]
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    # my_random_augmentation_func will apply random augmentation on the image. 
    image_aug = my_random_augmentation_func(image_decoded)
    image_resized = tf.image.resize_images(image_aug, image_resize)
    return image_resized, label

To make it clear, let me describe why I am facing this issue step by step:

  1. Because classes in my dataset are not balanced, I want to over-sample those minority classes.

  2. Because of 1., I want to apply random augmentation on those classes and concatenate the majority class (class 0) with them.

  3. After doing some research, I found that repeat() will generate different results on each pass if there is a random function in the pipeline, so I use repeat() along with my_random_augmentation_func to achieve 2. (see the sketch right after this list).

  4. Now, having achieved 2., I want to combine all the datasets, so I use concatenate().

  5. After 4., I now face an issue: there are around 40,000 - 180,000 images in total (because class_weight changes epoch by epoch, there will be 180,000 images at the beginning and about 40,000 at the end). Since they are concatenated class by class, the dataset looks like [0000-1111-2222-3333-4444]; therefore, with batch size 100 and no shuffling, there will almost always be only one class in each batch, which means the distribution within each batch will be imbalanced.

  6. In order to solve the "imbalanced batch" issue in 5., I come up with the idea of shuffling the whole dataset, thus I use shuffle(180000).

  7. And finally, boom, my computer freezes when it comes to shuffling 180,000 items in the dataset.
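As a side note on point 3, here is a minimal TF 1.x sketch (dummy data, made up for illustration) showing that a random op inside map() is re-executed on every pass, so repeat() after such a map produces differently augmented copies of each element:

import tensorflow as tf

# The random op inside map() runs again on every pass,
# so each repetition of an element gets a fresh random "augmentation".
ds = tf.data.Dataset.range(3)
ds = ds.map(lambda x: tf.cast(x, tf.float32) + tf.random_uniform([]))
ds = ds.repeat(2)  # every element appears twice, with different random offsets

iterator = ds.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for _ in range(6):
        print(sess.run(next_element))  # the two copies of each element differ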

So, is there a better way that I can get balanced batches, but still keep the characteristics I want (e.g. changing the distribution epoch by epoch)?

--- Edit: Issue solved ---

It turned out that I should not apply the map function at the beginning. I should take in only the filenames instead of the actual files, shuffle the filenames, and then map them to the actual images.

More specifically, delete the map(_parse_csv_train) part after d0 = tf.data.TextLineDataset(train_csv_0) and the other four lines, and add a new line dataset = dataset.map(_parse_csv_train) after shuffle(180000), as sketched below.
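A minimal sketch of the reworked my_estimator_func after that change, reusing the same variables (train_csv_*, class_weight, _parse_csv_train) as in the code above:

def my_estimator_func():
    # Take in only the CSV lines (filename + label); no image decoding yet.
    d0 = tf.data.TextLineDataset(train_csv_0)
    d1 = tf.data.TextLineDataset(train_csv_1)
    d2 = tf.data.TextLineDataset(train_csv_2)
    d3 = tf.data.TextLineDataset(train_csv_3)
    d4 = tf.data.TextLineDataset(train_csv_4)
    d1 = d1.repeat(class_weight[1])
    d2 = d2.repeat(class_weight[2])
    d3 = d3.repeat(class_weight[3])
    d4 = d4.repeat(class_weight[4])
    dataset = d0.concatenate(d1).concatenate(d2).concatenate(d3).concatenate(d4)
    dataset = dataset.shuffle(180000)        # shuffling short text lines is cheap
    dataset = dataset.map(_parse_csv_train)  # decode images only after shuffling
    dataset = dataset.batch(100)
    iterator = dataset.make_one_shot_iterator()
    feature, label = iterator.get_next()
    return feature, label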

I also want to say thank you to @P-Gn; the blog link in his "shuffling" part is really helpful. It answered a question that was in my mind but I didn't ask: "Can I get similar randomness by using many small shuffles vs. one large shuffle?" (I'm not gonna give an answer here, check that blog!) The method in that blog might also be a potential solution to this issue, but I haven't tried it out.

Solution

I would suggest relying on tf.contrib.data.choose_from_datasets, with labels picked by a tf.multinomial distribution. The advantage of this, compared to other approaches based on sample rejection, is that you do not lose I/O bandwidth reading unused samples.

Here is a working example on a case similar to yours, with a dummy dataset:

import tensorflow as tf

# create dummy datasets
class_num_samples = [900, 25, 25, 25, 25]
class_start = [0, 1000, 2000, 3000, 4000]
ds = [
  tf.data.Dataset.range(class_start[0], class_start[0] + class_num_samples[0]),
  tf.data.Dataset.range(class_start[1], class_start[1] + class_num_samples[1]),
  tf.data.Dataset.range(class_start[2], class_start[2] + class_num_samples[2]),
  tf.data.Dataset.range(class_start[3], class_start[3] + class_num_samples[3]),
  tf.data.Dataset.range(class_start[4], class_start[4] + class_num_samples[4])
]

# pick from dataset according to a parameterizable distribution
class_relprob_ph = tf.placeholder(tf.float32, shape=len(class_num_samples))
pick = tf.data.Dataset.from_tensor_slices(
  tf.multinomial(tf.log(class_relprob_ph)[None], max(class_num_samples))[0])
ds = tf.contrib.data.choose_from_datasets(ds, pick).repeat().batch(20)

iterator = ds.make_initializable_iterator()
batch = iterator.get_next()

with tf.Session() as sess:
  # choose uniform distribution
  sess.run(iterator.initializer, feed_dict={class_relprob_ph: [1, 1, 1, 1, 1]})
  print(batch.eval())
# [   0 1000 1001    1 3000 4000 3001 4001    2    3 1002 1003 2000    4    5 2001 3002 1004    6 2002]

  # now follow input distribution
  sess.run(iterator.initializer, feed_dict={class_relprob_ph: class_num_samples})
  print(batch.eval())
# [   0    1 4000    2    3    4    5 3000    6    7    8    9 2000   10   11   12   13 4001   14   15]

Note that the length of an "epoch" is now defined by the length of the multinomial sampling. I have set it somewhat arbitrarily to max(class_num_samples) here — there is indeed no good choice for a definition of an epoch when you start mixing datasets of different lengths.

However, there is a concrete reason to have it at least as large as the largest dataset: as you noticed, calling iterator.initializer restarts the Dataset from the beginning. Therefore, since your shuffling buffer is much smaller than your data (which is usually the case), it is important not to restart too early, to make sure training sees all of the data.

About shuffling

This answer solves the problem of interleaving datasets with custom weighting, not of dataset shuffling, which is an unrelated problem. Shuffling a large dataset requires making compromises: you cannot have efficient dynamic shuffling without sacrificing memory or performance somehow. See for example this excellent blog post on that topic, which graphically illustrates the impact of the buffer size on the quality of the shuffling.
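For a concrete feel of that trade-off, here is a tiny TF 1.x sketch on dummy data showing how the buffer_size argument bounds the randomness: with a small buffer the output stays close to the original order, while a buffer as large as the dataset gives a full shuffle (the printed orders are just illustrative examples).

import tensorflow as tf

def shuffled_order(buffer_size):
    # Shuffle a toy dataset of 10 elements with the given buffer size.
    ds = tf.data.Dataset.range(10).shuffle(buffer_size)
    iterator = ds.make_one_shot_iterator()
    next_element = iterator.get_next()
    with tf.Session() as sess:
        return [sess.run(next_element) for _ in range(10)]

print(shuffled_order(2))   # e.g. [1, 0, 2, 4, 3, 6, 5, 7, 9, 8] - stays close to sorted order
print(shuffled_order(10))  # e.g. [7, 2, 9, 0, 5, 1, 8, 4, 6, 3] - a full uniform shuffle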
