Does TensorFlow's `sample_from_datasets` still sample from a Dataset when getting a `DirectedInterleave selected an exhausted input` warning?


Problem Description

When using TensorFlow's tf.data.experimental.sample_from_datasets to equally sample from two very unbalanced Datasets, I end up getting a DirectedInterleave selected an exhausted input: 0 warning. Based on this GitHub issue, it appears that this is occurring when one of the Datasets inside the sample_from_datasets has been depleted of examples, and would need to sample already seen examples.

Does the depleted dataset then still produce samples (thereby maintaining the desired balanced training ratio), or does the dataset not sample so the training once again becomes unbalanced? If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Note: TensorFlow 2 Beta is being used.

Recommended Answer

The smaller dataset does NOT repeat - once it is exhausted the remainder will just come from the larger dataset that still has examples.

You can verify this behaviour by doing something like this:

import tensorflow as tf

# A small dataset that is exhausted after 5 elements.
def data1():
  for i in range(5):
    yield "data1-{}".format(i)

# A much larger dataset.
def data2():
  for i in range(10000):
    yield "data2-{}".format(i)

ds1 = tf.data.Dataset.from_generator(data1, tf.string)
ds2 = tf.data.Dataset.from_generator(data2, tf.string)

# Sample from both datasets with equal (default) weights.
sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1], seed=1)

Then, if we iterate over sampled_ds, we see that no samples from data1 are produced once it is exhausted.
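A minimal loop for this (a sketch assuming eager execution, which is the default in the TF 2 beta):

# Iterate the merged dataset and print each element.
for example in sampled_ds:
  print(example)

This prints: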

tf.Tensor(b'data1-0', shape=(), dtype=string)
tf.Tensor(b'data2-0', shape=(), dtype=string)
tf.Tensor(b'data2-1', shape=(), dtype=string)
tf.Tensor(b'data2-2', shape=(), dtype=string)
tf.Tensor(b'data2-3', shape=(), dtype=string)
tf.Tensor(b'data2-4', shape=(), dtype=string)
tf.Tensor(b'data1-1', shape=(), dtype=string)
tf.Tensor(b'data1-2', shape=(), dtype=string)
tf.Tensor(b'data1-3', shape=(), dtype=string)
tf.Tensor(b'data2-5', shape=(), dtype=string)
tf.Tensor(b'data1-4', shape=(), dtype=string)
tf.Tensor(b'data2-6', shape=(), dtype=string)
tf.Tensor(b'data2-7', shape=(), dtype=string)
tf.Tensor(b'data2-8', shape=(), dtype=string)
tf.Tensor(b'data2-9', shape=(), dtype=string)
tf.Tensor(b'data2-10', shape=(), dtype=string)
tf.Tensor(b'data2-11', shape=(), dtype=string)
tf.Tensor(b'data2-12', shape=(), dtype=string)
...
---[no more 'data1-x' examples]--
...

Of course, you could make data1 repeat with something like this:

sampled_ds = tf.data.experimental.sample_from_datasets([ds2, ds1.repeat()], seed=1)

But it seems from the comments that you are aware of this and it doesn't work for your scenario.

If the latter, is there a method to produce the desired balanced training ratio with sample_from_datasets?

Well, if you have 2 datasets of differing lengths and you are sampling evenly from them, then it seems like you only have 2 choices:

  • Repeat the smaller dataset n times (where n ≈ len(ds2) / len(ds1))
  • Stop sampling once the smaller dataset is exhausted

To achieve the first you can use ds1.repeat(n).
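As a sketch using the toy sizes from the example above (balanced_ds is just an illustrative name; substitute your real dataset lengths):

n = 10000 // 5  # n ≈ len(ds2) / len(ds1)
# Repeating ds1 lets sampling keep drawing from it for the whole epoch.
balanced_ds = tf.data.experimental.sample_from_datasets([ds2, ds1.repeat(n)], seed=1)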

To achieve the second you could use ds2.take(m) where m=len(ds1).
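Again as a sketch with the toy sizes above (truncated_ds is an illustrative name):

m = 5  # m = len(ds1)
# Truncating ds2 means both inputs run out at roughly the same time,
# so sampling stops instead of becoming unbalanced.
truncated_ds = tf.data.experimental.sample_from_datasets([ds2.take(m), ds1], seed=1)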
