Is this one-hot encoding in TensorFlow fast? Or flawed for any reason?


Question

There are a few Stack Overflow questions about computing one-hot embeddings with TensorFlow, and here is the accepted solution:

import tensorflow as tf

num_labels = 10
# Flatten the label batch into a column vector of class indices.
sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(label_batch)[0]
# Pair each row index with its label to form (row, label) coordinates.
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
# Scatter 1.0 at each coordinate into a dense [batch_size, num_labels] matrix.
outshape = tf.reshape(tf.concat(0, [derived_size, [num_labels]]), [-1])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

This is almost identical to the code in an official tutorial: https://www.tensorflow.org/versions/0.6.0/tutorials/mnist/tf/index.html

To me it seems that, since tf.nn.embedding_lookup exists, it's probably more efficient. Here's a version that uses it and supports arbitrarily-shaped inputs:

import numpy as np
import tensorflow as tf

def one_hot(inputs, num_classes):
    with tf.device('/cpu:0'):
        # Identity matrix whose i-th row is the one-hot vector for class i.
        table = tf.constant(np.identity(num_classes, dtype=np.float32))
        # Look up the row for each input label; works for any input shape.
        embeddings = tf.nn.embedding_lookup(table, inputs)
    return embeddings
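
For concreteness, here is a hypothetical usage sketch of the function above (the label values and shapes are made up for illustration, and it assumes the graph-building style of the TensorFlow version discussed in the question):

labels = tf.constant([[0, 2], [1, 3]])          # int labels with arbitrary shape [2, 2]
one_hot_labels = one_hot(labels, num_classes=4)
# one_hot_labels has shape [2, 2, 4]: a trailing one-hot axis is appended, because
# embedding_lookup gathers one row of the identity table per label.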

Do you expect this implementation to be faster? And is it flawed for any other reason?

Recommended Answer

The one_hot() function in your question looks correct. However, the reason we do not recommend writing code this way is that it is very memory-inefficient. To understand why, suppose you have a batch size of 32 and 1,000,000 classes.

In the one_hot() function from the question, the largest tensor will be the result of np.identity(1000000), which is 4 terabytes. Of course, allocating this tensor will probably fail. Even if the number of classes were much smaller, it would still waste memory to store all of those zeroes explicitly: TensorFlow does not automatically convert your data to a sparse representation, even when it might be profitable to do so.
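
The back-of-the-envelope arithmetic behind the 4-terabyte figure (a rough sketch; actual allocation overhead depends on the runtime):

num_classes = 10 ** 6
bytes_per_float32 = 4
# A dense num_classes x num_classes identity matrix of float32 values.
identity_bytes = num_classes * num_classes * bytes_per_float32   # 4e12 bytes
print(identity_bytes / 10 ** 12)                                 # ~4.0 terabytes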

Finally, I want to offer a plug for a new function that was recently added to the open-source repository and will be available in the next release. tf.nn.sparse_softmax_cross_entropy_with_logits() lets you specify a vector of integers as the labels, which saves you from having to build a dense one-hot representation. It should be much more efficient than either solution for large numbers of classes.
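
As a rough sketch of how the integer-label loss would be used (the keyword-argument form shown here comes from later TensorFlow releases, and `logits` and `label_batch` are assumed placeholder names, not identifiers from the answer):

# logits: float tensor of shape [batch_size, num_classes]
# label_batch: int tensor of shape [batch_size] -- plain class indices, no one-hot
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=label_batch, logits=logits))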
