Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow


Question

I've recently reviewed an interesting implementation for convolutional text classification. However, all the TensorFlow code I've reviewed uses random (not pre-trained) embedding vectors like the following:

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    # Embedding matrix initialized with random values (not pre-trained).
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    # Look up the embedding vector for each word id in the input batch.
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?

Answer

There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().
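If you are starting from a GloVe text file rather than a ready-made array, a minimal sketch of building such a NumPy array might look like the following. It assumes the standard GloVe text format of one word followed by its vector per line; glove_path and vocab are hypothetical names for your file path and word-to-id mapping:

import numpy as np

# Hypothetical inputs: a GloVe text file and the model's word -> id mapping.
glove_path = "glove.6B.100d.txt"
vocab = {"the": 0, "cat": 1, "sat": 2}  # toy vocabulary
vocab_size = len(vocab)
embedding_dim = 100

# Start from small random vectors so words missing from GloVe still get a value.
embedding = np.random.uniform(-0.25, 0.25,
                              (vocab_size, embedding_dim)).astype(np.float32)

with open(glove_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vector = parts[0], parts[1:]
        if word in vocab:
            embedding[vocab[word]] = np.asarray(vector, dtype=np.float32)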

  1. Simply create W as a tf.constant() that takes embedding as its value:

W = tf.constant(embedding, name="W")

This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.

  2. Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():

# W holds the embedding matrix; trainable=False keeps it fixed during training.
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")

# Feed the pre-trained values in at runtime instead of baking them into the graph.
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

# ...
sess = tf.Session()

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.
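Once embedding_init has been run, W can be used exactly like the randomly initialized embedding matrix from the question. A minimal sketch of wiring it into the lookup (input_x and sequence_length are assumed here, mirroring the question's code):

# Assumed word-id input, as in the question's model.
input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")

# The lookup itself is unchanged; only W's initialization differs.
embedded_chars = tf.nn.embedding_lookup(W, input_x)
embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)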

  3. If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:

W = tf.Variable(...)

# Map the variable name used in the other model's checkpoint to W in this graph.
embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})

# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
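If you are unsure what to use for "name_of_variable_in_other_model", one way to find it is to list the variables stored in the checkpoint. A small sketch using tf.train.list_variables (available in recent TF 1.x releases):

# Print every variable name and shape in the other model's checkpoint so the
# embedding variable's name can be passed to the tf.train.Saver above.
for name, shape in tf.train.list_variables("checkpoint_filename.ckpt"):
    print(name, shape)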

