Obtaining sentence embedding by getting the mean of all its word embeddings in Tensorflow?


Question


Here is my code for splitting an input tensor of type tf.string and extracting each of its word embeddings using a pre-trained GloVe model. However, I get unwarranted errors regarding the cond implementation. I wonder if there is a cleaner way to obtain embeddings for all words in a string tensor.

# Take out the words
target_words = tf.string_split([target_sentence], delimiter=" ")

# Tensorflow parallel while loop variable, condition and body
i = tf.constant(0, dtype=tf.int32)
cond = lambda self, i: tf.less(x=tf.cast(i, tf.int32), y=tf.cast(tf.shape(target_words)[0], tf.int32))
sentence_mean_embedding = tf.Variable([], trainable=False)

def body(i, sentence_mean_embedding):
    sentence_mean_embedding = tf.concat(1, tf.nn.embedding_lookup(params=tf_embedding, ids=tf.gather(target_words, i)))

    return sentence_mean_embedding

embedding_sentence = tf.reduce_mean(tf.while_loop(cond, body, [i, sentence_mean_embedding]))

Answer

You can solve this cleanly by using index_table_from_file together with the Dataset API.


First, create your own tf.Dataset (I assume we have two sentences with some arbitrary labels):

sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])
dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))


Second, create a vocab.txt file in which each line number maps to the same index in the GloVe embedding. For example, if the first word in the GloVe vocabulary is "absent", then the first line of vocab.txt should be "absent", and so on. For simplicity, assume our vocab.txt contains the following words:

first
is
test
this
second
sentence
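Since vocab.txt must list words in exactly the same order as the GloVe embedding matrix, it can be generated directly from the GloVe text file, where each line starts with the word followed by its vector components. A minimal sketch, assuming the standard `glove.*.txt` layout; both file paths are placeholders:

```python
def build_vocab_file(glove_path, vocab_path):
    """Write the first token of every GloVe line to vocab_path,
    preserving order so that line i of vocab.txt matches row i
    of the embedding matrix."""
    with open(glove_path, encoding="utf-8") as glove, \
         open(vocab_path, "w", encoding="utf-8") as vocab:
        for line in glove:
            word = line.split(" ", 1)[0]
            vocab.write(word + "\n")

# Usage (paths are illustrative):
# build_vocab_file("glove.6B.300d.txt", "vocab.txt")
```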


Then, define a table whose goal is to convert each word to a specific id:

table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))

dataset = dataset.batch(1)
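The effect of index_table_from_file with num_oov_buckets=1 can be pictured in plain Python: each in-vocabulary word maps to its line number, and any unknown word falls into a single extra bucket whose id is the vocabulary size. This is only a sketch of the behavior, not TensorFlow's implementation (which hashes OOV words into the buckets; with a single bucket the result is the same id either way):

```python
# The six-word vocabulary from vocab.txt above.
vocab = ["first", "is", "test", "this", "second", "sentence"]
word_to_id = {w: i for i, w in enumerate(vocab)}
oov_id = len(vocab)  # the single out-of-vocabulary bucket

def lookup(words):
    """Map each word to its vocab line number, or to the OOV bucket."""
    return [word_to_id.get(w, oov_id) for w in words]

print(lookup("this is first sentence".split()))  # → [3, 1, 0, 5]
print(lookup(["unknownword"]))                   # → [6]
```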


Finally, based on this answer, use tf.nn.embedding_lookup() to convert each sentence to an embedding:

glove_weights = tf.get_variable('embed', shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)

iterator = dataset.make_initializable_iterator()
x, y = iterator.get_next()

embedding = tf.nn.embedding_lookup(glove_weights, x)
sentence = tf.reduce_mean(embedding, axis=1)
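The reduce_mean over axis=1 is simply an average of the word vectors in each sentence. The same pooling written in plain Python, with a tiny hypothetical 2-dimensional embedding table standing in for GloVe:

```python
# Rows are word vectors; ids index into this table (toy 2-d embeddings).
embedding_table = [
    [1.0, 2.0],   # id 0
    [3.0, 4.0],   # id 1
    [5.0, 6.0],   # id 2
]

def sentence_embedding(ids):
    """Look up each id's vector, then average component-wise."""
    vectors = [embedding_table[i] for i in ids]          # embedding_lookup
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors)    # reduce_mean over words
            for d in range(dim)]

print(sentence_embedding([0, 1, 2]))  # → [3.0, 4.0]
```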




Complete code in eager mode:

import tensorflow as tf

tf.enable_eager_execution()

sentence = tf.constant(['this is first sentence', 'this is second sentence'])
labels = tf.constant([1, 0])

dataset = tf.data.Dataset.from_tensor_slices((sentence, labels))
table = tf.contrib.lookup.index_table_from_file(vocabulary_file="vocab.txt", num_oov_buckets=1)
dataset = dataset.map(lambda x, y: (tf.string_split([x]).values, y))
dataset = dataset.map(lambda x, y: (tf.cast(table.lookup(x), tf.int32), y))

dataset = dataset.batch(1)

# Random stand-in for the real GloVe matrix, with the same (vocab_size, dim) shape.
glove_weights = tf.get_variable('embed', shape=(10000, 300), initializer=tf.truncated_normal_initializer())

for x, y in dataset:
    embedding = tf.nn.embedding_lookup(glove_weights, x)
    sentence = tf.reduce_mean(embedding, axis=1)
    print(sentence.shape)

