How to import word2vec into TensorFlow Seq2Seq model?


Problem description

I am playing with the TensorFlow sequence-to-sequence translation model. I was wondering whether I could import my own word2vec embeddings into this model, rather than use the original 'dense representation' mentioned in the tutorial.

From my point of view, it looks like TensorFlow uses a one-hot representation for the seq2seq model. First, the encoder input to the function tf.nn.seq2seq.embedding_attention_seq2seq is a tokenized symbol, e.g. 'a' would be '4' and 'dog' would be '15715', and the function requires num_encoder_symbols. So I think it wants me to provide each word's index and the total number of words, and then it can represent each word as a one-hot vector (the sketch below illustrates this equivalence). I am still reading the source code, but it is hard to understand.
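To make the one-hot intuition concrete: looking up integer ids in an embedding matrix is equivalent to multiplying one-hot vectors by that matrix, which is why the function only needs each word's index and the total symbol count. A minimal sketch, assuming the TF1-era API (the sizes and ids below are made up for illustration):

    import numpy as np
    import tensorflow as tf

    vocab_size, embedding_dim = 16000, 64  # hypothetical sizes
    ids = tf.constant([4, 15715])          # e.g. 'a' -> 4, 'dog' -> 15715

    embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])

    # What the embedding_* functions do internally: a row lookup.
    looked_up = tf.nn.embedding_lookup(embedding, ids)

    # The equivalent one-hot formulation: one_hot(ids) times embedding.
    one_hot = tf.one_hot(ids, depth=vocab_size)
    via_matmul = tf.matmul(one_hot, embedding)

    with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())
      a, b = sess.run([looked_up, via_matmul])
      print(np.allclose(a, b))  # True: both yield the same vectors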

Could anyone give me an idea about the above problem?

Recommended answer

The seq2seq embedding_* functions indeed create embedding matrices very similar to those from word2vec. They are stored in a variable named something like this:

EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"

Knowing this, you can just modify this variable: get your word2vec vectors in some format, say a text file. Assuming your vocabulary is in model.vocab, you can then assign the read vectors as illustrated by the snippet below (it is just a snippet; you will have to adapt it to your code, but I hope it shows the idea).

    import sys

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.platform import gfile

    # FLAGS, model, vec_size and session come from the surrounding program.

    # Find the embedding matrix created by embedding_attention_seq2seq.
    vectors_variable = [v for v in tf.trainable_variables()
                        if EMBEDDING_KEY in v.name]
    if len(vectors_variable) != 1:
      print("Word vector variable not found or too many.")
      sys.exit(1)
    vectors_variable = vectors_variable[0]
    # Read the current values so words missing from the file keep
    # their existing rows.
    vectors = vectors_variable.eval()
    print("Setting word vectors from %s" % FLAGS.word_vector_file)
    with gfile.GFile(FLAGS.word_vector_file, mode="r") as f:
      # Lines have the format: dog 0.045123 -0.61323 0.413667 ...
      for line in f:
        line_parts = line.split()
        # The first part is the word.
        word = line_parts[0]
        if word in model.vocab:
          # The remaining parts are the components of the vector.
          word_vector = np.array([float(x) for x in line_parts[1:]])
          if len(word_vector) != vec_size:
            print("Warn: word '%s', expected vector size %d, found %d"
                  % (word, vec_size, len(word_vector)))
          else:
            vectors[model.vocab[word]] = word_vector
    # Assign the modified vectors to vectors_variable in the graph by
    # feeding them through the variable's initializer.
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
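The final session.run call works by feeding the new values into the variable's initializer op, which depends on how TF1 wires those ops internally. A more explicit equivalent, sketched with the standard tf.assign op from the TF1 API, is:

    # Overwrite the embedding variable in one step with an assign op.
    session.run(tf.assign(vectors_variable, vectors))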
