How to add new embeddings for unknown words in TensorFlow (training & pre-set for testing)


Question

I am curious how I can add a normally distributed random 300-dimensional vector (element type tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases I encounter unknown words, and I want to create a normally distributed random word vector for each newly found unknown word.

The problem is that in my current setup I use tf.contrib.lookup.index_table_from_tensor to convert words to integers based on the known vocabulary. This function can create new tokens and hash them into a predefined number of out-of-vocabulary buckets, but my embed will not contain an embedding for such a new hash value. I am uncertain whether I can simply append a randomized embedding to the end of the embed list.
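For illustration, the mismatch described above can be sketched without TensorFlow. The num_oov_buckets argument of index_table_from_tensor assigns out-of-vocabulary words ids in the range [vocab_size, vocab_size + num_oov_buckets - 1], so the embedding matrix needs that many extra rows. The helper lookup_ids and the CRC32 hash below are hypothetical stand-ins for the real lookup table and its internal hash, not TensorFlow's implementation.

```python
import zlib


def lookup_ids(tokens, vocab, num_oov_buckets):
    """Mimic a vocabulary lookup that hashes unknown words into OOV buckets."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    ids = []
    for token in tokens:
        if token in word_to_id:
            ids.append(word_to_id[token])
        else:
            # Stand-in hash (assumption): TensorFlow uses its own fingerprint function.
            bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
            ids.append(len(vocab) + bucket)
    return ids


vocab = ["", "the", "cat", "sat"]
num_oov_buckets = 3
ids = lookup_ids(["the", "dog", "sat", "zebra"], vocab, num_oov_buckets)

# The embedding matrix needs one row per possible id, otherwise the
# hashed OOV ids point past the end of the pretrained matrix.
required_rows = len(vocab) + num_oov_buckets
print(ids, "need", required_rows, "embedding rows")
```

Under this scheme, appending num_oov_buckets randomly initialized rows to the pretrained matrix would give every hashed id a valid row, at the cost of distinct unknown words sometimes sharing a bucket.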

I would also like to do this efficiently, so a pre-built TensorFlow function, or a method built from TensorFlow functions, would probably be best. I define pre-known special tokens such as an end-of-sentence token and a default unknown token, the empty string ("" at index 0), but this has limited power to learn representations for various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.

I would like to be able to add a new random 300d vector for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that may be encountered during testing. What is the most efficient way of doing this?

import numpy as np
import tensorflow as tf  # TensorFlow 1.x (tf.contrib was removed in 2.x)

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors.
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value = 0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding 
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                 dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)

Recommended answer

The code example below adapts your embed_tensor function so that words are embedded as follows:

  • For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
  • For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
  • For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300
def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
  """
  Convert List of strings into list of indices then into 300d vectors
  """
  # ordered lists of vocab and corresponding (by index) 300d vector
  pretrained_vocab, pretrained_embs = load_pretrained_glove()
  train_vocab = get_train_vocab()
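  # sorted() keeps the row order of train-only words identical across runs;
  # plain set iteration order for strings can change between Python processes.
  only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))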
  vocab = pretrained_vocab + only_in_train

  # Set up tensorflow look up from string word to unique integer
  vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    default_value=len(vocab))
  string_tensor = vocab_lookup.lookup(string_tensor)

  # define the word embedding
  pretrained_embs = tf.get_variable(
      name="embs_pretrained",
      initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
      shape=pretrained_embs.shape,
      trainable=trainable)
  train_embeddings = tf.get_variable(
      name="embs_only_in_train",
      shape=[len(only_in_train), EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=trainable)
  unk_embedding = tf.get_variable(
      name="unk_embedding",
      shape=[1, EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=False)

  embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

  return tf.nn.embedding_lookup(embeddings, string_tensor)
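
The row layout that the concatenated embedding matrix relies on can be checked without TensorFlow. The following is a minimal sketch, assuming the same toy vocabularies as above; build_index and rows_for are hypothetical helper names, not part of any library. Pretrained words keep their original rows, train-only words follow, and every other word falls through to the single unknown row at index len(vocab).

```python
def build_index(pretrained_vocab, train_vocab):
    """Return the word-to-row mapping and the row used for unknown words."""
    # sorted() keeps the order of train-only words stable across runs.
    only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))
    vocab = pretrained_vocab + only_in_train
    word_to_row = {word: i for i, word in enumerate(vocab)}
    unk_row = len(vocab)  # one extra row after all known words
    return word_to_row, unk_row


def rows_for(tokens, word_to_row, unk_row):
    """Map each token to its embedding row, defaulting to the unknown row."""
    return [word_to_row.get(token, unk_row) for token in tokens]


pretrained_vocab = ["a", "cat", "sat", "on", "the", "mat"]
train_vocab = ["a", "dog", "sat", "on", "the", "mat"]

word_to_row, unk_row = build_index(pretrained_vocab, train_vocab)
rows = rows_for(["the", "dog", "sat", "on", "a", "zebra"], word_to_row, unk_row)
print(rows)  # prints [4, 6, 2, 3, 0, 7]
```

Here "dog" lands in row 6 (the train-only block) and "zebra" in row 7 (the shared unknown row), which matches the order of the three blocks passed to tf.concat.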

FYI, to get a sensible, non-random representation for words that neither occur in the training data nor have a pretrained embedding, you could map low-frequency words in your training data to an unk token (one that is not in your vocabulary) and make unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
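
The low-frequency mapping suggested above could look roughly like this. It is a sketch under stated assumptions: the token string "<unk>", the helper replace_rare_words, and the min_count threshold of 2 are all hypothetical choices for illustration, not something prescribed by the answer.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical token; pick one that never occurs in the data


def replace_rare_words(sentences, min_count=2):
    """Replace words seen fewer than min_count times with the UNK token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else UNK for word in sentence]
        for sentence in sentences
    ]


train_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

processed = replace_rare_words(train_sentences, min_count=2)
print(processed)
```

Because the UNK token is deliberately left out of the vocabulary, the lookup table sends it to default_value, that is, to the unk_embedding row. With trainable=True on that variable, the row then receives gradient updates from every rare word during training.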
