Initializing Out of Vocabulary (OOV) tokens
Question
I am building a TensorFlow model for an NLP task, and I am using the pretrained GloVe 300d word-vector/embedding dataset.
Obviously, some tokens can't be resolved as embeddings because they were not included in the training dataset for the word-vector embedding model, e.g. rare names.
I could replace those tokens with vectors of 0s, but rather than dropping this information on the floor, I would prefer to encode it somehow and include it in my training data.
Say I have the word 'raijin', which can't be resolved to an embedding vector. What would be the best way to encode it consistently with the GloVe embedding dataset? What is the best approach to convert it to a 300d vector?
Thanks.
Answer
Instead of assigning all the Out of Vocabulary (OOV) tokens to a common UNK vector (zeros), it is better to assign each of them a unique random vector. At least this way, when you compute the similarity between them and any other word, each of them is unique and the model can learn something from it. With a shared UNK vector they would all be identical, so all UNK words would be treated as having the same context.
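A minimal NumPy sketch of this idea, building the embedding matrix that would back a TensorFlow embedding layer. The `load_glove` helper and the `0.38` noise scale are assumptions for illustration (pick a scale matching the statistics of your GloVe file), not part of the original answer:

```python
import numpy as np

EMBED_DIM = 300  # GloVe 300d

def load_glove(path):
    """Parse a GloVe text file into {word: vector} (hypothetical helper)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vocab, glove, dim=EMBED_DIM, seed=0):
    """Build a (len(vocab), dim) matrix; each OOV word gets its own
    random vector instead of a shared all-zeros UNK vector."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            matrix[i] = glove[word]
        else:
            # Unique random vector per OOV token; the scale 0.38 is an
            # assumed rough match to GloVe vector norms, not a given.
            matrix[i] = rng.normal(scale=0.38, size=dim)
    return matrix
```

The resulting matrix can then be used to initialize the embedding layer (e.g. via `tf.keras.initializers.Constant`), so that each OOV token starts from a distinct point that training can refine.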
I tried this approach and got a 3% accuracy improvement on the Quora duplicate question pair detection dataset using an LSTM model.