Initializing Out of Vocabulary (OOV) tokens


Problem description

I am building a TensorFlow model for an NLP task, and I am using the pretrained GloVe 300d word-vector/embedding dataset.

Obviously, some tokens can't be resolved as embeddings because they were not included in the training dataset for the word-vector embedding model, e.g. rare names.

I could replace those tokens with vectors of zeros, but rather than dropping this information on the floor, I would prefer to encode it somehow and include it in my training data.

Say I have the word 'raijin', which can't be resolved to an embedding vector. What would be the best way to encode it consistently with the GloVe embedding dataset? What is the best approach to convert it to a 300d vector?
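To make the situation concrete, here is a minimal sketch of loading GloVe-format text (each line is a word followed by its float components) and checking which tokens fail to resolve. The `load_glove` helper and the tiny 3d in-memory table are hypothetical stand-ins for the real 300d GloVe file:

```python
import io
import numpy as np

def load_glove(fileobj):
    """Parse GloVe-format lines ('word f1 f2 ...') into a word -> vector dict."""
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# A tiny fake table stands in for the real glove file for illustration.
fake_glove = io.StringIO("the 0.1 0.2 0.3\nstorm 0.4 0.5 0.6\n")
glove = load_glove(fake_glove)

tokens = ["the", "raijin"]
oov = [w for w in tokens if w not in glove]  # tokens with no pretrained vector
```

Here `oov` ends up containing `'raijin'`, the case the question is asking about.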

Thanks.

Answer

Instead of assigning all the out-of-vocabulary tokens to a common UNK vector (zeros), it is better to assign each of them a unique random vector. At least this way, when you compute the similarity between them and any other word, each of them is unique and the model can learn something from it. In the UNK case they are all the same, so all UNK words would be treated as having the same context.

I tried this approach and got a 3% accuracy improvement on the Quora duplicate question pair detection dataset using an LSTM model.
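The idea can be sketched as follows: build the embedding matrix so that in-vocabulary words get their GloVe vectors and each OOV word gets its own random vector instead of zeros. The `build_embedding_matrix` helper, the scale of the random draws, and the toy two-word "GloVe" table are assumptions for illustration, not part of the original answer:

```python
import numpy as np

EMBED_DIM = 300

def build_embedding_matrix(vocab, glove, dim=EMBED_DIM, seed=42):
    """Return a (len(vocab), dim) matrix; OOV rows are unique random vectors."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            matrix[i] = glove[word]
        else:
            # A distinct random vector per OOV word, so OOV tokens are not
            # all collapsed onto the same point in embedding space.
            matrix[i] = rng.normal(scale=0.6, size=dim)
    return matrix

# Toy usage with a fake one-word "GloVe" table:
glove = {"storm": np.ones(EMBED_DIM, dtype=np.float32)}
vocab = ["storm", "raijin"]  # 'raijin' is OOV
emb = build_embedding_matrix(vocab, glove)
```

Seeding the generator keeps the OOV vectors reproducible across runs, so the same OOV word always maps to the same vector; the resulting matrix can then be used to initialize the model's embedding layer.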

