What is word vector dimension?

Question

I am currently an amateur in deep learning and was reading about word2vec on this site: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-3-more-fun-with-word-vectors

For either the CBOW or the skip-gram model, I can see that the dimension of the word vectors is 300 and the vocabulary size is 15000. From what I read in an earlier post, we can one-hot encode the words as vectors. So I guess the word vector dimension should be equal to the vocabulary size. Or, to put the question a different way: what is this word vector dimension, and how do you visualize it? How should I think about this dimension?

Recommended answer

"Word vector dimension" is the dimension of the vectors that you train on your training documents. Technically you can choose any dimension, like 10, 100, 300, or even 1000. The industry norm is 300-500, because we have experimented with different dimensions (300, 400, 500, ... 1000, etc.) and haven't noticed a significant performance improvement beyond 300-400. (This also depends on your training data.) As you would expect, more dimensions mean heavier computation. However, if we set the dimension too low, there is not enough vector space to capture the information that the training documents contain.

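To get a rough feel for why more dimensions mean heavier computation, here is a minimal sketch that counts embedding weights for the vocabulary size quoted in the question. It assumes the usual word2vec layout of two weight tables (input and output), each of shape vocabulary size by dimension; the function name `embedding_weights` is hypothetical and used only for illustration.

```python
def embedding_weights(vocab_size, dim, matrices=2):
    """Count the weights in `matrices` tables of shape (vocab_size, dim).

    Assumption: word2vec keeps an input table and an output table,
    hence the default of 2.
    """
    return matrices * vocab_size * dim


VOCAB_SIZE = 15000  # vocabulary size quoted in the question

for dim in (10, 100, 300, 1000):
    # The weight count grows linearly with the chosen dimension.
    print(dim, embedding_weights(VOCAB_SIZE, dim))
```

The count grows linearly with the dimension, so going from 300 to 1000 dimensions multiplies the number of weights to train by more than three.
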
How to visualize it?

You can't easily visualize a 300-dimensional vector, and doing so directly probably wouldn't be very useful to you. What we can do is project those vectors into 2-D space, the space that we are most familiar with and can understand easily.
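
One way to do such a projection is PCA (t-SNE is another common choice). The sketch below is a minimal pure-Python PCA using power iteration, not a production implementation. The words and the random 300-d vectors are stand-ins for trained word vectors, and all function names are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def center(vectors):
    """Subtract the per-coordinate mean from every vector."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    return [[v[j] - means[j] for j in range(dim)] for v in vectors]


def top_direction(rows, iterations=200, seed=0):
    """Approximate the leading principal direction by power iteration."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(len(rows[0]))]
    for _ in range(iterations):
        # Apply X^T (X w) without building the covariance matrix.
        scores = [dot(row, w) for row in rows]
        w_new = [sum(s * row[j] for s, row in zip(scores, rows))
                 for j in range(len(w))]
        norm = dot(w_new, w_new) ** 0.5
        if norm == 0.0:
            break
        w = [x / norm for x in w_new]
    return w


def project_to_2d(vectors):
    """Project high-dimensional vectors onto two principal directions."""
    rows = center(vectors)
    pc1 = top_direction(rows, seed=0)
    # Remove the pc1 component so the second pass finds the next direction.
    residual = [[x - dot(row, pc1) * p for x, p in zip(row, pc1)]
                for row in rows]
    pc2 = top_direction(residual, seed=1)
    return [(dot(row, pc1), dot(row, pc2)) for row in rows]


rng = random.Random(42)
words = ["king", "queen", "apple", "orange", "car", "truck"]  # toy vocabulary
# Random stand-ins for trained 300-d word vectors.
vectors = [[rng.uniform(-1.0, 1.0) for _ in range(300)] for _ in words]

points = project_to_2d(vectors)
for word, (px, py) in zip(words, points):
    print(f"{word:>6}: ({px:+.3f}, {py:+.3f})")
```

Each word ends up as an (x, y) pair that can be drawn on a scatter plot. With real trained vectors, related words would tend to land near each other; with the random stand-ins used here the layout carries no meaning.
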

Your last statement, "So I guess the word vector dimension should be equal to the vocabulary size", is wrong! A one-hot vector does have one entry per vocabulary word, but the word vector that word2vec learns is a much smaller dense vector. Vocabulary size can be as large as 171,476 words (a commonly cited count of the words in English)! The word vector dimension (mostly 300-500; you don't want to train 1-billion-dimensional vectors, do you?) is the size of the vector that you decide on in advance, before training on the data. My video (shameless plug) will help you understand the important word vector concepts: AI with the Best
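
To make the difference concrete, here is a minimal sketch contrasting a one-hot vector (length equal to the vocabulary size) with a dense word vector (length equal to the dimension you chose). The toy vocabulary, the dimension of 3, and the random untrained weights are all assumptions for illustration, and the helper names are hypothetical.

```python
import random


def one_hot(index, vocab_size):
    """Return a list of length vocab_size with 1.0 at `index`, 0.0 elsewhere."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a (vocab_size x dim) table of random, untrained weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(one_hot_vec, table):
    """Multiply a one-hot row vector by the table; this selects one row."""
    dim = len(table[0])
    out = [0.0] * dim
    for weight, row in zip(one_hot_vec, table):
        if weight:
            for j in range(dim):
                out[j] += weight * row[j]
    return out


vocab = ["king", "queen", "apple", "orange", "car"]  # toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}
DIM = 3  # chosen in advance, independent of len(vocab)

table = make_embedding_table(len(vocab), DIM)
x = one_hot(word_to_index["queen"], len(vocab))
v = embed(x, table)

print("one-hot length:", len(x))      # equals the vocabulary size
print("word vector length:", len(v))  # equals the chosen dimension
```

Multiplying the one-hot vector by the table just picks out one row, which is why the learned word vector has the dimension you chose rather than the vocabulary size.
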
