Word2Vec: Number of Dimensions

Question

I am using Word2Vec on a dataset of roughly 11,000,000 tokens to compute word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens or sentences?

Recommended answer

The typical range is 100 to 300 dimensions. I would say you need at least 50 dimensions to reach even minimal accuracy; with fewer than that, you start to lose the properties of high-dimensional spaces. If training time is not a big concern for your application, I would stick with 200 dimensions, as that gives good features. Very high accuracy can be obtained with 300 dimensions. Beyond 300, the word features won't improve dramatically, and training will be extremely slow.
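To put rough numbers on the training-cost side of this trade-off, here is a minimal sketch. It assumes the usual Word2Vec layout of two weight matrices (word vectors and context/output vectors), each of size vocabulary × dimensions and stored as 32-bit floats. The vocabulary size of 50,000 is a hypothetical figure, not something given in the question, and the function names are made up for illustration.

```python
def word2vec_param_count(vocab_size: int, dim: int) -> int:
    """Approximate number of weights: one input and one output matrix,
    each vocab_size x dim (assumed layout, see lead-in)."""
    return 2 * vocab_size * dim


def approx_memory_mb(vocab_size: int, dim: int, bytes_per_float: int = 4) -> float:
    """Approximate size of the weights in megabytes (1 MB = 1e6 bytes)."""
    return word2vec_param_count(vocab_size, dim) * bytes_per_float / 1e6


VOCAB_SIZE = 50_000  # hypothetical; depends on the corpus and min_count

for dim in (50, 100, 200, 300):
    params = word2vec_param_count(VOCAB_SIZE, dim)
    size_mb = approx_memory_mb(VOCAB_SIZE, dim)
    print(f"{dim:>3}D: {params:>10,} weights, ~{size_mb:.0f} MB")
```

Both the weight count and the work per training update grow roughly linearly with the number of dimensions, so going from 100 to 300 dimensions costs roughly three times as much per update.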

I do not know of a theoretical explanation or strict bounds for dimension selection in high-dimensional spaces (and there might not be an application-independent explanation), but I would refer you to Pennington et al., Figure 2a, where the x-axis shows the vector dimension and the y-axis shows the accuracy obtained. That should provide empirical justification for the argument above.
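Since the best dimension is likely application-dependent, one option is to produce a similar accuracy-versus-dimension curve on your own data. The following is a sketch under stated assumptions, not a definitive recipe: it assumes gensim 4.x is installed (`Word2Vec` with the `vector_size` parameter), that `sentences` is a list of token lists, and that you have a small hand-built list of known synonym pairs. The helper names, the hit-rate metric, and the `window`, `min_count`, and `workers` values are illustrative choices, not part of the original answer.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def train_gensim(sentences: List[List[str]], dim: int):
    """Train Word2Vec with `dim` dimensions and return its word vectors."""
    # Third-party dependency, imported lazily so the rest of the sketch
    # can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=sentences,
        vector_size=dim,  # the dimension being tuned
        window=5,         # illustrative setting
        min_count=5,      # illustrative setting
        workers=4,        # illustrative setting
        seed=0,
    )
    return model.wv


def synonym_hit_rate(kv, pairs: Sequence[Tuple[str, str]], topn: int = 10) -> float:
    """Fraction of in-vocabulary pairs (a, b) where b is among a's topn neighbours."""
    usable = [(a, b) for a, b in pairs if a in kv and b in kv]
    if not usable:
        return 0.0
    hits = 0
    for a, b in usable:
        neighbours = {word for word, _score in kv.most_similar(a, topn=topn)}
        if b in neighbours:
            hits += 1
    return hits / len(usable)


def sweep_dimensions(
    sentences: List[List[str]],
    pairs: Sequence[Tuple[str, str]],
    dims: Sequence[int] = (50, 100, 200, 300),
    train_fn: Callable = train_gensim,
) -> Dict[int, float]:
    """Train one model per dimension and score each on the synonym pairs."""
    return {dim: synonym_hit_rate(train_fn(sentences, dim), pairs) for dim in dims}
```

A call such as `sweep_dimensions(sentences, [("car", "automobile")])` returns a mapping from dimension to hit rate; the point where the scores stop improving is a reasonable place to stop adding dimensions.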
