Word2Vec: Number of Dimensions

Question

I am using Word2Vec on a dataset of roughly 11,000,000 tokens to compute word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use. Does anyone have a good heuristic for the range of dimensions to consider, based on the number of tokens or sentences?

Recommended answer

The typical range is 100 to 300 dimensions. I would say you need at least 50 dimensions to reach a minimally acceptable accuracy; with fewer than that, you start to lose the properties of high-dimensional spaces that make the vectors useful. If training time is not a big concern for your application, I would stick with 200 dimensions, as it gives good features. The best accuracy is usually obtained at around 300 dimensions. Beyond 300, the word features won't improve dramatically, and training becomes much slower.
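To make the cost side of that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the usual Word2Vec layout of two weight matrices (input and output embeddings), each of size vocabulary x dimensions, stored as 32-bit floats. The vocabulary size below is a hypothetical placeholder, since the question only gives the token count, and the helper names are made up for illustration.

```python
def word2vec_param_count(vocab_size: int, dim: int) -> int:
    """Approximate number of weights: input and output matrices, each vocab_size x dim."""
    return 2 * vocab_size * dim


def approx_memory_mb(vocab_size: int, dim: int, bytes_per_weight: int = 4) -> float:
    """Approximate weight memory in megabytes, assuming float32 weights by default."""
    return word2vec_param_count(vocab_size, dim) * bytes_per_weight / 1e6


# Hypothetical vocabulary size; the question does not state it.
VOCAB_SIZE = 50_000

for dim in (50, 100, 200, 300):
    params = word2vec_param_count(VOCAB_SIZE, dim)
    print(f"{dim:>3}D: {params:,} weights, ~{approx_memory_mb(VOCAB_SIZE, dim):.0f} MB")
```

Both the weight count and the work per training update grow roughly linearly with the number of dimensions, so going from 100 to 300 dimensions costs about three times as much in each.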

I do not know of a theoretical explanation or strict bounds for choosing the dimension in high-dimensional spaces (and there might not be an application-independent explanation for it), but I would refer you to Pennington et al. (the GloVe paper), Figure 2a, where the x-axis shows the vector dimension and the y-axis shows the accuracy obtained. That should provide an empirical justification for the argument above.
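The same kind of accuracy-versus-dimension curve can be produced on your own data by training at a few candidate sizes and keeping the smallest one that is close to the best score. The sketch below shows only the selection logic. `pick_dimension` and `evaluate` are hypothetical names, and the scores in the usage example are made-up placeholders, not measured results. In practice `evaluate(dim)` would train a model at that size (in gensim 4.x the size is the `vector_size` argument of `Word2Vec`) and return a score on your similarity or synonym task.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_dimension(
    candidates: Iterable[int],
    evaluate: Callable[[int], float],
    tolerance: float = 0.01,
) -> Tuple[int, Dict[int, float]]:
    """Return the smallest dimension whose score is within `tolerance` of the best score.

    `evaluate` is a user-supplied callback (hypothetical here) that trains and
    scores a model for one dimension; higher scores are assumed to be better.
    """
    scores = {dim: evaluate(dim) for dim in sorted(set(candidates))}
    if not scores:
        raise ValueError("candidates must not be empty")
    best = max(scores.values())
    for dim in sorted(scores):
        if scores[dim] >= best - tolerance:
            return dim, scores
    raise AssertionError("unreachable: the best-scoring dimension always qualifies")


# Placeholder scores for illustration only; replace with real evaluations.
toy_scores = {50: 0.60, 100: 0.70, 200: 0.745, 300: 0.75}
chosen, all_scores = pick_dimension(toy_scores, toy_scores.__getitem__)
print("chosen dimension:", chosen)
```

Picking the smallest size within a tolerance, rather than the single best score, reflects the point above that gains flatten out at higher dimensions while the training cost keeps growing.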
