Word2Vec: Effect of window size used
Question
I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use can be at most 2. I am trying to understand what the implications of such a small window size are for the quality of the learned model, so that I can tell whether my model has learned something meaningful. I tried training a word2vec model on the 5-grams, but it appears the learned model does not capture semantics very well.
I am using the following test to evaluate the accuracy of the model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
I used gensim.Word2Vec to train a model, and here is a snippet of my accuracy scores (using a window size of 2):
[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
{'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
{'correct': 0, 'incorrect': 86, 'section': 'currency'},
{'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
{'correct': 123, 'incorrect': 183, 'section': 'family'},
{'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
{'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
{'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
{'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
{'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
{'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
{'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
{'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
{'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
{'correct': 835, 'incorrect': 9973, 'section': 'total'}]
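For context, the `'total'` entry above works out to an overall top-1 accuracy of under 8%. A quick sanity check on that aggregation, using the per-section counts copied from the output above:

```python
# Per-section analogy results as printed by the gensim evaluation above
# (the 'total' row is excluded; we recompute it here).
results = [
    {'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
    {'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
    {'correct': 0, 'incorrect': 86, 'section': 'currency'},
    {'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
    {'correct': 123, 'incorrect': 183, 'section': 'family'},
    {'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
    {'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
    {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
    {'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
    {'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
    {'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
    {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
    {'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
    {'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
]

correct = sum(r['correct'] for r in results)
incorrect = sum(r['incorrect'] for r in results)
accuracy = correct / (correct + incorrect)
print(f"{correct}/{correct + incorrect} = {accuracy:.1%}")  # 835/10808 = 7.7%
```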
I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and got poor accuracy as well:
Sample output:
capital-common-countries:
ACCURACY TOP1: 19.37 % (98 / 506)
Total accuracy: 19.37 % Semantic accuracy: 19.37 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 10.26 % (149 / 1452)
Total accuracy: 12.61 % Semantic accuracy: 12.61 % Syntactic accuracy: -nan %
currency:
ACCURACY TOP1: 6.34 % (17 / 268)
Total accuracy: 11.86 % Semantic accuracy: 11.86 % Syntactic accuracy: -nan %
city-in-state:
ACCURACY TOP1: 11.78 % (185 / 1571)
Total accuracy: 11.83 % Semantic accuracy: 11.83 % Syntactic accuracy: -nan %
family:
ACCURACY TOP1: 57.19 % (175 / 306)
Total accuracy: 15.21 % Semantic accuracy: 15.21 % Syntactic accuracy: -nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 6.48 % (49 / 756)
Total accuracy: 13.85 % Semantic accuracy: 15.21 % Syntactic accuracy: 6.48 %
gram2-opposite:
ACCURACY TOP1: 17.97 % (55 / 306)
Total accuracy: 14.09 % Semantic accuracy: 15.21 % Syntactic accuracy: 9.79 %
gram3-comparative:
ACCURACY TOP1: 34.68 % (437 / 1260)
Total accuracy: 18.13 % Semantic accuracy: 15.21 % Syntactic accuracy: 23.30 %
gram4-superlative:
ACCURACY TOP1: 14.82 % (75 / 506)
Total accuracy: 17.89 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.78 %
gram5-present-participle:
ACCURACY TOP1: 19.96 % (198 / 992)
Total accuracy: 18.15 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.31 %
gram6-nationality-adjective:
ACCURACY TOP1: 35.81 % (491 / 1371)
Total accuracy: 20.76 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.14 %
gram7-past-tense:
ACCURACY TOP1: 19.67 % (262 / 1332)
Total accuracy: 20.62 % Semantic accuracy: 15.21 % Syntactic accuracy: 24.02 %
gram8-plural:
ACCURACY TOP1: 35.38 % (351 / 992)
Total accuracy: 21.88 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.52 %
gram9-plural-verbs:
ACCURACY TOP1: 20.00 % (130 / 650)
Total accuracy: 21.78 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.08 %
Questions seen / total: 12268 19544 62.77 %
However, the word2vec site claims it is possible to obtain an accuracy of ~60% on these tasks. Hence I would like to gain some insight into the effect of hyperparameters such as window size, and how they affect the quality of the learned models.
Answer
To your question: "I am trying to understand what the implications of such a small window size are for the quality of the learned model."
Take, for example, the 5-word sentence "stackoverflow great website for programmers" (suppose we keep the stop words "great" and "for" here). If the window size is 2, the vector of the word "stackoverflow" is directly affected only by the words "great" and "website"; if the window size is 5, "stackoverflow" can also be directly affected by two more words, "for" and "programmers". "Affected" here means training pulls the vectors of the two words closer together.
So it depends on the material you are using for training: if a window size of 2 is enough to capture the context of a word but 5 is chosen, the larger window will decrease the quality of the learnt model, and vice versa.
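The effect of the window parameter on which word pairs get trained can be seen in a small sketch of skip-gram pair generation (a simplified illustration with a fixed window; real word2vec implementations also randomly shrink the effective window per position and subsample frequent words):

```python
def skipgram_pairs(sentence, window):
    """Generate (center, context) pairs as a fixed-size
    skip-gram window would (no dynamic window shrinking)."""
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sent = "stackoverflow great website for programmers"
# window=2: "stackoverflow" only pairs with "great" and "website"
print([c for w, c in skipgram_pairs(sent, 2) if w == "stackoverflow"])
# window=5: it additionally pairs with "for" and "programmers"
print([c for w, c in skipgram_pairs(sent, 5) if w == "stackoverflow"])
```

Note that in both gensim and the original word2vec tool, the window parameter is a maximum: the effective window at each position is sampled uniformly from 1 up to that value, so nearer words are weighted more heavily on average.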