Word2Vec: Effect of window size used
Question
I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use can be at most 2. I am trying to understand what the implications of such a small window size are for the quality of the learned model, so that I can tell whether my model has learned something meaningful. I tried training a word2vec model on the 5-grams, but it appears the learned model does not capture semantics very well.
I am using the following test to evaluate the accuracy of the model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt
I used gensim.Word2Vec to train a model, and here is a snippet of my accuracy scores (using a window size of 2):
[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
{'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
{'correct': 0, 'incorrect': 86, 'section': 'currency'},
{'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
{'correct': 123, 'incorrect': 183, 'section': 'family'},
{'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
{'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
{'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
{'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
{'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
{'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
{'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
{'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
{'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
{'correct': 835, 'incorrect': 9973, 'section': 'total'}]
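For context, the `'total'` entry above works out to an overall top-1 accuracy of under 8%. A quick sanity check on that aggregation, using the per-section counts copied from the output above:

```python
# Per-section analogy results as printed by the gensim evaluation above
# (the 'total' row is excluded; we recompute it here).
results = [
    {'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
    {'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
    {'correct': 0, 'incorrect': 86, 'section': 'currency'},
    {'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
    {'correct': 123, 'incorrect': 183, 'section': 'family'},
    {'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
    {'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
    {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
    {'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
    {'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
    {'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
    {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
    {'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
    {'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
]

correct = sum(r['correct'] for r in results)
incorrect = sum(r['incorrect'] for r in results)
accuracy = correct / (correct + incorrect)
print(f"{correct}/{correct + incorrect} = {accuracy:.1%}")  # 835/10808 = 7.7%
```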
I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and got poor accuracy as well:
Sample output:
capital-common-countries:
ACCURACY TOP1: 19.37 % (98 / 506)
Total accuracy: 19.37 % Semantic accuracy: 19.37 % Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 10.26 % (149 / 1452)
Total accuracy: 12.61 % Semantic accuracy: 12.61 % Syntactic accuracy: -nan %
currency:
ACCURACY TOP1: 6.34 % (17 / 268)
Total accuracy: 11.86 % Semantic accuracy: 11.86 % Syntactic accuracy: -nan %
city-in-state:
ACCURACY TOP1: 11.78 % (185 / 1571)
Total accuracy: 11.83 % Semantic accuracy: 11.83 % Syntactic accuracy: -nan %
family:
ACCURACY TOP1: 57.19 % (175 / 306)
Total accuracy: 15.21 % Semantic accuracy: 15.21 % Syntactic accuracy: -nan %
gram1-adjective-to-adverb:
ACCURACY TOP1: 6.48 % (49 / 756)
Total accuracy: 13.85 % Semantic accuracy: 15.21 % Syntactic accuracy: 6.48 %
gram2-opposite:
ACCURACY TOP1: 17.97 % (55 / 306)
Total accuracy: 14.09 % Semantic accuracy: 15.21 % Syntactic accuracy: 9.79 %
gram3-comparative:
ACCURACY TOP1: 34.68 % (437 / 1260)
Total accuracy: 18.13 % Semantic accuracy: 15.21 % Syntactic accuracy: 23.30 %
gram4-superlative:
ACCURACY TOP1: 14.82 % (75 / 506)
Total accuracy: 17.89 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.78 %
gram5-present-participle:
ACCURACY TOP1: 19.96 % (198 / 992)
Total accuracy: 18.15 % Semantic accuracy: 15.21 % Syntactic accuracy: 21.31 %
gram6-nationality-adjective:
ACCURACY TOP1: 35.81 % (491 / 1371)
Total accuracy: 20.76 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.14 %
gram7-past-tense:
ACCURACY TOP1: 19.67 % (262 / 1332)
Total accuracy: 20.62 % Semantic accuracy: 15.21 % Syntactic accuracy: 24.02 %
gram8-plural:
ACCURACY TOP1: 35.38 % (351 / 992)
Total accuracy: 21.88 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.52 %
gram9-plural-verbs:
ACCURACY TOP1: 20.00 % (130 / 650)
Total accuracy: 21.78 % Semantic accuracy: 15.21 % Syntactic accuracy: 25.08 %
Questions seen / total: 12268 19544 62.77 %
However, the word2vec site claims it is possible to obtain an accuracy of ~60% on these tasks. Hence I would like to gain some insight into the effect of hyperparameters such as window size, and how they affect the quality of the learned models.
Answer
To your question: "I am trying to understand what the implications of such a small window size are for the quality of the learned model."
Take, for example, the 5-word sentence "stackoverflow great website for programmers" (suppose we keep the stop words "great" and "for" here). If the window size is 2, the vector of the word "stackoverflow" is directly affected only by the words "great" and "website"; if the window size is 5, "stackoverflow" can also be directly affected by two more words, "for" and "programmers". "Affected" here means training pulls the vectors of the two words closer together.
So it depends on the material you are using for training: if a window size of 2 is enough to capture the context of a word but 5 is chosen, the larger window will decrease the quality of the learnt model, and vice versa.
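The effect of the window parameter on which word pairs get trained can be seen in a small sketch of skip-gram pair generation (a simplified illustration with a fixed window; real word2vec implementations also randomly shrink the effective window per position and subsample frequent words):

```python
def skipgram_pairs(sentence, window):
    """Generate (center, context) pairs as a fixed-size
    skip-gram window would (no dynamic window shrinking)."""
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sent = "stackoverflow great website for programmers"
# window=2: "stackoverflow" only pairs with "great" and "website"
print([c for w, c in skipgram_pairs(sent, 2) if w == "stackoverflow"])
# window=5: it additionally pairs with "for" and "programmers"
print([c for w, c in skipgram_pairs(sent, 5) if w == "stackoverflow"])
```

Note that in both gensim and the original word2vec tool, the window parameter is a maximum: the effective window at each position is sampled uniformly from 1 up to that value, so nearer words are weighted more heavily on average.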