Gensim most_similar() with FastText word vectors returns useless/meaningless words


Question

I'm using Gensim with FastText word vectors to return similar words.

This is my code:

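import gensim

# Load the plain-text FastText vectors through the generic word2vec-format reader
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')

# Ask for the ten nearest neighbours of "sole"
words = model.most_similar(positive=['sole'], topn=10)
print(words)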

This returns:

[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]

The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (such as sole., sole.Ma, etc.). What is going wrong? Why does most_similar() return these meaningless words?

Edit

I tried the English word vectors, and the word "sun" returns this:

[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 

Is it impossible to reproduce results like those of relatedwords.org?

Recommended answer

Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before. Is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)

To gain the unique benefits of FastText, including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words, you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText-specific load method on the .bin file. See:

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors

(I'm not sure that will help with these results, but if you choose to use FastText, you may be interested in using it "fully".)
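As a rough sketch of that loading path: the helper name load_fasttext_vectors is made up for illustration, and the .bin file name used in the note below is an assumption (the question only mentions the .vec file). The gensim call itself, load_facebook_vectors, is the one documented at the link above.

```python
from pathlib import Path


def load_fasttext_vectors(bin_path):
    """Load Facebook-format FastText vectors (.bin), keeping subword information.

    Hypothetical helper for illustration; only load_facebook_vectors is gensim's.
    """
    path = Path(bin_path)
    if not path.is_file():
        raise FileNotFoundError(f"FastText .bin model not found: {path}")
    # Imported here so the path check above runs before gensim is needed.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(str(path))
```

Assuming the file is saved as cc.it.300.bin, load_fasttext_vectors('cc.it.300.bin') should return a vectors object on which most_similar(positive=['sole'], topn=10) can be called exactly as in the question, and which can also produce vectors for words that are not in the vocabulary.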

Finally, given the source of this training data (Common Crawl text from the open web, which may contain lots of typos and junk), these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
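If only "real-ish" words are wanted, one option this answer does not propose is to post-filter the neighbour list yourself. The sketch below is a simple heuristic, not part of gensim: the function name keep_wordlike is made up, and the input is the exact list printed in the question.

```python
def keep_wordlike(results, topn=10):
    """Keep only (word, score) pairs whose word consists purely of letters."""
    wordlike = [(word, score) for word, score in results if word.isalpha()]
    return wordlike[:topn]


# Output of most_similar() copied from the question.
raw = [('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835),
       ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739),
       ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437),
       ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228),
       ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]

print(keep_wordlike(raw))  # [('splende', 0.6336565613746643)]
```

Because most of the neighbours get discarded, in practice you would probably request a much larger topn from most_similar() before filtering. Note that str.isalpha() also rejects legitimate tokens containing hyphens or apostrophes, so the test may need loosening for some vocabularies.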

You might find it helpful to try the restrict_vocab argument of most_similar(), to receive results only from the leading (most frequent) part of all known word vectors. For example, to get results only from among the top 50000 words:

words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)

Picking the right value for restrict_vocab might help in practice to leave out long-tail "junk" words, while still providing the real/common similar words you seek.
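To make the effect of restrict_vocab concrete, here is a toy re-implementation in plain Python. It is not gensim's code: the function toy_most_similar and the 2-d vectors are invented for illustration, and the only behaviour borrowed from gensim is that the vocabulary is ordered most-frequent first and that restrict_vocab limits the candidates to the first N entries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(words, vectors, query, topn=10, restrict_vocab=None):
    """words is ordered most-frequent first; vectors[i] belongs to words[i]."""
    query_vec = vectors[words.index(query)]
    # Only the first `limit` (most frequent) words are considered as candidates.
    limit = len(words) if restrict_vocab is None else restrict_vocab
    scored = [
        (word, cosine(query_vec, vec))
        for word, vec in zip(words[:limit], vectors[:limit])
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors: the rare "junk" tokens sit closest to "sole" but come last.
words = ["sole", "luna", "splende", "sole.", "sole.Ma"]
vectors = [[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.99, 0.1], [0.98, 0.2]]

print([w for w, _ in toy_most_similar(words, vectors, "sole", topn=2)])
print([w for w, _ in toy_most_similar(words, vectors, "sole", topn=2,
                                      restrict_vocab=3)])
```

Without the restriction the two rare tokens win; with restrict_vocab=3 they are never scored, so the frequent words "splende" and "luna" are returned instead. The trade-off is the same on the real model: too small a value can also hide rarer but genuine neighbours.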
