How to cluster words and phrases with a pre-trained model in Gensim
Question
What I want is to cluster words and phrases, e.g. knitting / knit loom / loom knitting / weaving loom / rainbow loom / home decoration accessories / loom knit / knitting loom / ... I don't have a corpus, only the words and phrases themselves. Could I use a pre-trained model, such as the GoogleNews or Wikipedia ones, to do this?
I am now trying to use Gensim to load the GoogleNews pre-trained model and get phrase similarity. I've been told that the GoogleNews model includes vectors for phrases as well as words. But I find that I can only get word similarity; phrase similarity fails with an error message saying that the phrase is not in the vocabulary. Please advise. Thank you.
from gensim.models import KeyedVectors

GOOGLE_MODEL = '../GoogleNews-vectors-negative300.bin'
# Load the pre-trained vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format(GOOGLE_MODEL, binary=True)

# Works: returns the three nearest neighbours of "computer"
model.most_similar("computer", topn=3)

# Fails with an error: "computer_software" is not in the vocabulary
model.most_similar("computer_software", topn=3)
Recommended answer
The GoogleNews set does include many multi-word phrases, created via some statistical analysis, but it might not include the specific one you're hoping for, like 'computer_software'.
On the other hand, I see an online word list suggesting that a phrase like 'composite_fillings' is in the GoogleNews vocabulary, so this will likely work for you:
model.most_similar("composite_fillings", topn=3)
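
Since a missing phrase raises an error, it can help to check membership before querying. The following is a minimal sketch, not part of the original answer: the helper name `split_by_vocab` and the toy vocabulary are hypothetical, and the toy set stands in for the loaded model only so that the example runs without the large model file. It relies on the `in` operator, which a Gensim `KeyedVectors` object supports.

```python
def split_by_vocab(phrases, vocab):
    """Partition phrases into tokens that have their own vector and tokens that do not.

    `vocab` is anything that supports the `in` operator. A loaded
    KeyedVectors model can be passed here in place of the toy set below.
    """
    found, missing = [], []
    for phrase in phrases:
        # Multi-word phrases in the GoogleNews set are joined with underscores.
        token = "_".join(phrase.split())
        if token in vocab:
            found.append(token)
        else:
            missing.append(token)
    return found, missing


# Hypothetical stand-in vocabulary (an assumption, not the real GoogleNews list).
toy_vocab = {"computer", "software", "composite_fillings", "knitting", "loom"}

phrases = ["computer software", "composite fillings", "knitting"]
found, missing = split_by_vocab(phrases, toy_vocab)
print("has a vector:", found)
print("not in vocabulary:", missing)
```

With the real model, passing `model` instead of `toy_vocab` should report which of your phrases can be given to `most_similar` directly. The lookup is case-sensitive, so the capitalisation of a phrase may matter.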
With that vector set, you're limited to what its creators chose to model as phrases. If you need similarly strong vectors for other phrases, you'd likely need to train your own model on a corpus in which the phrases important to you have been combined into single tokens. (If you just need something better than nothing, averaging the word vectors of a phrase's constituent words would give you something to work with... but that's a pretty crude stand-in for truly modeling the bigram/multigram against its own unique contexts.)
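
The averaging fallback mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: `phrase_vector` and `cosine` are hypothetical helper names, and the three-dimensional toy vectors are made up so that the example runs without the real model. The helpers only use `in` and `[]` lookups, which `KeyedVectors` also supports, and are written in plain Python to stay dependency-free; with the real model one would more typically average the arrays with NumPy.

```python
import math


def phrase_vector(phrase, vectors):
    """Average the vectors of the phrase's words that are in the vocabulary.

    Returns None when none of the words is known.
    """
    words = phrase.replace("_", " ").split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    # Component-wise mean across the known word vectors.
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up toy vectors (an assumption); the real model has 300 dimensions.
toy_vectors = {
    "knitting": [1.0, 0.0, 0.0],
    "loom": [0.0, 1.0, 0.0],
    "home": [0.0, 0.0, 1.0],
}

v1 = phrase_vector("knitting loom", toy_vectors)
v2 = phrase_vector("loom knitting", toy_vectors)
v3 = phrase_vector("home", toy_vectors)
print(cosine(v1, v2))  # word order is ignored, so this is close to 1.0
print(cosine(v1, v3))  # no shared direction in the toy data
```

Note the limitation this makes visible: "knitting loom" and "loom knitting" get the same vector, because averaging discards word order. The resulting vectors could then be passed to whatever clustering method you prefer.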