How to cluster words and phrases with a pre-trained model in Gensim

Question

What I want is to cluster words and phrases, e.g. knitting / knit loom / loom knitting / weaving loom / rainbow loom / home decoration accessories / loom knit / knitting loom / ... I don't have a corpus; I only have the words and phrases themselves. Could I use a pre-trained model, such as the ones trained on GoogleNews or Wikipedia, to do this?

I am now trying to use Gensim to load the pre-trained GoogleNews model to get phrase similarities. I've been told that the GoogleNews model includes vectors for phrases as well as words. However, I can only get word similarity; phrase similarity fails with an error message saying that the phrase is not in the vocabulary. Please advise. Thank you.

from gensim.models import KeyedVectors

GOOGLE_MODEL = '../GoogleNews-vectors-negative300.bin'

# Load the pre-trained GoogleNews vectors (binary word2vec format)
model = KeyedVectors.load_word2vec_format(GOOGLE_MODEL, binary=True)

# Works: "computer" is in the vocabulary
model.most_similar("computer", topn=3)

# Fails with an error saying "computer_software" is not in the vocabulary
model.most_similar("computer_software", topn=3)

Recommended answer

The GoogleNews set does include many multi-word phrases, created via some statistical analysis, but it might not include the specific phrase you're hoping for, such as 'computer_software'.

On the other hand, an online word list suggests that a phrase like 'composite_fillings' is in the GoogleNews vocabulary, so this will likely work for you:

model.most_similar("composite_fillings", topn=3) 
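Because a missing phrase raises an error, it can help to check membership before querying. The sketch below is only an illustration: `split_by_vocabulary` and `toy_vocab` are hypothetical names, and the toy set stands in for the real vocabulary. A loaded gensim `KeyedVectors` object supports the same `in` test, so it could be passed in place of the set.

```python
def split_by_vocabulary(phrases, vocab):
    """Separate phrases into those present in `vocab` and those missing.

    Multi-word phrases are looked up in underscore-joined form, which is how
    the phrases in the question ("computer_software") are written.
    """
    found, missing = [], []
    for phrase in phrases:
        token = phrase.strip().replace(" ", "_")
        (found if token in vocab else missing).append(token)
    return found, missing


# Toy stand-in for the model's vocabulary (an assumption for illustration);
# with gensim, pass the loaded model instead of this set.
toy_vocab = {"computer", "composite_fillings"}

found, missing = split_by_vocabulary(
    ["computer", "computer software", "composite fillings"], toy_vocab
)
print(found)    # ['computer', 'composite_fillings']
print(missing)  # ['computer_software']
```

Only the phrases in `found` are safe to pass to `most_similar`; the ones in `missing` need another strategy.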

With that vector set, you're limited to what its creators chose to model as phrases. If you need similarly strong vectors for other phrases, you'd likely need to train your own model on a corpus in which the phrases important to you have been combined into single tokens. (If you just need something better than nothing, averaging the word vectors of the constituent words would give you something to work with... but that's a pretty crude stand-in for truly modeling the bigram/multigram against its own unique contexts.)
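The averaging fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the answer author's code: `phrase_vector`, `cosine`, and `group_phrases` are hypothetical helpers, the greedy threshold grouping is just one simple way to cluster, and the 2-dimensional `toy` vectors are invented. `vectors` can be any mapping from token to a sequence of floats; a gensim `KeyedVectors` object supports the same `in` and `[]` operations.

```python
import math


def phrase_vector(phrase, vectors):
    """Return a vector for `phrase`, or None if no part of it is known."""
    token = phrase.replace(" ", "_")
    if token in vectors:  # the phrase itself was modeled as a single token
        return list(vectors[token])
    known = [vectors[w] for w in phrase.replace("_", " ").split() if w in vectors]
    if not known:
        return None
    # Crude fallback: element-wise mean of the constituent word vectors
    return [sum(dims) / len(known) for dims in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_phrases(phrases, vectors, threshold=0.8):
    """Greedy grouping: a phrase joins the first group whose seed is similar enough."""
    groups = []  # list of (seed_vector, member_phrases)
    for phrase in phrases:
        vec = phrase_vector(phrase, vectors)
        if vec is None:
            continue  # nothing known about this phrase; skip it
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(phrase)
                break
        else:
            groups.append((vec, [phrase]))
    return [members for _, members in groups]


# Toy 2-dimensional vectors, invented for illustration only
toy = {
    "knitting": [1.0, 0.1],
    "loom": [0.9, 0.2],
    "home": [0.1, 1.0],
    "decoration": [0.2, 0.9],
    "knitting_loom": [0.95, 0.15],
}
print(group_phrases(["knitting loom", "loom knitting", "home decoration"], toy))
```

With the toy data, "knitting loom" is found as a single token, "loom knitting" falls back to the average of its two words, and the two end up in the same group, while "home decoration" forms its own. On real vectors the threshold would need tuning, and a standard clustering algorithm over the phrase vectors may well be a better choice than this greedy pass.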
