How to cluster words and phrases with a pre-trained model in Gensim

Views: 309 | Published: 2020/11/13 6:22:45 | Tags: gensim, word2vec

Question

What I want is to cluster words and phrases, e.g. knitting / knit loom / loom knitting / weaving loom / rainbow loom / home decoration accessories / loom knit / knitting loom / ... I don't have a corpus; I only have the words and phrases themselves. Could I use a pre-trained model, like the ones from GoogleNews or Wikipedia, to achieve this?

I am now trying to use Gensim to load the GoogleNews pre-trained model to get phrase similarity. I've been told that the GoogleNews model includes vectors for phrases as well as words. But I find that I can only get word similarity; phrase similarity fails with an error message saying that the phrase is not in the vocabulary. Please advise. Thank you.

from gensim.models import KeyedVectors

GOOGLE_MODEL = '../GoogleNews-vectors-negative300.bin'

# Load the pre-trained GoogleNews vectors (word2vec binary format)
model = KeyedVectors.load_word2vec_format(GOOGLE_MODEL, binary=True)

# Works: "computer" is in the vocabulary
model.most_similar("computer", topn=3)

# Fails with an error saying "computer_software" is not in the vocabulary
model.most_similar("computer_software", topn=3)

Recommended answer

The GoogleNews set does include many multi-word phrases, created via some statistical analysis, but it might not include the specific one you're hoping for, like 'computer_software'.

On the other hand, I see an online word list suggesting that a phrase like 'composite_fillings' is in the GoogleNews vocabulary, so this will likely work for you:

model.most_similar("composite_fillings", topn=3)
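Before querying, it may help to check which candidate phrases exist as single underscore-joined tokens at all. The sketch below is only an illustration: `split_by_vocabulary` is a hypothetical helper, and `toy_vectors` is a made-up stand-in for the loaded model. It assumes nothing more than that the object supports the `in` operator for key lookup, which Gensim's `KeyedVectors` is expected to do.

```python
def split_by_vocabulary(vectors, phrases):
    """Split phrases into tokens found in `vectors` and tokens that are missing.

    Each phrase is converted to the underscore-joined form used for
    multi-word entries (e.g. "composite fillings" -> "composite_fillings").
    """
    found, missing = [], []
    for phrase in phrases:
        token = phrase.replace(" ", "_")
        if token in vectors:
            found.append(token)
        else:
            missing.append(token)
    return found, missing


# Hypothetical stand-in for the real model; the values are placeholders.
toy_vectors = {
    "computer": [0.1, 0.2],
    "composite_fillings": [0.3, 0.4],
}

found, missing = split_by_vocabulary(
    toy_vectors, ["composite fillings", "computer software", "computer"]
)
print("in vocabulary:", found)
print("not in vocabulary:", missing)
```

With the real vectors, passing `model` in place of `toy_vectors` should make it possible to call `most_similar` only on the tokens in `found`, rather than hitting the not-in-vocabulary error.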

With that vector set, you're limited to what its creators chose to model as phrases. If you need similarly strong vectors for other phrases, you'd likely need to train your own model on a corpus where the phrases important to you have been combined into single tokens. (If you just need something better than nothing, averaging together the constituent words' word-vectors would give you something to work with... but that's a pretty crude stand-in for truly modeling the bigram/multigram against its unique contexts.)
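The averaging fallback mentioned above could be sketched roughly as follows. This is a minimal illustration, not the answer author's code: `phrase_vector` and `cosine` are hypothetical helpers, and `toy_vectors` holds made-up 3-dimensional vectors in place of the real 300-dimensional GoogleNews ones. The only assumption about the model is that it supports `in` and `[]` lookups by key, as Gensim's `KeyedVectors` is expected to.

```python
from math import sqrt


def phrase_vector(vectors, phrase):
    """Return a vector for `phrase`, or None if no part of it is known.

    Prefers a dedicated phrase token (words joined by "_"); otherwise
    falls back to the plain average of the known constituent word vectors.
    """
    token = phrase.replace(" ", "_")
    if token in vectors:
        return [float(x) for x in vectors[token]]
    rows = [vectors[w] for w in phrase.split() if w in vectors]
    if not rows:
        return None
    # Average each dimension across the constituent word vectors.
    return [sum(float(x) for x in column) / len(rows) for column in zip(*rows)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors, invented purely for illustration.
toy_vectors = {
    "knitting": [1.0, 0.0, 0.0],
    "loom": [0.0, 1.0, 0.0],
    "decoration": [0.0, 0.0, 1.0],
    "rainbow_loom": [0.0, 1.0, 0.0],
}

knitting_loom = phrase_vector(toy_vectors, "knitting loom")  # averaged words
rainbow_loom = phrase_vector(toy_vectors, "rainbow loom")    # dedicated token
print(knitting_loom)
print(cosine(knitting_loom, rainbow_loom))
```

Vectors built this way for every word and phrase in the list could then be passed to a clustering algorithm of your choice. As the answer notes, averaged vectors are a crude approximation, so clusters involving out-of-vocabulary phrases should be treated with some caution.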
