How to calculate sentence similarity using gensim's word2vec model with Python


Problem description

According to the Gensim Word2Vec documentation, I can use the word2vec model in the gensim package to calculate the similarity between two words.

For example:

trained_model.similarity('woman', 'man') 
0.73723527

However, the word2vec model fails to predict sentence similarity. I found the LSI model with sentence similarity in gensim, but it doesn't seem like it can be combined with the word2vec model. The corpus of each sentence I have is not very long (shorter than 10 words). So, are there any simple ways to achieve the goal?

Answer

This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences/relationships in lots of real textual examples, etc.

The simplest thing you could try -- though I don't know how well this would perform, and it would certainly not give you optimal results -- would be to first remove all "stop" words (words like "the", "an", etc. that don't add much meaning to the sentence), then run word2vec on the words in both sentences, sum up the vectors in one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up instead of doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness).
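A minimal sketch of this sum-and-compare approach. The toy vectors and stop-word list below are hypothetical stand-ins (in practice the vectors would come from a trained gensim model, e.g. `model.wv`), and cosine similarity of the sums is shown here as a common variant of comparing the sums directly:

```python
import numpy as np

# Hypothetical toy vectors standing in for a trained word2vec model.
VECTORS = {
    "walked": np.array([0.9, 0.1, 0.0]),
    "store": np.array([0.1, 0.8, 0.3]),
    "yesterday": np.array([0.2, 0.2, 0.9]),
    "went": np.array([0.85, 0.15, 0.05]),
    "shop": np.array([0.15, 0.75, 0.35]),
}

STOP_WORDS = {"he", "she", "the", "a", "an", "to"}

def sentence_vector(sentence):
    """Sum the vectors of all non-stop-words found in the sentence."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return sum(VECTORS[w] for w in words if w in VECTORS)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = "he walked to the store yesterday"
s2 = "yesterday he walked to the store"
s3 = "he went to the shop"

v1, v2, v3 = map(sentence_vector, (s1, s2, s3))
print(cosine_similarity(v1, v2))  # 1.0 -- word order is ignored entirely
print(cosine_similarity(v1, v3))  # high, but below 1.0
```

Note how the first pair scores a perfect 1.0: summing discards word order completely, which is both the strength and the main weakness of this baseline.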

So, the short answer is: no, there's no easy way to do this (at least, not to do it well).

