spaCy: strange similarity between two sentences

Question

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

# Load the large English model, which ships with word vectors
nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

# Cosine similarity between the two averaged document vectors
print(main_doc.similarity(search_doc))

This returns a rather strange value:

0.9066019751888448

These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?

Recommended answer

The spaCy documentation on vector similarity explains the basic idea:
Each word has a vector representation, learned from the contexts in which the word occurs in a large training corpus (Word2Vec-style word embeddings), as explained in the documentation.

Now, the vector of a full sentence is simply the average of the vectors of all its words. If many of those words lie in the same region of the vector space (for example filler words like "he", "was", "this", ...), and the contributions of the remaining, more distinctive vocabulary "cancel out", then the two averages end up close together and you get a similarity like the one in your case.
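The averaging effect can be reproduced without spaCy. The following is a minimal sketch, not the library's implementation: the 3-dimensional vectors in TOY_VECTORS are invented for illustration (real en_core_web_lg vectors have 300 dimensions), and the helpers mean_vector and cosine are hypothetical names. The filler words are deliberately placed close together, while the two content words point in nearly orthogonal directions.

```python
import math

# Hypothetical 3-dimensional "word vectors", invented for illustration only.
TOY_VECTORS = {
    "this": [1.00, 0.00, 0.00],
    "he": [1.00, 0.10, 0.00],
    "was": [0.90, 0.00, 0.10],
    "a": [0.95, 0.05, 0.05],
    "argument": [0.20, 0.00, 1.00],
    "japan": [0.20, 1.00, 0.00],
}


def mean_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Two "sentences" that share filler words but differ in their content word.
sentence_a = ["this", "was", "a", "argument"]
sentence_b = ["he", "was", "a", "japan"]

sim_sentences = cosine(mean_vector(sentence_a), mean_vector(sentence_b))
sim_content = cosine(TOY_VECTORS["argument"], TOY_VECTORS["japan"])

print(f"averaged sentences: {sim_sentences:.2f}")
print(f"content words only: {sim_content:.2f}")
```

With these made-up numbers the averaged sentence vectors come out far more similar than the two content words on their own, because the shared filler words dominate both averages.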

The real question is what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since search_doc and main_doc carry additional information, such as the original sentence, you could modify the vectors by a length-difference penalty, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (then again, the question would be which pieces to compare).
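Both ideas can be sketched in a few lines. This is only one possible reading of the suggestions, under stated assumptions: the penalty is applied to the similarity score rather than to the vectors themselves, the ratio of shorter to longer length is an arbitrary choice, and the function names, piece vectors, and lengths below are hypothetical. Plain Python lists stand in for the vectors; with spaCy, the pieces could for instance be sentences or noun chunks of a Doc, each of which exposes a .vector attribute.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def length_penalized_similarity(sim, len_a, len_b):
    """Scale a similarity score by the ratio of the shorter to the longer length.

    The ratio is an assumed, ad hoc penalty, not something spaCy provides.
    """
    return sim * min(len_a, len_b) / max(len_a, len_b)


def best_match_similarity(pieces_a, pieces_b):
    """For every piece vector in A, take its best cosine match in B, then average."""
    best = [max(cosine(a, b) for b in pieces_b) for a in pieces_a]
    return sum(best) / len(best)


# Hypothetical lengths (in tokens) for a short query and a long sentence.
penalized = length_penalized_similarity(0.9066, 10, 29)

# Hypothetical piece vectors: only the first piece of A has a close match in B.
pieces_a = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pieces_b = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 1.0, 0.0]]
pairwise = best_match_similarity(pieces_a, pieces_b)

print(f"length-penalized: {penalized:.2f}")
print(f"best-match average: {pairwise:.2f}")
```

Note that best_match_similarity is not symmetric: swapping the arguments generally changes the result, so one would have to decide which document plays the role of the query. Whether either measure is more reasonable than the plain average depends on the data.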

Sadly, there is currently no clean way to simply resolve this issue.
