Spacy, Strange similarity between two sentences


Question

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

This returns a very strange value:

0.9066019751888448

These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?

Recommended answer

The Spacy documentation for vector similarity explains the basic idea:
Each word has a vector representation, learned by a word-embedding algorithm such as Word2Vec and trained on a large corpus, as explained in the documentation. These vectors are static, not contextual: a word gets the same vector regardless of the sentence it appears in.

Now, the vector of a full sentence is simply the average of the vectors of all its words, and similarity compares those averages (cosine similarity by default). If both sentences contain many words that lie in the same region of the vector space (for example filler words like "he", "was", "this", ...), and the remaining, more distinctive vocabulary "cancels out" in the average, then you can end up with a similarity as high as the one in your case.
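The averaging effect can be reproduced without any model. The following is a minimal sketch in plain Python; the three-dimensional vectors and the names FILLER, TOPIC_A and TOPIC_B are made up for illustration and are not taken from en_core_web_lg. Two sentences share four filler words and differ in one content word whose vectors are completely unrelated.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption: real word vectors have ~300 dimensions).
FILLER = [0.9, 0.1, 0.0]   # stands in for words like "he", "was", "this"
TOPIC_A = [0.0, 1.0, 0.0]  # the distinctive word of sentence A
TOPIC_B = [0.0, 0.0, 1.0]  # the distinctive word of sentence B

sentence_a = [FILLER] * 4 + [TOPIC_A]
sentence_b = [FILLER] * 4 + [TOPIC_B]

# The content words alone are orthogonal ...
content_only = cosine(TOPIC_A, TOPIC_B)
# ... but the averaged sentence vectors are dominated by the shared filler words.
averaged = cosine(mean_vector(sentence_a), mean_vector(sentence_b))

print(content_only, averaged)
```

Here the content words have a similarity of 0, yet the averaged sentence vectors score above 0.9, which mirrors the behaviour seen in the question.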

The question, rightly, is what you can do about it. From my perspective, you could come up with a more complex similarity measure. Since search_doc and main_doc retain additional information, such as the original sentence, you could adjust the result with a length-difference penalty, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (though the question then becomes which parts to compare).
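As a rough sketch of those two ideas, the code below computes a pairwise score over individual word vectors and a simple length penalty. The best-match averaging scheme, the ratio used as a penalty, and the function names pairwise_similarity and length_penalty are my own assumptions for illustration; they are not part of spaCy and not something the answer prescribes.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_similarity(vectors_a, vectors_b):
    """Average best-match cosine similarity, taken in both directions.

    Each vector is matched to its most similar vector on the other side,
    so shared filler words no longer dominate one global average.
    """
    if len(vectors_a) == 0 or len(vectors_b) == 0:
        return 0.0

    def directed(src, dst):
        return sum(max(cosine(u, v) for v in dst) for u in src) / len(src)

    return 0.5 * (directed(vectors_a, vectors_b) + directed(vectors_b, vectors_a))

def length_penalty(len_a, len_b):
    """Hypothetical penalty: ratio of shorter to longer length, in [0, 1]."""
    longer = max(len_a, len_b)
    return min(len_a, len_b) / longer if longer else 0.0

# Toy 2-d vectors standing in for the content words of two sentences.
words_a = [[1.0, 0.0]]
words_b = [[0.0, 1.0], [0.0, 1.0]]

score = pairwise_similarity(words_a, words_b) * length_penalty(len(words_a), len(words_b))
print(score)
```

With spaCy, the input lists could be built from the tokens of each Doc, for example `[t.vector for t in main_doc if t.has_vector and not t.is_stop and not t.is_punct]`, so that stop words are dropped before comparing. Which tokens to keep is exactly the open question the answer raises.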

Sadly, for now there is no clean way to simply resolve this issue.
