Is there a simple way to tell SpaCy to ignore stop words when using .similarity method?


Problem description

So right now I have a really simple program that takes a sentence, finds the most semantically similar sentence in a given book, and prints out that sentence along with the next few sentences.

import spacy
nlp = spacy.load('en_core_web_lg')

# load Alice in Wonderland (Project Gutenberg etext #11)
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(11)).strip()

alice = nlp(text)

sentences = list(alice.sents)

mysent = nlp("example sentence, could be whatever")

# scan every sentence and keep the one most similar to mysent
best_match = None
best_similarity_value = 0
for sent in sentences:
    similarity = sent.similarity(mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent

# print the best match plus the nine sentences that follow it
print(sentences[sentences.index(best_match):sentences.index(best_match) + 10])

I want to get better results by telling SpaCy to ignore stop words during this process, but I don't know the best way to go about it. For example, I could create a new blank list and append each word that isn't a stop word to it:

newlist = []
for sentence in sentences:
    for word in sentence:
        if not word.is_stop:  # is_stop is a bool; comparing it to the string 'False' never matches
            newlist.append(word)

but I would have to make it more complicated than the code above, because I would need to keep the original list of sentences intact (the indexes have to stay the same if I want to print out the full sentences later). Plus, if I did it this way, I would have to run this new list of lists back through SpaCy in order to use the .similarity method.

I feel like there must be a better way of going about this, and I'd really appreciate any guidance. Even if there isn't a better way than appending each non-stop word to a new list, I'd appreciate any help in creating a list of lists so that the indexes will be identical to the original "sentences" variable.
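
For example, something along these lines might keep the indexes aligned, at the cost of re-running each filtered sentence through the pipeline (a rough sketch; filtered_sentences is just an illustrative name):

# Parallel list: filtered_sentences[i] corresponds to sentences[i], with
# stop words dropped and the text re-run through nlp so .similarity works.
# Note: a sentence made entirely of stop words becomes an empty doc.
filtered_sentences = [
    nlp(" ".join(word.text for word in sentence if not word.is_stop))
    for sentence in sentences
]

best_index = max(
    range(len(filtered_sentences)),
    key=lambda i: filtered_sentences[i].similarity(mysent),
)
print(sentences[best_index:best_index + 10])  # indexes match the originals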

Thanks so much!

Solution

What you need to do is override the way spaCy computes similarity.
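
(As an aside, spaCy can route doc.similarity through a custom function via its user hooks; here is a minimal sketch, assuming spaCy 2.x's add_pipe signature and the compute_similarity function defined further down:)

# Hypothetical pipeline component: make doc.similarity use custom logic
def stopword_aware_similarity(doc):
    doc.user_hooks["similarity"] = compute_similarity  # defined below
    return doc

nlp.add_pipe(stopword_aware_similarity, last=True)  # spaCy 2.x style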

For similarity computation, spaCy first computes a vector for each doc by averaging the vectors of its tokens (the token.vector attribute), and then performs cosine similarity by doing:

return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
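
You can verify the averaging step yourself: doc.vector should match the mean of the token vectors (a quick sanity check, assuming a model that ships with word vectors, such as en_core_web_lg):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("This is a sentence")

# Doc.vector defaults to the average of the token vectors
mean_of_tokens = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, mean_of_tokens, atol=1e-5))  # expected: True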

You have to tweak this a bit so that the vectors of stop words are not taken into account.

The following code should work for you:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import numpy as np

nlp = spacy.load('en_core_web_lg')
doc1 = nlp("This is a sentence")
doc2 = nlp("This is a baby")

def compute_similarity(doc1, doc2):
    # en_core_web_lg vectors are 300-dimensional
    vector1 = np.zeros(300)
    vector2 = np.zeros(300)
    # sum the vectors of non-stop tokens only (lowercase the text first,
    # since STOP_WORDS contains lowercase entries)
    for token in doc1:
        if token.text.lower() not in STOP_WORDS:
            vector1 = vector1 + token.vector
    vector1 = np.divide(vector1, len(doc1))
    for token in doc2:
        if token.text.lower() not in STOP_WORDS:
            vector2 = vector2 + token.vector
    vector2 = np.divide(vector2, len(doc2))
    # cosine similarity (the division by len() above only rescales the
    # vectors, which does not change the cosine)
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

print(compute_similarity(doc1, doc2))
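
To wire this back into the sentence-search loop from the question, you can pass each sentence Span straight into compute_similarity, since Spans are iterable and have a length just like Docs (a sketch reusing sentences and mysent from the question):

# Same search loop as the question, with the stop-word-aware metric
best_match = None
best_similarity_value = 0.0
for sent in sentences:
    similarity = compute_similarity(sent, mysent)
    if similarity > best_similarity_value:
        best_similarity_value = similarity
        best_match = sent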

Hope it helps!
