Sklearn cosine similarity for strings, Python


Question

I am writing an algorithm that measures how similar one string is to another. I am using scikit-learn's cosine similarity.

My code is:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(example_1)
result_cos = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(result_cos[0][1])

Running this code for example_1 prints 0.336096927276. Running it for example_2 prints the same score. The result is the same in both cases because each pair differs by exactly one word.
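To make the reason for the identical scores concrete, here is a minimal sketch that uses no scikit-learn at all. It assumes the default token pattern documented for TfidfVectorizer (words of two or more characters, lowercased); the helper names `word_tokens` and `shared_and_distinct` are made up for this illustration. It only lists which word tokens each pair shares, and does not reproduce the TF-IDF weighting.

```python
import re

# Assumed to mirror TfidfVectorizer's documented default token pattern:
# tokens are runs of two or more word characters, so "I" is dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def word_tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(TOKEN_PATTERN.findall(text.lower()))


def shared_and_distinct(pair):
    """Return (tokens in both strings, tokens in exactly one string)."""
    first, second = (word_tokens(text) for text in pair)
    return first & second, first ^ second


example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

for pair in (example_1, example_2):
    shared, distinct = shared_and_distinct(pair)
    print(sorted(shared), sorted(distinct))
```

Both pairs share one token and have one unmatched token on each side. A word-level model treats "okeu" and "crazy" as equally unrelated to "okey", so it has no way to score the pairs differently.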

What I want is a higher score for example_1, because the differing words, "okey" and "okeu", are only one letter apart. In example_2, by contrast, the differing words "okey" and "crazy" are completely different.

How can my code take into account that, in some cases, the differing words are not completely different?

Recommended answer

For short strings, Levenshtein distance will probably yield better results than word-based cosine similarity. The algorithm below is adapted from Wikibooks. It returns the edit distance divided by the length of the longer string, and because this is a distance metric, a smaller score means the strings are more alike.

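def levenshtein(s1, s2):
    # Make s1 the longer string so each row has len(s2) + 1 entries.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # Comparing against an empty string: every character of s1 is an edit.
    # Return the normalized value so the result stays on the same scale.
    if len(s2) == 0:
        return 1.0 if s1 else 0.0

    # previous_row[j] is the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    # Normalize by the length of the longer string to get a score in [0, 1].
    return previous_row[-1] / float(len(s1))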

example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

print(levenshtein(*example_1))  # 1 edit over 9 characters, about 0.11
print(levenshtein(*example_2))  # 4 edits over 10 characters, 0.4
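If a similarity score where higher is better is preferred, as in the question, one option is to keep cosine similarity but compare character n-grams instead of words. The sketch below is not part of the original answer. It is a plain-Python illustration using raw bigram counts rather than TF-IDF weights, and the function names `char_ngrams` and `ngram_cosine` are invented for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count the overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine(s1, s2, n=2):
    """Cosine similarity between the n-gram count vectors of two strings."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    # Dot product over the n-grams of s1; Counter returns 0 for missing keys.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    # Strings shorter than n have no n-grams; treat that as no similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


example_1 = ("I am okey", "I am okeu")
example_2 = ("I am okey", "I am crazy")

print(ngram_cosine(*example_1))  # roughly 0.875
print(ngram_cosine(*example_2))  # roughly 0.471
```

Because "okey" and "okeu" share the bigrams "ok" and "ke", example_1 now scores higher than example_2. With scikit-learn, a similar effect can probably be had by passing analyzer="char_wb" and an ngram_range such as (2, 3) to TfidfVectorizer, though the exact scores will differ from this sketch because of TF-IDF weighting and word-boundary padding.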
