给定2个句子字符串,计算余弦相似度 [英] Calculate cosine similarity given 2 sentence strings

查看:207
本文介绍了给定2个句子字符串,计算余弦相似度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

> Python:tf-idf-cosine:查找文档相似度,可以使用tf-idf余弦来计算文档相似度.无需导入外部库,是否有任何方法可以计算两个字符串之间的余弦相似度?

From Python: tf-idf-cosine: to find document similarity , it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are that any ways to calculate cosine similarity between 2 strings?

s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

cosine_sim(s1, s2) # Should give high cosine similarity
cosine_sim(s1, s3) # Shouldn't give high cosine similarity value
cosine_sim(s2, s3) # Shouldn't give high cosine similarity value

推荐答案

一个简单的纯Python实现是:

A simple pure-Python implementation would be:

import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)

打印:

Cosine: 0.861640436855

此处中描述了此处使用的余弦公式.

The cosine formula used here is described here.

这不包括tf-idf对单词的加权,但是要使用tf-idf,您需要有一个相当大的语料库才能从中估计tfidf的权重.

This does not include weighting of the words by tf-idf, but in order to use tf-idf, you need to have a reasonably large corpus from which to estimate tfidf weights.

您还可以通过使用更复杂的方法从一段文本中提取单词,对其进行词干或词义化等方式来进一步开发它.

You can also develop it further, by using a more sophisticated way to extract words from a piece of text, stem or lemmatise it, etc.

这篇关于给定2个句子字符串,计算余弦相似度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆