Is there any solution to get a similarity score between lists of words?

Views: 42 · Posted: 2021/5/31 20:47:33 · Tags: python numpy math similarity cosine-similarity


Question

I want to calculate the similarity between lists of words, for example:

import math
from collections import Counter

test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee', 'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

def counter_cosine_similarity(c1, c2):
    # Treat both Counters as term-frequency vectors over the union of their terms.
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

counter1 = Counter(test)
counter2 = Counter(list_a)
counter3 = Counter(list_b)
score = counter_cosine_similarity(counter1, counter2)
print(score)  # output: 0.4472135954999579
score = counter_cosine_similarity(counter1, counter3)
print(score)  # output: 0.4999999999999999

This is not the score I want; the ranking should be the other way round. list_a contains both address and ip, so it matches test 100%. I understand that cosine similarity compares test and list_a in both directions, so the elements of list_a that are not in test pull the score down. What I want is a one-way comparison: how well test is covered by list_a, not the reverse.
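To make the effect described above concrete, here is a small sketch that breaks the cosine score into its parts for the two lists from the question. The helper name `cosine_parts` is hypothetical and not part of the original code; it only separates the dot product from the two magnitudes.

```python
import math
from collections import Counter

test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee',
          'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

def cosine_parts(words_1, words_2):
    """Return (dot product, magnitude of words_1, magnitude of words_2)."""
    c1, c2 = Counter(words_1), Counter(words_2)
    # Only terms present in both lists contribute to the dot product.
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    mag_1 = math.sqrt(sum(v * v for v in c1.values()))
    mag_2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot, mag_1, mag_2

for name, words in (('list_a', list_a), ('list_b', list_b)):
    dot, mag_1, mag_2 = cosine_parts(test, words)
    print(name, 'dot =', dot, 'magnitude =', round(mag_2, 3))
```

For list_a the dot product is 2 (both test terms are found), but its magnitude is the square root of 10 because of the eight extra terms, and that larger denominator is what lowers the score. For list_b the dot product is only 1, yet the magnitude is the square root of 2, so the ratio ends up slightly higher.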

Desired output

score = counter_cosine_similarity(counter1, counter2)
print(score)  # desired: higher than the list_b score, perhaps 1.0
score = counter_cosine_similarity(counter1, counter3)
print(score)  # desired: lower than the list_a score, perhaps 0.5

Recommended answer

If you want the score to be higher the more terms the two lists share, use this code:

# list_x stands for whichever list is being compared with test (list_a or list_b)
score = len(set(test).intersection(set(list_x)))

That tells you how many terms the two lists have in common. If you want repeated terms to count for more, try:

commonTerms = set(test).intersection(set(list_x))
counter = Counter(list_x)
# Sum how often each common term occurs in list_x
score = sum(counter[term] for term in commonTerms)
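As a runnable sketch of the two scores above, applied to the lists from the question: the function names `common_term_count` and `weighted_common_score` are hypothetical wrappers introduced here for illustration, and the list with a repeated term is made-up sample data.

```python
from collections import Counter

test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee',
          'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

def common_term_count(query, candidate):
    """Number of distinct terms that appear in both lists."""
    return len(set(query) & set(candidate))

def weighted_common_score(query, candidate):
    """Like common_term_count, but each common term counts once per occurrence in candidate."""
    counts = Counter(candidate)
    return sum(counts[term] for term in set(query) & set(candidate))

print(common_term_count(test, list_a))   # 2
print(common_term_count(test, list_b))   # 1
# Made-up example with a repeated term: 'ip' occurs twice, 'address' once.
print(weighted_common_score(test, ['ip', 'ip', 'address']))  # 3
```

With these scores list_a now ranks above list_b, which is the ordering the question asks for, although the values are raw counts rather than numbers between 0 and 1.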

If you need to scale the score to [0..1], I would need to know more about your data sets.
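One possible way to scale the count to [0..1], offered only as a sketch and not as the answerer's method, is to divide the number of common terms by the number of distinct terms in test. This assumes that test is always the smaller "query" list and that duplicates should be ignored; the name `containment_score` is hypothetical.

```python
test = ['address', 'ip']
list_a = ['identifiant', 'ip', 'address', 'fixe', 'horadatee',
          'cookie', 'mac', 'machine', 'network', 'cable']
list_b = ['address', 'city']

def containment_score(query, candidate):
    """Fraction of distinct query terms that also occur in candidate (one-way comparison)."""
    query_terms = set(query)
    if not query_terms:
        # Assumption: an empty query matches nothing.
        return 0.0
    return len(query_terms & set(candidate)) / len(query_terms)

print(containment_score(test, list_a))  # 1.0
print(containment_score(test, list_b))  # 0.5
```

Because only the size of test appears in the denominator, extra terms in the candidate list do not lower the score, which gives the one-way behaviour described in the question. Whether that is appropriate depends on the data, as the answer notes.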
