Calculate similarity between lists of words

Question

I want to calculate the similarity between two lists of words. For example, this list:

['email','user','this','email','address','customer']

is similar to the following list:

['email','mail','address','netmail']

I want this pair to score a higher similarity percentage than a comparison against another list, for example ['address','ip','network'], even though 'address' appears in that list.

Recommended answer

Since you haven't shown a clear example of the expected output, here is my best shot:

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

For the two lists above, we compute the cosine similarity between each element of one list and every element of the other, e.g. 'email' from list_B against every element of list_A. Each word is represented by a vector of its character counts:

from collections import Counter
from math import sqrt

def word2vec(word):
    # Despite the name, this is not an embedding model: it builds a
    # character-count vector for the word.
    cw = Counter(word)
    # set of the distinct characters in the word
    sw = set(cw)
    # Euclidean length of the character-count vector
    lw = sqrt(sum(c*c for c in cw.values()))

    # return a tuple: (counts, distinct characters, length)
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # cosine similarity (not distance): dot product divided by both lengths
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]


list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

threshold = 0.80     # only needed for the optional filter below
for key in list_A:
    for word in list_B:
        res = cosdis(word2vec(word), word2vec(key))
        print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
        # if res > threshold:
        #     print("Found a word with cosine similarity > 80% : {} with original word: {}".format(word, key))

Output

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
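
To make the numbers above easier to check by hand, here is a minimal sketch that recomputes two of them directly from character counts. The helper name char_cosine is hypothetical (it is not part of the answer's code); it is meant to be equivalent to cosdis(word2vec(a), word2vec(b)).

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    # dot product over the characters the two words share
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(n * n for n in ca.values()))
    norm_b = sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b)

# "mail" and "email" share m, a, i, l: dot = 4, lengths are 2 and sqrt(5)
print(char_cosine("mail", "email"))  # 4 / (2 * sqrt(5)), about 0.894
# "mail" and "user" share no characters, so the dot product is 0
print(char_cosine("mail", "user"))   # 0.0
```

Because the vectors only count characters, the score reflects spelling overlap and says nothing about meaning.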




Note: I have also left the threshold part commented out in the code, in case you only want the word pairs whose similarity exceeds a certain threshold, e.g. 80%.
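
If only the pairs above the threshold are wanted, the filter might look like the sketch below. It assumes the same character-count similarity as the answer; the helper char_cosine and the matches list are hypothetical names introduced for illustration, and duplicates in list_A are skipped so each pair is reported once.

```python
from collections import Counter
from math import sqrt

def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca.keys() & cb.keys())
    norm_a = sqrt(sum(n * n for n in ca.values()))
    norm_b = sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
threshold = 0.80

matches = []
# dict.fromkeys removes duplicate words while keeping their order
for key in dict.fromkeys(list_A):
    for word in list_B:
        score = char_cosine(key, word)
        if score > threshold:
            matches.append((key, word, score))

for key, word, score in matches:
    print("{} ~ {}: {:.1f}%".format(key, word, score * 100))
```

With these two lists, the pairs that pass are email/email, email/mail, email/netmail and address/address.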

Edit

OP: But what I actually want is not a word-by-word comparison, but a list-by-list comparison.

Using Counter and math:

from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)


def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)

Output

53.03300858899106
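
To check this against the question's third list, the same function can be applied to ['address','ip','network']. This is a sketch under the assumption that whole-word counts are an acceptable representation; list_C, sim_AB and sim_AC are names introduced here for illustration.

```python
from collections import Counter
import math

def counter_cosine_similarity(c1, c2):
    # cosine similarity between two word-count vectors
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']  # the other list from the question

sim_AB = counter_cosine_similarity(Counter(list_A), Counter(list_B))
sim_AC = counter_cosine_similarity(Counter(list_A), Counter(list_C))

print(sim_AB * 100)  # about 53.03
print(sim_AC * 100)  # about 20.41
```

list_B scores higher than list_C, as the question asks. Note that this version matches whole words only, so 'mail' and 'netmail' contribute nothing to the score even though they resemble 'email'.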
