Compare similarity between names

Problem description

I have to cross-validate some data based on names.

The problem I'm facing is that, depending on the source, the names have slight variations. For example:

L & L AIR CONDITIONING   vs L & L AIR CONDITIONING Service

BEST ROOFING vs ROOFING INC

I have several thousand records, so doing this manually would be very time-consuming. I want to automate the process as much as possible.

Since some names contain additional words, lowercasing them is not enough.

Which algorithms are good for handling this?

Maybe something that computes a similarity score while giving low weight to words like 'INC' or 'Service'?

I tried the difflib library:

difflib.SequenceMatcher(None,name_1.lower(),name_2.lower()).ratio()

It gives decent results.

Recommended answer

I would use cosine similarity for this. It treats each name as a vector of word counts and gives you a score for how closely the two strings match.

Here is code that does this (I remember getting it from Stack Overflow itself some months ago, but I can't find the link now):

import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the tokens the two vectors share.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two Euclidean norms.
    sum1 = sum(vec1[x]**2 for x in vec1.keys())
    sum2 = sum(vec2[x]**2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero norm; treat that as no similarity.
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service') # returns 0.9258200997725514
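
The question's idea of giving low weight to words like 'INC' or 'Service' can be layered on top of the same cosine computation. The following is only a sketch under assumptions: the names weighted_vector, weighted_cosine and LOW_WEIGHT_WORDS, the word list and the 0.1 weight are all hypothetical and would need tuning against real data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')

# Hypothetical weights for generic business words; tune these for your data.
LOW_WEIGHT_WORDS = {'inc': 0.1, 'service': 0.1, 'services': 0.1, 'llc': 0.1, 'co': 0.1}


def weighted_vector(text, weights=LOW_WEIGHT_WORDS, default=1.0):
    """Count the tokens, then scale each count by its weight (1.0 if unlisted)."""
    counts = Counter(WORD.findall(text.strip().lower()))
    return {tok: n * weights.get(tok, default) for tok, n in counts.items()}


def weighted_cosine(a, b, weights=LOW_WEIGHT_WORDS):
    """Cosine similarity between the weighted token vectors of two names."""
    v1, v2 = weighted_vector(a, weights), weighted_vector(b, weights)
    numerator = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


# About 0.999, versus about 0.926 without weights.
print(weighted_cosine('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service'))
# About 0.70, versus 0.5 without weights.
print(weighted_cosine('BEST ROOFING', 'ROOFING INC'))
```

Down-weighting rather than deleting the generic words keeps a small penalty when they differ, so names that differ only by 'INC' still score slightly below an exact match.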

Another version that I found useful is slightly NLP-based. I wrote this one myself.

import re, math
from collections import Counter
# Requires the NLTK 'stopwords' and 'wordnet' corpora to be downloaded.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')

WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    # Dot product over shared tokens, divided by the product of the norms.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

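def text_to_vector(text):
    words = WORD.findall(text)
    # Expand each word with the lemma names of all its WordNet synsets.
    a = []
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    # Keep the original words that WordNet did not return.
    for i in words:
        if i not in a:
            a.append(i)
    # Deduplicate, drop stop words, and stem what remains.
    a = set(a)
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)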

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

def get_char_wise_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []

    # Score every token of a against every token of b.
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    # Average the pairwise scores; no pairs means no similarity.
    if not s:
        return 0
    return sum(s) / float(len(s))

get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068

You can call either get_similarity or get_char_wise_similarity to see which works better for your use case. I used both: the normal similarity to weed out the really close matches, and then the character-wise similarity to weed out the close-enough ones. The remaining ones had to be dealt with manually.
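
The two-pass workflow described above could be sketched roughly as follows. Everything here is an assumption for illustration: the triage function, the 0.9 and 0.6 thresholds, and the two stand-in scorers (token Jaccard and token overlap) are hypothetical. In practice you would pass get_similarity and get_char_wise_similarity as the scorers and pick thresholds by inspecting your own data.

```python
import re

WORD = re.compile(r'\w+')


def tokens(text):
    """Lowercased set of word tokens in a name."""
    return set(WORD.findall(text.lower()))


def jaccard(a, b):
    """Stand-in strict scorer: shared tokens over all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def overlap(a, b):
    """Stand-in loose scorer: shared tokens over the shorter name's token count."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / min(len(ta), len(tb)) if ta and tb else 0.0


def triage(pairs, strict_score, loose_score, strict_min=0.9, loose_min=0.6):
    """Split (name_a, name_b) pairs into matched, probable, and manual lists."""
    matched, probable, manual = [], [], []
    for a, b in pairs:
        if strict_score(a, b) >= strict_min:
            matched.append((a, b))    # first pass: really close
        elif loose_score(a, b) >= loose_min:
            probable.append((a, b))   # second pass: close enough
        else:
            manual.append((a, b))     # left for manual review
    return matched, probable, manual


pairs = [
    ('Best Roofing', 'BEST ROOFING'),
    ('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service'),
    ('BEST ROOFING', 'ROOFING INC'),
]
matched, probable, manual = triage(pairs, jaccard, overlap)
print(len(matched), len(probable), len(manual))
```

Passing the scorers in as arguments keeps the thresholds and the choice of similarity measure independent, so either can be swapped without touching the loop.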
