Identify IDs with similar addresses


Question

I have data in a CSV file that contains some IDs, their corresponding addresses, and the similarity percentage of each address against the others. I want to identify the IDs that have similar addresses, along with their match percentage.

I have done the text matching and computed the similarity percentage between the address strings by comparing each address to every other address.

import pandas as pd
from fuzzywuzzy import process, fuzz

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 10)

data = pd.read_csv(r"address_details.csv", skiprows=0)
id = data['COD_CUST_ID'].values.tolist()
address = data['ADDRESS'].values.tolist()

dict_list=[]

for i in range(0,len(id)):
    for add in range(0,len(address)):
        score=process.extractBests(address[add], address, limit=len(address), score_cutoff=40)
        #print(type(score))

        for sc in score:
            #print(sc)
            for scr in sc:
                print(scr)

            dict_={}
            dict_.update({"Cust_Id": id[i]})
            dict_.update({"Match Ratio": sc})
            dict_.update({"Search String": address[add]})
            #dict_.update({"Address List": address})

            dict_list.append(dict_)

df=pd.DataFrame(dict_list)


matches = df['Match Ratio'].tolist()
matches = [x[0][0] for x in matches]

found  = []
for s in df['Search String']:
    data_list=[]

    if s in matches:
        index=[i for i, x in enumerate(matches) if x == s]
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
print(found)

sd=df.to_csv("match_score.csv",sep=',',index=None)
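
As an aside to the script above: the pairwise comparison step can be sketched so that each ID stays paired with its own address and every pair is scored only once. This is a minimal sketch under stated assumptions, not the asker's method: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, so the scores will not be identical to fuzzywuzzy's, and the names `similarity`, `similar_pairs` and the sample records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 similarity score for two strings (case-insensitive).

    Assumption: difflib stands in for fuzzywuzzy here; the numbers differ.
    """
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def similar_pairs(records, cutoff=40):
    """Score every unordered pair of (cust_id, address) records exactly once."""
    pairs = []
    for (id_a, addr_a), (id_b, addr_b) in combinations(records, 2):
        score = similarity(addr_a, addr_b)
        if score >= cutoff:
            pairs.append((id_a, id_b, score))
    return pairs


# Made-up sample records, for illustration only
records = [
    (1, "12 Main Street, Springfield"),
    (2, "12 Main St, Springfield"),
    (3, "99 Ocean Drive, Miami"),
]

for id_a, id_b, score in similar_pairs(records, cutoff=80):
    print(id_a, id_b, score)
```

With the real file, `records` could presumably be built with `zip(data['COD_CUST_ID'], data['ADDRESS'])`, and the cutoff tuned to the data.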

Suppose I have this DataFrame as the output of my code:

Cust_Id Match Ratio Search String
1   [('ABC', 100)]  ABC
2   [('DEF', 100)]  DEF
3   [('DEF', 100)]  XYZ
4   [('ABC', 100)]  PQR
5   [('PQR', 100)]  TUV
6   [('DEF', 100)]  LMN

I want to get a list of the IDs that have similar data under the Match Ratio column.

Recommended answer

I have written code that produces a list containing each "Search string" and its corresponding matching 'Cust_Id' values.

The code is:

import pandas as pd

def duplicates(lst, item):
    """Return every index at which item occurs in lst."""
    return [i for i, x in enumerate(lst) if x == item]

# Build the sample DataFrame
data = {'Cust_Id': ['1 ', '2', '3', '4', '5', '6'],
        'Match Ratio': [[('ABC', 100)], [('DEF', 100)], [('DEF', 100)],
                        [('ABC', 100)], [('PQR', 100)], [('DEF', 100)]],
        'Search': ['ABC', 'DEF', 'XYZ', 'PQR', 'TUV', 'LMN']
        }
df = pd.DataFrame(data)

print(df)
# Extract the matched string (first element of the first tuple) from each Match Ratio entry
matches = df['Match Ratio'].tolist()
matches = [x[0][0] for x in matches]

found = []
for s in df['Search']:
    data_list = []
    # Only search strings that also occur as a matched string are reported
    if s in matches:
        index = duplicates(matches, s)
        Cust_Id = list([df['Cust_Id'][i]] for i in index)
        data_list.append(s)
        data_list.append(Cust_Id)
        found.append(data_list)
print(found)

DataFrame output:

  Cust_Id   Match Ratio Search
0      1   [(ABC, 100)]    ABC
1       2  [(DEF, 100)]    DEF
2       3  [(DEF, 100)]    XYZ
3       4  [(ABC, 100)]    PQR
4       5  [(PQR, 100)]    TUV
5       6  [(DEF, 100)]    LMN

Output of the found list:

[['ABC', [['1 '], ['4']]], ['DEF', [['2'], ['3'], ['6']]], ['PQR', [['5']]]]
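
The same grouping can also be sketched with a dictionary keyed on the matched string, which avoids rescanning the list for every search string. This is a hedged alternative rather than the answer's own method: the helper `group_ids_by_match` is a hypothetical name, the rows are copied from the sample data above, and it assumes the IDs are strings (stray whitespace such as `'1 '` is stripped). Unlike the loop above, it groups by every matched string, whether or not that string also appears in the Search column.

```python
from collections import defaultdict


def group_ids_by_match(rows):
    """Map each matched string to the list of Cust_Ids that matched it.

    rows: iterable of (cust_id, match_ratio) where match_ratio is a list
    whose first element is a (matched_string, score) tuple.
    """
    groups = defaultdict(list)
    for cust_id, match_ratio in rows:
        best_match, _score = match_ratio[0]  # first (string, score) tuple
        groups[best_match].append(cust_id.strip())  # assumes string IDs
    return dict(groups)


# Sample rows copied from the answer's data
rows = [
    ('1 ', [('ABC', 100)]),
    ('2', [('DEF', 100)]),
    ('3', [('DEF', 100)]),
    ('4', [('ABC', 100)]),
    ('5', [('PQR', 100)]),
    ('6', [('DEF', 100)]),
]

groups = group_ids_by_match(rows)
print(groups)
# {'ABC': ['1', '4'], 'DEF': ['2', '3', '6'], 'PQR': ['5']}
```

With the DataFrame from the answer, `rows` could be supplied as `zip(df['Cust_Id'], df['Match Ratio'])`.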

Hope this is what you were looking for :)
