在新列中获取具有类似地址的ID [英] Get ids with similar address in a new column
问题描述
我有一个数据框,通过该数据框我处理一些列,以获取每个客户ID的地址与每个其他客户ID的地址的匹配百分比.如果某些地址与其他地址的百分比大于80的其他地址匹配,那么我想在新列中收集相应的客户ID
I have a dataframe from which i process some columns to get the matching percentage of address of each customer id with address of every other customer id. If some addresses match the other addresses with percentage above 80 then I want to gather their corresponding customer ids in a new column
我编写了代码,其中获得了具有地址的元组列表及其在每个元组中的对应分数.
I have made the code in which i get the list of tuples having the address and their corresponding scores in each tuple.
import pandas as pd
from fuzzywuzzy import process
def pat_match(id,address):
length01=len(id) # normal integer sequence 1 to 10
for y in range(0,length01):
score=process.extractBests(address[y],address,score_cutoff=80)
print(score) # actual results(list of tuples)
d2=[sc[1] for sc in score]
#print(d2) # variable having list of scores per address tuple
if __name__ == '__main__':
data = pd.read_csv(r"address_details.csv", skiprows=0)
id = data['COD_CUST_ID'].values.tolist()
address = data['ADDRESS'].values.tolist()
pat_match(id,address)
假设我输入的数据为
Customer_ID Address
21213944 VPO. SAHWA CHURU RAJASTHAN 331302
21991538 WARD NO.-3 NATT ROAD TALWANDI SABO BATHINDA BATHINDA PUNJAB 151302
21991539 H.NO.-137 RAMA ROAD TALWANDI SABO BATHINDA BATHINDA PUNJAB 151302
21603327 VAGPUR KARCHCHA KALAN UDAIPUR RAJASTHAN 313803
21215934 VILLAGE GORIYAN TEHSIL UDAIPURWATI DIST JHUNJHUNU JHUJHUNU RAJASTHAN 333307
变量SCORE的中间输出是
And the intermediate output of the variable SCORE is
[('WARD NO 25 GHADSISAR ROAD BASANT KUNJ KE SAMNE HANUMAN MANDIR KE PASS CHOUDHARY COLONY GANGASHAR BIKANER RAJASTHAN 334001', 100), ('VPO. SAHWA CHURU RAJASTHAN 331302', 86), ('WARD NO.-3 NATT ROAD TALWANDI SABO BATHINDA BATHINDA PUNJAB 151302', 86), ('H.NO.-137 RAMA ROAD TALWANDI SABO BATHINDA BATHINDA PUNJAB 151302', 86), ('Karchha Kalan UDAIPUR RAJASTHAN 313803', 86)]
[('Karchha Kalan UDAIPUR RAJASTHAN 313803', 100), ('VAGPUR KARCHCHA KALAN UDAIPUR RAJASTHAN 313803', 91), ('WARD NO 25 GHADSISAR ROAD BASANT KUNJ KE SAMNE HANUMAN MANDIR KE PASS CHOUDHARY COLONY GANGASHAR BIKANER RAJASTHAN 334001', 86), ('VILLAGE GORIYAN TEHSIL UDAIPURWATI DIST JHUNJHUNU JHUJHUNU RAJASTHAN 333307', 86)]
[('VAGPUR KARCHCHA KALAN UDAIPUR RAJASTHAN 313803', 100), ('Karchha Kalan UDAIPUR RAJASTHAN 313803', 91), ('WARD NO 25 GHADSISAR ROAD BASANT KUNJ KE SAMNE HANUMAN MANDIR KE PASS CHOUDHARY COLONY GANGASHAR BIKANER RAJASTHAN 334001', 86), ('VILLAGE GORIYAN TEHSIL UDAIPURWATI DIST JHUNJHUNU JHUJHUNU RAJASTHAN 333307', 86)]
[('VILLAGE GORIYAN TEHSIL UDAIPURWATI DIST JHUNJHUNU JHUJHUNU RAJASTHAN 333307', 100), ('VPO. SAHWA CHURU RAJASTHAN 331302', 86), ('WARD NO 25 GHADSISAR ROAD BASANT KUNJ KE SAMNE HANUMAN MANDIR KE PASS CHOUDHARY COLONY GANGASHAR BIKANER RAJASTHAN 334001', 86), ('Karchha Kalan UDAIPUR RAJASTHAN 313803', 86), ('VAGPUR KARCHCHA KALAN UDAIPUR RAJASTHAN 313803', 86)]
我想成为的最终输出是
Search String Match Customer Ids
WARD NO.-3 NATT ROAD TALWANDI SABO BATHINDA BATHINDA PUNJAB 151302 21991538,21991539
VAGPUR KARCHCHA KALAN UDAIPUR RAJASTHAN 313803 21603327,21215934
推荐答案
根据您的问题,此解决方案将起作用,代码不言自明:)
As per your problem, this solution will work, Code is self-explanatory :)
# Getting the DataFrame as the parameter
def pat_match(df):
# Getting the column values of id and address in seprate list
id = df['COD_CUST_ID'].values.tolist()
address = df['ADDRESS'].values.tolist()
# Creating a new column with name 'Ids'
df['Ids'] = ""
length01=len(id)
for y in range(0,length01):
# The mathched address Id will will be appended in a list for every address
matched_ids = []
# Calculating list of address with match percentage more than 80%
score=process.extractBests(address[y],address,score_cutoff=80)
# Iterating over every address returned by score one by one
for matched_address in score:
# Getting Customer_ID of every Address
get = df['Customer_ID'][df['Address']==matched_address].tolist()[0]
# Appending the Id into a list
matched_ids.append(get)
# Finally Appending the list of matched ID to the column
df['Ids'][df['Customer_ID']==id[y]] = str(matched_ids)
主要功能:
if __name__ == '__main__':
data = pd.read_csv(r"address_details.csv", skiprows=0)
pat_match(data)
这篇关于在新列中获取具有类似地址的ID的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!