Excel数据的模糊逻辑-Pandas [英] Fuzzy logic for excel data -Pandas

查看:137
本文介绍了Excel数据的模糊逻辑-Pandas的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据帧DF(〜100k行),这是一个原始数据文件,还有DF1(15k行),映射文件.我正在尝试将DF.address和DF.Name列匹配到DF1.Address和DF1.Name.找到匹配项后,应在DF.ID中填充DF1.ID(如果DF1.ID不为None),否则应在DF.ID中填充DF1.top_ID.

I have two dataframes DF(~100k rows)which is a raw data file and DF1(15k rows), mapping file. I'm trying to match the DF.address and DF.Name columns to DF1.Address and DF1.Name. Once the match is found DF1.ID should be populated in DF.ID(if DF1.ID is not None) else DF1.top_ID should be populated in DF.ID.

我能够借助模糊逻辑来匹配地址和名称,但是我仍然坚持如何连接获得的结果以填充ID.

I'm able to match the address and name with the help of fuzzy logic but i'm stuck how to connect the result obtained to populate the ID.

DF1-映射文件

DF1-Mapping file

DF原始数据文件

DF Raw Data file

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from operator import itemgetter




df=pd.read_excel("Test1", index=False)
df1=pd.read_excel("Test2", index=False)


df=df[df['ID'].isnull()]
zip_code=df['Zip'].tolist()
Facility_city=df['City'].tolist()
Address=df['Address'].tolist()
Name_list=df['Name'].tolist()


def fuzzy_match(x, choice, scorer, cutoff):
    return (process.extractOne(x, 
                               choices=choice, 
                               scorer=scorer, 
                               score_cutoff=cutoff))

for pin,city,Add,Name in zip(zip_code,Facility_city,Address,Name_list):
        #====Address Matching=====#
        choice=df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Address1']
        result=fuzzy_match(Add,choice,fuzz.ratio,70)
        #====Name Matching========#
        if (result is not None):
            if (result[3]>70):
                choice_1=(df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Name'])
                result_1=(fuzzy_match(Name,choice_1,fuzz.ratio,95))
                print(ID)
                if (result_1 is not None):
                    if(result_1[3]>95):
                       #Here populating the matching ID
                        print("ok")       


                    else:
                        continue
                else:
                    continue
            else:
                continue
        else:

推荐答案

IIUC:这是一个解决方案:

IIUC: Here is a solution:

from fuzzywuzzy import fuzz
import pandas as pd

#Read raw data from clipboard
raw = pd.read_clipboard()

#Read map data from clipboard
mp = pd.read_clipboard()

#Merge raw data and mp data as following 
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')

#dfr will have many duplicate rows - eliminate duplicate
#To eliminate duplicate using toke_sort_ratio, compare address x and y
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)

#Filter only max ratio rows grouped by Address_x
dfr1 = dfr.iloc[dfr.groupby('Address_x').apply(lambda x: x['SCORE'].idxmax())]
#dfr1 shall have the desired result

链接 有样本数据可以测试提供的解决方案.

This link has sample data to test the solution provided.

这篇关于Excel数据的模糊逻辑-Pandas的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆