使用python将近似字符串匹配从excel复制到另一个excel文件 [英] Copy approximate string matching from excel to another excel file using python

查看:134
本文介绍了使用python将近似字符串匹配从excel复制到另一个excel文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想问一下如何将某些行从一个Excel文件复制到另一个Excel文件.通过使用python模糊匹配方法或 ANY 其他可行的方法,希望根据名称将整个行进行匹配并复制到新的excel文件中.

Hi I would like to ask on how to copy some of the row from one excel file to another excel file. By using python fuzzy matching method or ANY other feasible way, the entire row by according to the name is hope to be matched and copied into new excel file.

这是第一个excel文件的输入数据,共有13行6列,如下所示:

Here is the input data from first excel file, there is 13 rows and 6 columns in total as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| name                                               | no1 | no2 | no3 | no4 | no5 |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Left___House___Hostel(w) to NothingNow___(w)  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Laksama to Kota_Dun                           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

通过插入第一行,我想让python识别行的近似名称,然后复制整行并粘贴到新的excel文件中.通过比较单词的相似性而不是字母的相似性,例如有多少个单词相同,如果大于或等于某个数量(比如说50%),它将通过复制.

By ingoring the first row, I would like to let python to recognize the approximate similar name of row and copy entire row and paste into a new excel file. By comparing the similarity of the words instead of alphabet, like how many of words is the same, if more than or equal to a certain amount (let say 50%), it would pass to copy.

例如,通过比较第2行和第3行,from Club___Long to Club___Short___Waterfrom Club___Long to Short___Water非常相似,from Club___Long to Club___Short___Water有7个单词,而from Club___Long to Short___Water有6个单词.在from Club___Long to Club___Short___Water的7个单词中,有6个与from Club___Long to Short___Water相似的单词.因此,6 / 7 * 100% = 85.71%大于50%,python会将其视为匹配并复制.

For example, by comparing row 2 and row 3, from Club___Long to Club___Short___Water is quite similar to from Club___Long to Short___Water, from Club___Long to Club___Short___Water has 7 words while from Club___Long to Short___Water has 6 words. Out of 7 words of from Club___Long to Club___Short___Water, there is 6 words similar to from Club___Long to Short___Water. Therefore, 6 / 7 * 100% = 85.71% which is more than 50%, python would consider it as matched and copy it.

例如,第2行到第4行大致相同,因此python将对其进行匹配并识别出几乎相同的内容,然后仅将整个第2行复制到整个第4行到新的excel文件中,并将其命名为"new_file_1". xlsx'.所需的输出如下所示:

For instance, row 2 to row 4 is approximately the same, so python would match it and recognize it almost the same, and copy only entire row 2 to entire row 4 to new excel file, and name it as 'new_file_1.xlsx'. The desired output as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Club___Short___Water           | abc | abc | abc | abc | abc |
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long to Short___Water                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Club___Long___Land to Short___Water           | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

第5行和第6行相同,并将其命名为"new_file_2.xlsx",所需输出如下所示:

Same goes with the row 5 and row 6, and name it as 'new_file_2.xlsx', the desired output as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___BB to Penang___AA                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Kinabalu___SD to Penang___SD                  | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

第7行到第9行相同,并将其命名为"new_file_3.xlsx",所需输出如下所示:

Same goes with the row 7 until row 9, and name it as 'new_file_3.xlsx', the desired output as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Hill___Town to Unknown___Island___Ice         | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

第10行到第11行相同,并将其命名为"new_file_4.xlsx",所需输出如下所示:

Same goes with the row 10 until row 11, and name it as 'new_file_4.xlsx', the desired output as shown below:

-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA(N) to Back___Garden(N)     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|
| from Front___House___AA___(N) to Back___Garden     | def | def | def | def | def |  
-----------------------------------------------------|-----|-----|-----|-----|-----|

关于第12行和第13行,它们都与其他行不同,因此不必复制,只需保留它即可.

Regarding row 12 and row 13, they both is dissimilar to other row, so do not have to copy, just leave it.

如果有人可以帮助我,我会非常感激,谢谢!

Appreciate a lot if anyone can help me out, thanks!

推荐答案

我创建了一个函数来替换重复项.它基于模糊逻辑.考虑到截断,我只是用所有其他名称的最高匹配项替换每个名称.然后,创建一个新列,在其中存储这些唯一的名称

I create a function to replace duplicates. it's base on fuzzy logic. I simply substitute each name with the highest matching of all the other names considering a cutoff. Then, I create a new column where I store these unique names

import difflib
import re

def similarity_replace(series):

    reverse_map = {}
    diz_map = {}
    for i,s in series.iteritems():

        clean_s = re.sub(r'(from)|(to)', '', s.lower())
        clean_s = re.sub(r'[^a-z]', '', clean_s)

        diz_map[s] = clean_s
        reverse_map[clean_s] = s

    best_match = {}
    uni = list(set(diz_map.values()))
    for w in uni:
        best_match[w] = sorted(difflib.get_close_matches(w, uni, n=3, cutoff=0.6))[0]

    return series.map(diz_map).map(best_match).map(reverse_map)

df = pd.DataFrame({'name':['from Club___Long to Club___Short___Water','from Club___Long to Short___Water',
                           'from Club___Long___Land to Short___Water','from Kinabalu___BB to Penang___AA',
                           'from Kinabalu___SD to Penang___SD','from Hill___Town to Unknown___Island___Ice',
                           'from Hill___Town to Unknown___Island___Ice','from Hill___Town to Unknown___Island___Ice',
                           'from Front___House___AA(N) to Back___Garden(N)','from Front___House___AA___(N) to Back___Garden',
                           'from Left___House___Hostel(w) to NothingNow___(w)','from Laksama to Kota_Dun'],
                  'no1':['adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb','adb']})

df['group_name'] = similarity_replace(df.name)
df

我们可以使用此列对所有相似且相似的实例进行分组

we can use this column to group all the instances that are similar and to something

for i,group in df.groupby('group_name'):

    ### do something ###
    print(group[['name','no1']])

这篇关于使用python将近似字符串匹配从excel复制到另一个excel文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆