Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

Views: 473 Published: 2020/6/15 19:29:39 Tags: python string-matching fuzzy-search fuzzywuzzy

Problem description

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator='\n')


def parse_csv(path):
    # Yield each row of the csv file as a list of column values
    with open(path, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            yield row


if __name__ == "__main__":
    ## Create lookup dictionary by parsing the first names csv
    data = {}
    for row in parse_csv('names_1.csv'):
        data[row[0]] = row[0]

    ## For each name in the second csv, score it against the lookup
    for row in parse_csv("names_2.csv"):
        # row is a list of columns; match on the name in the first column
        name = row[0]
        #print(process.extract(name, data, limit=100))
        for found, score, matchrow in process.extract(name, data, limit=100):
            if score >= 60:
                print('%d%% partial match: "%s" with "%s" ' % (score, name, found))
                Digi_Results = [name, score, found]
                writer.writerow(Digi_Results)

save_file.close()

The output is as follows:

Name11 , 90 , Name25 
Name11 , 85 , Name24 
Name11 , 65 , Name29

The script works fine. The output is as expected. But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.
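
The deduplication described here, keeping only the highest-scoring row for each name in column 1, can be sketched in plain Python as a post-processing step over rows that have already been collected. This is a minimal sketch, not part of the original script: the function name `keep_best` and the sample rows are illustrative assumptions, and it assumes each row is a `(name, score, match)` tuple.

```python
def keep_best(rows):
    """Return one (name, score, match) row per name, keeping the highest score."""
    best = {}
    for name, score, match in rows:
        # Replace the stored row only when this one scores strictly higher
        if name not in best or score > best[name][1]:
            best[name] = (name, score, match)
    # Dicts preserve insertion order, so names keep their first-seen order
    return list(best.values())


# Illustrative sample data shaped like the output shown in the question
rows = [
    ("Name11", 90, "Name25"),
    ("Name11", 85, "Name24"),
    ("Name11", 65, "Name29"),
    ("Name12", 95, "Name21"),
]

for row in keep_best(rows):
    print(row)
```

On a tie, the row seen first is kept, because only a strictly higher score replaces it.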

Recommended answer

fuzzywuzzy's process.extract() returns the list sorted by score in descending order, with the best match coming first.

So to find just the best match, you can set the limit argument to 1, so that only the best match is returned, and if its score is at least 60, you can write it to the csv, as you are doing now.

Example:

from fuzzywuzzy import process

## Reuses parse_csv, data and writer from the question's script
## For each name in the second csv, keep only the single best match
for row in parse_csv("names_2.csv"):
    # row is a list of columns; match on the name in the first column
    name = row[0]
    for found, score, matchrow in process.extract(name, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, name, found))
            Digi_Results = [name, score, found]
            writer.writerow(Digi_Results)
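
As an alternative to `limit=1`, fuzzywuzzy also provides `process.extractOne()`, which returns only the best match and accepts a `score_cutoff` argument. The following is a minimal sketch rather than the answer's own method: the helper name `best_matches` and its `extract_one` parameter are assumptions added here so that a stand-in matcher can be supplied for testing, and the cutoff of 60 is taken from the question.

```python
def best_matches(names, choices, cutoff=60, extract_one=None):
    """Yield (name, score, match) for the single best match of each name."""
    if extract_one is None:
        # Imported lazily so a stand-in matcher can be passed in instead
        from fuzzywuzzy import process
        extract_one = process.extractOne
    for name in names:
        # extractOne returns None when no choice reaches score_cutoff
        result = extract_one(name, choices, score_cutoff=cutoff)
        if result is not None:
            # The first two items are the matched choice and its score
            match, score = result[0], result[1]
            yield name, score, match
```

With the question's script, this could be called as `best_matches((row[0] for row in parse_csv("names_2.csv")), data)`, writing each yielded tuple with `writer.writerow()`.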
