Fuzzy Wuzzy String Matching on 2 Large Data Sets Based on a Condition - python

Problem description

I have 2 large data sets that I have read into Pandas DataFrames (~20K rows and ~40K rows respectively). When I try merging these two DFs outright using pandas.merge on the address field, I get a paltry number of matches compared to the number of rows. So I thought I would try fuzzy string matching to see if it improves the number of output matches.

I approached this by trying to create a new column in DF1 (20K rows) that was the result of applying the fuzzywuzzy extractOne function on DF1[addressline] against DF2[addressline]. I quickly realized that this would take forever, since it would be doing close to 1 billion comparisons (20K x 40K = 800 million).

Both of these datasets have "County" fields and my ask is this: is there a way to conditionally do a fuzzy string match on the "addressline" fields in both DFs based on the "county" fields being the same? Researching questions similar to mine I stumbled upon this discussion: Fuzzy logic on big datasets using Python

However I am still fuzzy (no pun intended) on how to go about grouping/blocking fields based on county. Any advice would be greatly appreciated!

import pandas as pd
from fuzzywuzzy import fuzz, process  # fuzz supplies the fuzz.ratio scorer used below

def fuzzy_match(x, choices, scorer, cutoff):
    # extractOne returns None when no choice scores at or above the cutoff
    match = process.extractOne(x, choices=choices, scorer=scorer, score_cutoff=cutoff)
    return match[0] if match else None

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place', '345 Pizza Drive', '456 Pretzel Junction'], 'ID': ['X', 'U', 'X', 'Y']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl', '345 Pizzza DR', '456 Pretzel Junktion'], 'ID': ['X', 'U', 'X', 'Y']})
# Lower-case both address columns so that case differences do not affect the score
test['Address1'] = test['Address1'].str.lower()
test2['Address1'] = test2['Address1'].str.lower()
test['FuzzyAddress1'] = test['Address1'].apply(fuzzy_match, args=(test2['Address1'], fuzz.ratio, 80))

I've added 2 images that are sample sets of the 2 different DFs imported into Excel. Not all the fields have been included since they aren't important to my question. To reiterate my end goal, I want a new column in one of the DFs that has the top result from fuzzy matching an address line with the other address lines in the 2nd DF but only for those lines where the counties match between both DFs. From there I plan to merge the two dfs, one on the fuzzy matched address and the address line column in the 2nd DF. Hopefully this doesn't sound confusing.

Recommended answer

You could adapt your fuzzy_match function to take the id as a variable (the ID column in the sample data, which here appears to stand in for county) and use this to subset your choices before doing the fuzzy search. Note that this requires applying the function over the whole dataframe rather than just the address column:

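def fuzzy_match(x, choices, scorer, cutoff):
    # x is a full row; restrict the candidates to rows of choices with the same ID
    match = process.extractOne(x['Address1'],
                               choices=choices.loc[choices['ID'] == x['ID'],
                                                   'Address1'],
                               scorer=scorer,
                               score_cutoff=cutoff)
    # extractOne returns None when nothing scores at or above the cutoff
    if match:
        return match[0]
    return None

# axis=1 passes each row (not each column) to fuzzy_match
test['FuzzyAddress1'] = test.apply(fuzzy_match,
                                   args=(test2, fuzz.ratio, 80),
                                   axis=1)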
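
The function above filters the choices frame again for every row. One possible variation, shown here only as a rough sketch, is to build the per-county blocks once and then look each row up in its block. To keep the example self-contained it uses plain lists of dicts and the standard library's difflib.get_close_matches instead of pandas and fuzzywuzzy, so its scores are on a 0 to 1 scale rather than 0 to 100. The helper names build_blocks and match_within_block, the county and address field names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from difflib import get_close_matches


def build_blocks(rows, key, field):
    """Group the lower-cased values of `field` by `key`; each block is built once."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row[field].lower())
    return blocks


def match_within_block(rows, blocks, key, field, cutoff=0.8):
    """For each row, return the best candidate from the block with the same key, or None."""
    results = []
    for row in rows:
        # Only candidates that share this row's key are ever compared
        candidates = blocks.get(row[key], [])
        best = get_close_matches(row[field].lower(), candidates, n=1, cutoff=cutoff)
        results.append(best[0] if best else None)
    return results


# Hypothetical sample data: the third left-hand row has a county with no block
left = [
    {"county": "X", "address": "123 Cheese Way"},
    {"county": "U", "address": "234 Cookie Place"},
    {"county": "Z", "address": "345 Pizza Drive"},
]
right = [
    {"county": "X", "address": "123 chese wy"},
    {"county": "U", "address": "234 kookie Pl"},
    {"county": "X", "address": "345 Pizzza DR"},
]

blocks = build_blocks(right, key="county", field="address")
matches = match_within_block(left, blocks, key="county", field="address")
print(matches)
```

With blocking, the number of comparisons is the sum over counties of (rows in DF1 for that county) x (rows in DF2 for that county), which is normally far smaller than the full 20K x 40K product. The same idea should carry over to the pandas version by grouping the second frame by county once and passing the matching group's addresses to process.extractOne.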
