Vectorizing or Speeding up Fuzzywuzzy String Matching on a PANDAS Column


Question

I am trying to find potential matches in a pandas column full of organization names. I am currently using iterrows(), but it is extremely slow on a dataframe with ~70,000 rows. After looking through StackOverflow I tried implementing a lambda row (apply) method, but that seems to barely speed things up, if at all.

The first four rows of the dataframe look like this:

index  org_name
0   cliftonlarsonallen llp minneapolis MN
1   loeb and troper llp newyork NY
2   dauby o'connor and zaleski llc carmel IN
3   wegner cpas llp madison WI
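For reference, a minimal reconstruction of this sample dataframe (assuming pandas is available) can be built as:

```python
import pandas as pd

# Reconstruct the four sample rows shown above.
df = pd.DataFrame({
    "org_name": [
        "cliftonlarsonallen llp minneapolis MN",
        "loeb and troper llp newyork NY",
        "dauby o'connor and zaleski llc carmel IN",
        "wegner cpas llp madison WI",
    ]
})
print(df.shape)  # (4, 1)
```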

The following code block works but took around five days to process:

from fuzzywuzzy import process

org_list = df['org_name']
for index, row in df.iterrows():
    # Take the top two matches; [1] selects the second-best,
    # since the best match is the string itself.
    x = process.extract(row['org_name'], org_list, limit=2)[1]
    if x[1] > 93:
        df.loc[index, 'fuzzy_match'] = x[0]
        df.loc[index, 'fuzzy_match_score'] = x[1]

In effect, for each row I compare the organization name against the list of all organization names, take the top two matches, and select the second-best match (because the top match will be the identical name), with the condition that the score must be higher than 93 before the new columns are created. The reason I create additional columns rather than simply replacing values is that I'd like to double-check the results first.
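The top-two-then-second-best pattern can be illustrated without fuzzywuzzy using the standard library's difflib (a rough stand-in only; difflib's ratio is not on the same scale as fuzz.WRatio):

```python
import difflib

orgs = [
    "cliftonlarsonallen llp minneapolis mn",
    "cliftonlarsonallen llp minneapolis nm",  # near-duplicate of the first
    "loeb and troper llp newyork ny",
]

query = orgs[0]
# The top two matches include the query itself; take the second-best.
top_two = difflib.get_close_matches(query, orgs, n=2, cutoff=0.0)
second_best = top_two[1]
score = difflib.SequenceMatcher(None, query, second_best).ratio() * 100
print(second_best, round(score, 1))
```

Here the near-duplicate scores well above the 93 threshold, while the unrelated name is never selected.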

Is there a way to speed this up? I have read several blog posts and StackOverflow questions that talk about 'vectorizing' this code, but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or pandas column?
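As a rough back-of-the-envelope check on the full-matrix idea (the numbers below simply restate the 70,000-row figure from the question):

```python
n = 70_000
pairs = n * n  # every name compared against every name
print(f"{pairs:,} comparisons")
# Even storing just one byte per score, a full matrix needs ~4.9 GB:
print(f"~{pairs / 1e9:.1f} GB at 1 byte per score")
```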

Answer

Given your task, you are comparing 70k strings with each other using fuzz.WRatio, so you have a total of 4,900,000,000 comparisons, each of which uses the Levenshtein distance inside fuzzywuzzy, an O(N*M) operation. fuzz.WRatio is a combination of multiple different string-matching ratios with different weights; it then selects the best ratio among them, so it may even have to calculate the Levenshtein distance multiple times. One goal should therefore be to reduce the search space by excluding some possibilities using a much faster matching algorithm. Another issue is that the strings are preprocessed to remove punctuation and to lowercase them. While this is required for the matching (so that e.g. an uppercased word becomes equal to a lowercased one), we can do it ahead of time, so the 70k strings only have to be preprocessed once. I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).
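As a rough illustration of that preprocessing step (this is an approximation for intuition only, not RapidFuzz's exact default_process implementation): lowercase and strip non-alphanumeric characters once, up front:

```python
import re

def simple_preprocess(s: str) -> str:
    # Approximation of the preprocessing described above:
    # lowercase, replace runs of non-alphanumerics with a space, trim.
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

print(simple_preprocess("Dauby O'Connor & Zaleski, LLC"))
# dauby o connor zaleski llc
```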

The following version applies some of these improvements: 1) it generates a dict mapping each organization to its preprocessed form, so this does not have to be done on each run; 2) it passes a score_cutoff to extractOne, so it can skip calculations where it already knows, from the length difference alone, that the cutoff cannot be reached.

import pandas as pd
from rapidfuzz import process, utils

org_list = df['org_name']
# Preprocess every name once up front instead of on every comparison.
processed_orgs = {org: utils.default_process(org) for org in org_list}

for i, (query, processed_query) in enumerate(processed_orgs.items()):
    # Compare against every other name, excluding the string itself.
    choices = processed_orgs.copy()
    del choices[query]
    match = process.extractOne(processed_query, choices, processor=None, score_cutoff=93)
    if match:
        df.loc[i, 'fuzzy_match'] = match[2]
        df.loc[i, 'fuzzy_match_score'] = match[1]

In my experiments this performs more than 10 times as fast as your previous solution. Another way to speed it up is to select a ratio type that has to do fewer calculations, e.g. fuzz.ratio instead of fuzz.WRatio. With fuzz.ratio you calculate far fewer ratios, and you can use fuzz.quick_lev_ratio in RapidFuzz, which is basically a quick way to estimate the Levenshtein ratio between two strings, to filter the choices first. This gives me about another 3x improvement, but has the disadvantage that it only calculates the ratio over the whole string, while fuzz.WRatio uses some token-based ratios as well, which help to spot similar words in the two organization names.

import pandas as pd
from rapidfuzz import process, fuzz, utils

def quick_lev_filter(processed_query, choices, score_cutoff):
    # Iterate over a snapshot of the items, since we delete entries from
    # the dict while filtering; deleting during direct iteration would
    # raise a RuntimeError.
    for choice, processed_choice in list(choices.items()):
        if not fuzz.quick_lev_ratio(processed_query, processed_choice,
                                    processor=None, score_cutoff=score_cutoff):
            del choices[choice]

org_list = df['org_name']
processed_orgs = {org: utils.default_process(org) for org in org_list}

for i, (query, processed_query) in enumerate(processed_orgs.items()):
    choices = processed_orgs.copy()
    del choices[query]

    # Cheap prefilter: drop choices that cannot reach the cutoff.
    quick_lev_filter(processed_query, choices, score_cutoff=93)

    match = process.extractOne(processed_query, choices, scorer=fuzz.ratio,
                               processor=None, score_cutoff=93)
    if match:
        df.loc[i, 'fuzzy_match'] = match[2]
        df.loc[i, 'fuzzy_match_score'] = match[1]

