How to efficiently replace values in a large dataframe (100k+ rows) from another based on closest match?

Problem description

So I am using Levenshtein distance to find the closest match, replacing many values in a large dataframe, with this answer as a base:

import operator

def levenshteinDistance(s1, s2):
    # Classic dynamic-programming edit distance that keeps only one
    # row of the DP table in memory at a time.
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

def closest_match(string, matchings):
    # A higher score means a smaller edit distance, so max() picks
    # the closest candidate.
    scores = {}
    for m in matchings:
        scores[m] = 1 - levenshteinDistance(string, m)
    return max(scores.items(), key=operator.itemgetter(1))[0]
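
A quick sanity check of the two helpers on made-up strings (the sample values are my own, not from the post):

print(levenshteinDistance('banana', 'bahama'))                   # 2 (two substitutions)
print(closest_match('kechap', ['pizza', 'ketchup', 'salami']))   # 'ketchup'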

So replacing many values in a moderately large dataframe (100k+ rows) from another of similar size, as follows, takes forever to run (it has been running for the last half hour!):

# Note: `string in <Series>` checks the index, not the values,
# so test membership against the Series' values instead.
results2.products = [closest_match(string, results2.products)
                     if string not in results2.products.values else string
                     for string in results.products]

So is there a way to do this more efficiently? I added the if-else condition so that a direct match skips the distance calculations entirely, since they would only produce the same result anyway.

results:

   products
0, pizza
1, ketchup
2, salami
3, anchovy
4, pepperoni
5, marinara
6, olive
7, sausage
8, cheese
9, bbq sauce
10, stuffed crust

results2:

   products
0, salaaaami
1, kechap
2, lives
3, ppprn
4, pizzas
5, marinara
6, sauce de bbq
7, marinara sauce
8, chease
9, sausages
10, crust should be stuffed

I want the values in results2 to be replaced with the closest matching values from results.
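
For illustration, a few of the intended corrections (my reading of the example, not output from the original post):

kechap    -> ketchup
pizzas    -> pizza
chease    -> cheese
sausages  -> sausage
salaaaami -> salami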

Answer

So I took a couple of measures to improve speed, achieving an almost 4800x speedup.

Posting here to benefit anyone dealing with slow performance of a CPU-intensive task in pandas:

  1. Instead of replacing all at once as in the question, I made a replacement dictionary over the unique values in each dataframe, which took the run from forever (I stopped it after 2 hours) down to about 2 minutes, since there are many, many repeated values. That's a 60x speedup:

# Compute the candidate values once, not on every iteration.
targets = results2.products.unique()
replacements = {string: closest_match(string, targets)
                if string not in targets else string
                for string in results.products.unique()}
results.replace({'products': replacements}, inplace=True)
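
The win comes from the repetition: the distance computation now runs once per distinct value rather than once per row. A quick way to gauge the saving (column name as in the post):

n_total = len(results.products)
n_unique = results.products.nunique()
print(n_total, n_unique)   # with 100k+ rows and heavy repetition, n_unique << n_total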

  2. I used a C-based implementation of Levenshtein distance: the editdistance library. While researching, I found that C-based implementations of many such tasks, e.g. matrix multiplication and search algorithms, are readily available; besides, you can always write a module in C and use it from Python. editdistance.eval('banana', 'bahama') took only 1.71 µs ± 289 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each), compared with my hand-rolled levenshteinDistance('banana', 'bahama') at 34.4 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each). That's a 20x speedup; see the sketch below.
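
As a minimal sketch, the scoring helper above can be swapped onto the C implementation like this (closest_match_fast is my own name for it, assuming pip install editdistance):

import editdistance

def closest_match_fast(string, candidates):
    # min() over raw distances replaces the "1 - distance" scoring used above
    return min(candidates, key=lambda m: editdistance.eval(string, m))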

  3. Then I made use of all of my cores at once through parallelism. I went through various alternatives, e.g. multiprocessing and threading, but none of them came close to the speed of modin.pandas. It requires a minimal change (just import modin.pandas as pd in place of import pandas as pd) and works elegantly; it made the previous runs around 4x faster. The swap is shown below.
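
The change, as described, is a one-line import swap; everything else stays plain pandas code:

# import pandas as pd        # before
import modin.pandas as pd    # after: same API, operations spread across all cores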

Thus the total is a 4800x speedup (60 × 20 × 4 = 4800), which is massive, and the whole thing now runs in the blink of an eye.
