How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

Problem description

I am new to python and I'm running a fuzzywuzzy string matching logic on a list with 2 million records. The code is working and it is giving output as well. The problem is that it is extremely slow. In 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.

If it helps - I am running it on my machine with 16Gb RAM and a 1.9 GHz dual-core CPU.

Below is the code I'm running.

from fuzzywuzzy import process
import pandas as pd

d = []
n = len(Africa_Company)  # original list with 2m string records
for i in range(n - 1):   # start at 0 so the first record is not skipped
    choices = Africa_Company[i+1:n]  # keep this a list; str(choices) would match against one giant string
    word = Africa_Company[i]
    output = None  # reset so a row that raises doesn't reuse the previous row's result
    try:
        output = process.extractOne(str(word), choices, score_cutoff=85)
    except Exception:
        print(word)  # to identify which string is throwing an exception
    print(i)  # to know how many rows are processed, can do without this also
    if output:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': output[0],
                  'Score': output[1],
                  'Region': 'Africa'})
    else:
        d.append({'Company': Africa_Company[i],
                  'NewCompany': None,
                  'Score': None,
                  'Region': 'Africa'})

Africa_Corrected = pd.DataFrame(d)  # output data in a pandas dataframe

Thanks in advance!

Answer

This is a CPU-bound problem. By going parallel you can speed it up by a factor of two at most (because you have two cores). What you really should do is speed up the single-threaded performance. Levenshtein distance is quite slow, so there are lots of opportunities to speed things up.

  1. Use pruning. Don't try to run the full fuzzywuzzy match between two strings if there is no way it can give a good result. Try to find a simple linear algorithm to filter out irrelevant choices before the fuzzywuzzy match.
  2. Consider indexing. Is there some way you could index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings. Only try to match against the choices that share at least one word with your current string.
  3. Preprocessing. Is there some work done on the strings in every match that you could do beforehand? For example, if your Levenshtein implementation starts by creating sets out of your strings, consider creating all the sets first so you don't have to do the same work over and over in every match.
  4. Is there some better algorithm you could use? Maybe Levenshtein distance is not the best algorithm to begin with.
  5. Is the implementation of Levenshtein distance you are using optimal? This goes back to step 3 (preprocessing). Is there anything else you can do to speed up the runtime?
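Step 2 (indexing) can be sketched roughly as follows. This is an illustration, not the author's code: it builds an inverted index from words to row positions so that each record is only fuzzy-matched against candidates sharing at least one word with it. The helper names (`build_word_index`, `candidates_for`) are made up for this sketch, and the stdlib's difflib stands in for fuzzywuzzy, which is a third-party package; if you have fuzzywuzzy installed, you would call process.extractOne on the candidate list instead.

```python
import difflib
from collections import defaultdict

def build_word_index(records):
    """Map each lowercased word to the set of record indices containing it."""
    index = defaultdict(set)
    for i, rec in enumerate(records):
        for w in rec.lower().split():
            index[w].add(i)
    return index

def candidates_for(i, records, index):
    """Indices of records sharing at least one word with records[i] (excluding i itself)."""
    cands = set()
    for w in records[i].lower().split():
        cands |= index[w]
    cands.discard(i)
    return cands

records = ["Acme Ltd", "Acme Limited", "Globex Corp", "Initech"]
index = build_word_index(records)
# Only "Acme Limited" shares a word with "Acme Ltd", so it is the only
# candidate scored; the other 2m-ish rows would be skipped entirely.
for i in sorted(candidates_for(0, records, index)):
    score = difflib.SequenceMatcher(None, records[0], records[i]).ratio()
    print(records[0], "vs", records[i], round(score, 2))
```

The win here is algorithmic: instead of scoring every record against every other record, you score only the handful of rows the index returns, which is what can move you out of the quadratic complexity class.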

Multiprocessing will only speed things up by a constant factor (depending on the number of cores). Indexing can take you to a lower complexity class! So focus on pruning and indexing first, then steps 3-5. Only when you have squeezed enough out of those steps should you consider multiprocessing.
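If, after the steps above, multiprocessing still seems worthwhile, a minimal sketch looks like the following. The function and parameter names are invented for this example, and difflib again stands in for fuzzywuzzy (a third-party package); the idea is simply to make the per-row work a top-level function and fan the rows out across a Pool.

```python
import difflib
from multiprocessing import Pool

def best_match(args):
    """Score one record against its remaining choices; return (word, best, score)."""
    word, choices = args
    best, best_score = None, 0.0
    for c in choices:
        score = difflib.SequenceMatcher(None, word, c).ratio()
        if score > best_score:
            best, best_score = c, score
    return (word, best, best_score)

def match_all(records, processes=2):
    """Fan the per-row matching out across worker processes."""
    tasks = [(records[i], records[i+1:]) for i in range(len(records) - 1)]
    with Pool(processes) as pool:
        # chunksize batches tasks to cut inter-process overhead
        return pool.map(best_match, tasks, chunksize=100)

if __name__ == "__main__":
    sample = ["Acme Ltd", "Acme Limited", "Globex Corp", "Globex Corporation"]
    for word, match, score in match_all(sample):
        print(word, "->", match, round(score, 2))
```

Note the `if __name__ == "__main__"` guard, which multiprocessing requires on platforms that spawn rather than fork. Even so, with two cores this caps out at roughly a 2x speedup, which is why the pruning and indexing steps come first.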
