How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

Question
I am new to python and I'm running a fuzzywuzzy string matching logic on a list with 2 million records. The code is working and it is giving output as well. The problem is that it is extremely slow. In 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.
If it helps - I am running it on my machine with 16Gb RAM and a 1.9 GHz dual-core CPU.
Below is the code I'm running.
    import pandas as pd
    from fuzzywuzzy import process

    d = []
    n = len(Africa_Company)  # original list with 2m string records
    for i in range(1, n):
        choices = Africa_Company[i + 1:n]
        word = Africa_Company[i]
        output = None
        try:
            output = process.extractOne(str(word), choices, score_cutoff=85)
        except Exception:
            print(word)  # to identify which string is throwing an exception
        print(i)  # to know how many rows are processed, can do without this also
        if output:
            d.append({'Company': Africa_Company[i],
                      'NewCompany': output[0],
                      'Score': output[1],
                      'Region': 'Africa'})
        else:
            d.append({'Company': Africa_Company[i],
                      'NewCompany': None,
                      'Score': None,
                      'Region': 'Africa'})

    Africa_Corrected = pd.DataFrame(d)  # output data in a pandas dataframe
Thanks in advance!

Answer
This is a CPU-bound problem. By going parallel you can speed it up by a factor of two at most (because you have two cores). What you should really do is speed up the single-threaded performance. Levenshtein distance is quite slow, so there are lots of opportunities to speed things up.
1. Use pruning. Don't attempt a full fuzzywuzzy match between two strings when there is no way it can give a good result. Try to find a simple linear algorithm that filters out irrelevant choices before the fuzzy match.
2. Consider indexing. Is there some way to index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings. Only attempt to match choices that share at least one word with your current string.
3. Preprocessing. Is there some work done on the strings in every match that you could do once up front? For example, if your Levenshtein implementation starts by creating sets out of the strings, consider creating all the sets first so you don't repeat the same work in every match.
4. Is there a better algorithm to use? Maybe Levenshtein distance isn't the best algorithm to begin with.
5. Is the Levenshtein implementation you use optimal? This goes back to step 3 (preprocessing). Is there anything else you can do to speed up the runtime?
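The pruning and indexing steps above can be sketched in a few lines. This is a minimal stdlib-only sketch with hypothetical sample data standing in for the 2m-row Africa_Company list; `difflib.SequenceMatcher` is used here only as a stand-in for the fuzzywuzzy scorer:

```python
import difflib
from collections import defaultdict

# Hypothetical sample data standing in for the 2m-row Africa_Company list.
companies = ["acme widgets ltd", "acme widget ltd", "zulu mining co", "blue ocean sa"]

# Indexing: map each word to the set of rows whose name contains it.
index = defaultdict(set)
for row, name in enumerate(companies):
    for word in name.split():
        index[word].add(row)

def candidates(row):
    # Pruning: only rows sharing at least one whole word can plausibly match,
    # so the expensive fuzzy scorer never sees the rest of the list.
    cand = set()
    for word in companies[row].split():
        cand |= index[word]
    cand.discard(row)
    return cand

def best_match(row, cutoff=0.85):
    # Run the expensive similarity scorer (difflib here, as a stdlib stand-in
    # for fuzzywuzzy's process.extractOne) only on the pruned candidate set.
    best = None
    for c in candidates(row):
        score = difflib.SequenceMatcher(None, companies[row], companies[c]).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (companies[c], score)
    return best
```

With an index like this, each row is compared against a handful of candidates instead of all 2 million, which is where the real speedup comes from.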
Multiprocessing will only speed things up by a constant factor (depending on the number of cores). Indexing can take you to a lower complexity class! So focus on pruning and indexing first, then on steps 3-5. Only when you have squeezed enough out of those steps should you consider multiprocessing.