How to do multiprocessing in python on 2m rows running fuzzywuzzy string matching logic? Current code is extremely slow

Question
I am new to python and I'm running a fuzzywuzzy string matching logic on a list with 2 million records. The code is working and it is giving output as well. The problem is that it is extremely slow. In 3 hours it processes only 80 rows. I want to speed things up by making it process multiple rows at once.
If it helps - I am running it on my machine with 16Gb RAM and a 1.9 GHz dual-core CPU.
Below is the code I'm running.
    import pandas as pd
    from fuzzywuzzy import process

    d = []
    n = len(Africa_Company)  # original list with 2m string records
    for i in range(1, n):
        choices = Africa_Company[i + 1:n]
        word = Africa_Company[i]
        output = None
        try:
            output = process.extractOne(str(word), choices, score_cutoff=85)
        except Exception:
            print(word)  # to identify which string is throwing an exception
        print(i)  # to know how many rows are processed, can do without this also
        if output:
            d.append({'Company': Africa_Company[i],
                      'NewCompany': output[0],
                      'Score': output[1],
                      'Region': 'Africa'})
        else:
            d.append({'Company': Africa_Company[i],
                      'NewCompany': None,
                      'Score': None,
                      'Region': 'Africa'})

    Africa_Corrected = pd.DataFrame(d)  # output data in a pandas dataframe
Thanks in advance!

Answer
This is a CPU-bound problem. By going parallel you can speed it up by a factor of two at most (because you have two cores). What you should really do is speed up the single-threaded performance. Levenshtein distance is quite slow, so there are lots of opportunities to speed things up.
1. Use pruning. Don't attempt a full fuzzywuzzy match between two strings when there is no way it can give a good result. Try to find a simple linear algorithm that filters out irrelevant choices before the fuzzy match.
2. Consider indexing. Is there some way to index your list? For example: if your matching is based on whole words, create a hashmap that maps words to strings. Only attempt to match choices that share at least one word with your current string.
3. Preprocessing. Is there some work done on the strings in every match that you could do once up front? For example, if your Levenshtein implementation starts by creating sets out of the strings, consider creating all the sets first so you don't repeat the same work in every match.
4. Is there a better algorithm to use? Maybe Levenshtein distance isn't the best algorithm to begin with.
5. Is the Levenshtein implementation you use optimal? This goes back to step 3 (preprocessing). Is there anything else you can do to speed up the runtime?
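The pruning and indexing steps above can be sketched in a few lines. This is a minimal stdlib-only sketch with hypothetical sample data standing in for the 2m-row Africa_Company list; `difflib.SequenceMatcher` is used here only as a stand-in for the fuzzywuzzy scorer:

```python
import difflib
from collections import defaultdict

# Hypothetical sample data standing in for the 2m-row Africa_Company list.
companies = ["acme widgets ltd", "acme widget ltd", "zulu mining co", "blue ocean sa"]

# Indexing: map each word to the set of rows whose name contains it.
index = defaultdict(set)
for row, name in enumerate(companies):
    for word in name.split():
        index[word].add(row)

def candidates(row):
    # Pruning: only rows sharing at least one whole word can plausibly match,
    # so the expensive fuzzy scorer never sees the rest of the list.
    cand = set()
    for word in companies[row].split():
        cand |= index[word]
    cand.discard(row)
    return cand

def best_match(row, cutoff=0.85):
    # Run the expensive similarity scorer (difflib here, as a stdlib stand-in
    # for fuzzywuzzy's process.extractOne) only on the pruned candidate set.
    best = None
    for c in candidates(row):
        score = difflib.SequenceMatcher(None, companies[row], companies[c]).ratio()
        if score >= cutoff and (best is None or score > best[1]):
            best = (companies[c], score)
    return best
```

With an index like this, each row is compared against a handful of candidates instead of all 2 million, which is where the real speedup comes from.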
Multiprocessing will only speed things up by a constant factor (depending on the number of cores). Indexing can take you to a lower complexity class! So focus on pruning and indexing first, then on steps 3-5. Only when you have squeezed enough out of those steps should you consider multiprocessing.