How to parallelize many (fuzzy) string comparisons using apply in Pandas?
I have the following problem
I have a dataframe master that contains sentences, such as
master
Out[8]:
original
0 this is a nice sentence
1 this is another one
2 stackoverflow is nice
For every row in master, I look up the best match in another dataframe, slave, using fuzzywuzzy. I use fuzzywuzzy because the matching sentences in the two dataframes could differ slightly (extra characters, etc.).
For instance, slave could be
slave
Out[10]:
my_value name
0 2 hello world
1 1 congratulations
2 2 this is a nice sentence
3 3 this is another one
4 1 stackoverflow is nice
Here is a fully-functional, wonderful, compact working example :)
from fuzzywuzzy import fuzz
import pandas as pd

master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})
slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [2, 1, 2, 3, 1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # use fuzzywuzzy to see how close original and name are
    slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
    # return the my_value corresponding to the highest score
    # (.ix is deprecated; .loc works on modern pandas)
    return slave_df.loc[slave_df.score.idxmax(), 'my_value']

master['my_value'] = master.original.apply(lambda x: helper(x, slave))
The million-dollar question is: can I parallelize my apply code above?
After all, every row in master is compared to every row in slave (slave is a small dataset, so I can hold many copies of it in RAM).
I don't see why I couldn't run multiple comparisons (i.e. process multiple rows at the same time).
Problem: I don't know how to do that, or whether it's even possible.
Any help greatly appreciated!
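Before reaching for a framework, the row-by-row lookup can be parallelized with the standard library alone. The sketch below is a hypothetical, minimal illustration of the pattern: it uses difflib.SequenceMatcher as a stand-in scorer (so it runs without fuzzywuzzy installed) and plain Python lists in place of DataFrames, scoring rows in separate processes with concurrent.futures.ProcessPoolExecutor:

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher

# stand-in for the slave dataframe: (my_value, name) pairs;
# note the deliberate trailing space to mimic a near-exact match
SLAVE = [
    (2, 'hello world'),
    (1, 'congratulations'),
    (2, 'this is a nice sentence '),
    (3, 'this is another one'),
    (1, 'stackoverflow is nice'),
]

def best_match(original):
    # score every slave row against one master sentence and
    # return the my_value of the best-scoring row
    scores = [(SequenceMatcher(None, original, name).ratio(), my_value)
              for my_value, name in SLAVE]
    return max(scores)[1]

def parallel_lookup(originals, workers=4):
    # each master row is independent, so pool.map can fan them
    # out across worker processes
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(best_match, originals))

if __name__ == '__main__':
    master = ['this is a nice sentence',
              'this is another one',
              'stackoverflow is nice']
    print(parallel_lookup(master))  # -> [2, 3, 1]
```

Because the fuzzy scoring is CPU-bound and holds the GIL, processes (not threads) are what actually buy concurrency here; the worker function must live at module level so it can be pickled and sent to the workers.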
You can parallelize this with Dask.dataframe.
>>> import dask.dataframe as dd
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), name='my_value')
>>> dmaster.compute()
original my_value
0 this is a nice sentence 2
1 this is another one 3
2 stackoverflow is nice 1
Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes will cause data to serialize and move around your machine, which might slow things down a bit.
You can experiment between using threads and processes, or a distributed system, by managing the get= keyword argument to the compute() method. (In newer Dask versions, get= has been replaced by the scheduler= keyword, e.g. compute(scheduler='processes').)
>>> import dask.multiprocessing
>>> import dask.threaded
>>> dmaster.compute(get=dask.threaded.get)         # this is the default for dask.dataframe
>>> dmaster.compute(get=dask.multiprocessing.get)  # try processes instead