How to parallelize many (fuzzy) string comparisons using apply in Pandas?

Problem description

I have the following problem.

I have a DataFrame master that contains sentences, such as:

master
Out[8]: 
                  original
0  this is a nice sentence
1      this is another one
2    stackoverflow is nice

For every row in master, I look up the best match in another DataFrame, slave, using fuzzywuzzy. I use fuzzywuzzy because the matching sentences in the two DataFrames can differ a bit (extra characters, etc.).

For instance, slave could be:

slave
Out[10]: 
   my_value                      name
0         2               hello world
1         1           congratulations
2         2  this is a nice sentence 
3         3       this is another one
4         1     stackoverflow is nice

Here is a fully-functional, wonderful, compact working example :)

from fuzzywuzzy import fuzz
import pandas as pd


master = pd.DataFrame({'original': ['this is a nice sentence',
                                    'this is another one',
                                    'stackoverflow is nice']})


slave = pd.DataFrame({'name': ['hello world',
                               'congratulations',
                               'this is a nice sentence ',
                               'this is another one',
                               'stackoverflow is nice'],
                      'my_value': [2, 1, 2, 3, 1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # use fuzzywuzzy to see how close original and name are;
    # the scores stay in a local Series so slave_df is not mutated on every call
    scores = slave_df['name'].apply(lambda x: fuzzy_score(x, orig_string))
    # return my_value corresponding to the highest score
    # (.loc replaces .ix, which has been removed from pandas)
    return slave_df.loc[scores.idxmax(), 'my_value']

master['my_value'] = master.original.apply(lambda x: helper(x, slave))

The million-dollar question is: can I parallelize my apply code above?

After all, every row in master is compared to all the rows in slave (slave is a small dataset, and I can hold many copies of it in RAM).

I don't see why I could not run multiple comparisons (i.e. process multiple rows at the same time).

Problem: I don't know how to do that, or if that's even possible.

Any help greatly appreciated!

Solution

You can parallelize this with Dask.dataframe.

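>>> import dask.dataframe as dd
>>> dmaster = dd.from_pandas(master, npartitions=4)
>>> # meta declares the name and dtype of the result (older Dask releases took name= here)
>>> dmaster['my_value'] = dmaster.original.apply(lambda x: helper(x, slave), meta=('my_value', 'int64'))
>>> dmaster.compute()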
                  original  my_value
0  this is a nice sentence         2
1      this is another one         3
2    stackoverflow is nice         1
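
To make the idea behind npartitions a bit more concrete, here is a rough, library-free sketch of what partitioning amounts to: the rows are split into contiguous chunks, each chunk is matched independently, and the results are concatenated in order. This is only a conceptual illustration, not how Dask is implemented. The names split_into_partitions, best_value and process_partition are hypothetical, and difflib's SequenceMatcher stands in for fuzzywuzzy so the example runs with the standard library alone.

```python
from difflib import SequenceMatcher


def split_into_partitions(rows, npartitions):
    """Split rows into at most npartitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), npartitions)
    chunks, start = [], 0
    for i in range(npartitions):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty chunks when there are fewer rows than partitions
            chunks.append(rows[start:stop])
        start = stop
    return chunks


def best_value(query, names, values):
    # Stand-in scorer (assumption): difflib ratio instead of fuzz.token_set_ratio
    scores = [SequenceMatcher(None, name, query).ratio() for name in names]
    return values[scores.index(max(scores))]


def process_partition(chunk, names, values):
    # Each partition only needs its own rows plus a copy of the small lookup table
    return [best_value(query, names, values) for query in chunk]


names = ['hello world', 'congratulations', 'this is a nice sentence ',
         'this is another one', 'stackoverflow is nice']
values = [2, 1, 2, 3, 1]
queries = ['this is a nice sentence', 'this is another one', 'stackoverflow is nice']

partitions = split_into_partitions(queries, 4)
results = [value for chunk in partitions
           for value in process_partition(chunk, names, values)]
print(results)
```

Because no partition depends on another, each call to process_partition could in principle be handed to a separate worker.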

Additionally, you should think about the tradeoffs between using threads vs processes here. Your fuzzy string matching almost certainly doesn't release the GIL, so you won't get any benefit from using multiple threads. However, using processes means the data has to be serialized and moved between processes on your machine, which might slow things down a bit.
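
The same tradeoff can be sketched outside Dask with the standard library's concurrent.futures, where the executor is passed in so that threads and processes can be swapped without touching the matching code. This is a minimal sketch under assumptions: similarity, best_value and parallel_lookup are hypothetical names, and difflib's SequenceMatcher stands in for fuzzywuzzy.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    # Stand-in scorer (assumption): difflib ratio in [0, 1] instead of fuzz.token_set_ratio
    return SequenceMatcher(None, a, b).ratio()


def best_value(query, names, values):
    # Score the query against every candidate name and return the value of the best one
    scores = [similarity(name, query) for name in names]
    best = max(range(len(scores)), key=scores.__getitem__)
    return values[best]


def parallel_lookup(queries, names, values, executor):
    # functools.partial of a module-level function is picklable, so the same
    # callable also works with a process-based executor
    lookup = partial(best_value, names=names, values=values)
    return list(executor.map(lookup, queries))


names = ['hello world', 'congratulations', 'this is a nice sentence ',
         'this is another one', 'stackoverflow is nice']
values = [2, 1, 2, 3, 1]
queries = ['this is a nice sentence', 'this is another one', 'stackoverflow is nice']

# Threads keep the demo simple; for pure-Python scoring they are limited by the GIL
with ThreadPoolExecutor(max_workers=4) as pool:
    result = parallel_lookup(queries, names, values, pool)
print(result)
```

Passing a concurrent.futures.ProcessPoolExecutor instead is the variant that sidesteps the GIL. It pickles the arguments for every task, which is the serialization cost mentioned above, and on platforms that spawn worker processes the call has to sit under an if __name__ == "__main__": guard.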

You can experiment with threads, processes, or a distributed system by passing the scheduler= keyword argument to the compute() method (older Dask releases used get= with dask.threaded.get or dask.multiprocessing.get instead).

>>> dmaster.compute(scheduler='threads')  # this is the default for dask.dataframe
>>> dmaster.compute(scheduler='processes')  # try processes instead
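
Whichever option you try, it is worth measuring rather than guessing. Below is a small timing sketch that uses only the standard library. The helper names timed and best_score are hypothetical, difflib again stands in for fuzzywuzzy, and the data is simply the example repeated to give the workers something to do.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def best_score(query, names):
    # Stand-in for the fuzzy comparison (assumption): best difflib ratio over all names
    return max(SequenceMatcher(None, query, name).ratio() for name in names)


def timed(label, run):
    # Run a zero-argument callable, print how long it took, return (result, seconds)
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")
    return result, elapsed


names = ['hello world', 'congratulations', 'this is a nice sentence ',
         'this is another one', 'stackoverflow is nice'] * 50
queries = ['this is a nice sentence', 'this is another one',
           'stackoverflow is nice'] * 20

serial, serial_s = timed("serial", lambda: [best_score(q, names) for q in queries])

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded, threaded_s = timed(
        "threads", lambda: list(pool.map(lambda q: best_score(q, names), queries)))
```

The same timed helper can wrap a dmaster.compute(...) call, so each scheduler is measured on your real data.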
