How to group words whose Levenshtein distance is more than 80 percent in Python
Question
Suppose I have a list:
person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']
I am trying to group the list so that the strings with the highest Levenshtein similarity end up next to each other. To compute the ratio between two words, I am using the Python package fuzzywuzzy.
Example:
>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>>
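For reference, the scores above are similarity ratios (higher means more alike), not edit distances. The sketch below reproduces them with the standard library, assuming `fuzz.ratio` behaves like `difflib.SequenceMatcher.ratio()` scaled to 0-100 and rounded; that assumption agrees with the three scores shown, but it is not taken from the fuzzywuzzy source. The helper name `similarity_percent` is made up for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(a, b):
    """Return 2*M/T as a rounded percentage, where M is the number of
    matched characters and T is the combined length of both strings.

    Hypothetical stand-in for fuzz.ratio (an assumption, see above).
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


print(similarity_percent('goldman LLC', 'oldman LLC'))  # 2*10/21 -> 95
print(similarity_percent('rakesh', 'zakesh'))           # 2*5/12  -> 83
print(similarity_percent('bikash', 'zikash'))           # 2*5/12  -> 83
```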
My end goal:
Group the words so that the Levenshtein similarity ratio between the words in each group is more than 80 percent.
My list should look something like this:
person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC']
The ratio between `bikash` and `zikash` is very high, so they should be together.
Code:
I am trying to achieve this by sorting, but the key function would have to be fuzz.ratio. The code below does not work, but this is the angle from which I am approaching the problem.
from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
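# Does not work: sort() calls key with one element at a time,
# so this two-argument lambda raises TypeError.
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print(combined_list)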
Could anyone help me group the words so that the Levenshtein ratio between them is more than 80 percent?
Recommended answer
This will group the names:
from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

grs = list()  # groups of names whose pairwise ratio is > 80
for name in combined_list:
    for g in grs:
        # join the first group in which every member is similar enough
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        # no existing group matched (inner loop never hit break): start a new one
        grs.append([name, ])
print('output groups:', grs)

# flatten the groups back into a single list
outlist = [el for g in grs for el in g]
print('output list:', outlist)
which produces:
input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
As you can see, the names are grouped correctly, but the order may not be the one you want.
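The same greedy grouping can be wrapped in a reusable function. This is only a sketch: the names `group_similar`, `difflib_ratio`, `threshold` and `scorer` are made up for illustration, and the default scorer uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the example runs without fuzzywuzzy installed. Scores from the two may differ for other inputs.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    # Stand-in scorer (an assumption): similarity on a 0-100 scale.
    return 100 * SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold=80, scorer=difflib_ratio):
    """Greedily put each name into the first group whose members all
    score above `threshold` against it; otherwise start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if all(scorer(name, member) > threshold for member in group):
                group.append(name)
                break
        else:
            # The inner loop finished without break: no group accepted the name.
            groups.append([name])
    return groups


names = ['rakesh', 'zakesh', 'bikash', 'zikash',
         'goldman LLC', 'oldman LLC', 'bakesh']
groups = group_similar(names)
flat = [name for group in groups for name in group]
print(groups)
print(flat)
```

With fuzzywuzzy installed, passing `scorer=fuzz.ratio` should give the behaviour of the answer above. Because the grouping is greedy, the result can depend on the order of the input list: a name joins the first group that accepts it, even if a later group would also fit.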