Replace strings based on similarity
Problem description
I am trying to replace strings in one list with strings from another list.
strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment', 'satellite']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']
Expected output:
str_list = ['astroid 3-cyclone', 'astroid 3-cyclone', 'D-astroid', 'D-comment', 'satellite']
and also a dictionary containing the mapping:
dict =
{'astroid 3-cyclone':['astroid 3-cyclone', 'astroid 3-cyclone'],
'D-comment':'D-comment',
'D-astroid':'DL-astroid',
}
I am trying to implement it for a test case using difflib, in the following way:
from difflib import SequenceMatcher
from pprint import pprint
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']
similarity = similar('DL-astroid', 'astroid 3-cyclone')
pprint(similarity)
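To see which pairs would clear a 0.9 threshold, every pair can be scored first. This is a minimal sketch that reuses the `similar` helper above; the `scores` dictionary is only an illustrative name and is not part of the original code:

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical
    return SequenceMatcher(None, a, b).ratio()

strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']

# Illustrative table: score of every (source, candidate) pair
scores = {(s, t): similar(s, t) for s in strlist for t in to_match}

for (s, t), score in scores.items():
    print(f'{s!r} vs {t!r}: {score:.3f}')
```

Printing the whole table makes it easier to pick a threshold that separates the intended matches from the rest.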
Basically, if a string in strlist has a similarity ratio above a threshold (0.9 or 0.85) with a string in to_match, it has to be replaced by that string from to_match. I could use two for loops to check whether each item in strlist has a high similarity ratio (> 0.9) with an item in to_match, but I am not sure whether that is an efficient way to implement it.
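One way to avoid writing the inner loop by hand is `difflib.get_close_matches`, which returns the best candidates whose ratio is at or above a cutoff. The sketch below assumes a cutoff of 0.9; `replace_similar` is a hypothetical helper name, not something from difflib:

```python
from difflib import get_close_matches

strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment', 'satellite']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']

def replace_similar(items, candidates, cutoff=0.9):
    # Hypothetical helper: swap each item for its best candidate at or above cutoff
    replaced = []
    for item in items:
        # n=1 asks for at most one match, the best-scoring one
        best = get_close_matches(item, candidates, n=1, cutoff=cutoff)
        # An empty list means no candidate was similar enough, so keep the item
        replaced.append(best[0] if best else item)
    return replaced

print(replace_similar(strlist, to_match))
```

Each item is still compared against every candidate, so the work remains proportional to len(strlist) × len(to_match); for lists of this size that is unlikely to matter. Note that the cutoff is inclusive (>=), unlike the strict > 0.9 described above.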
Any suggestions?
My attempt so far; I am not sure how to create the dictionary, though:
from difflib import SequenceMatcher
from pprint import pprint
def similar(a, to_match):
    # Similarity ratio of `a` against every candidate in to_match
    percent_similarity = [SequenceMatcher(None, a, b).ratio() for b in to_match]
    best = max(percent_similarity)
    max_value_index = percent_similarity.index(best)
    # Replace only when the best candidate is similar enough; otherwise keep `a`
    return to_match[max_value_index] if best > 0.9 else a
strlist = ['D-saturn 6-pluto', 'D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment', 'literal']
to_match = ['saturn 6-pluto', 'pluto', 'astroid 3-cyclone', 'D-comment', 'D-astroid']
replaced = [similar(item, to_match) for item in strlist]  # avoid shadowing the built-in map
pprint(replaced)
Recommended answer
You can build a dictionary from the second list and apply it to the first:
strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment', 'satellite']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']
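# Key each target by the text after its last hyphen, e.g. 'cyclone' -> 'astroid 3-cyclone'
d1 = {i.split('-')[-1]:i for i in to_match}
# Replace each string whose last hyphen-separated part is a known key; otherwise keep it
result1 = [d1.get(i.split('-')[-1], i) for i in strlist]
# For each target, collect the original strings that end with its key
result2 = {b:[i for i in strlist if i.endswith(a)] for a, b in d1.items()}
# Unwrap single-element lists so that they map to a plain string
result2 = {a:b if len(b) != 1 else b[0] for a, b in result2.items()}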
Output:
['astroid 3-cyclone', 'astroid 3-cyclone', 'D-astroid', 'D-comment', 'satellite']
{'astroid 3-cyclone': ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone'], 'D-comment': 'D-comment', 'D-astroid': 'DL-astroid'}
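The dictionary above keys on the text after the last hyphen, so it depends on this naming pattern rather than on a similarity score. If a difflib-based version is preferred, as in the question, a minimal sketch could look like the following; `best_match`, `replaced`, `mapping` and the 0.9 threshold are illustrative assumptions, not part of the original answer:

```python
from collections import defaultdict
from difflib import SequenceMatcher

strlist = ['D-astroid 3-cyclone', 'DL-astroid 3-cyclone', 'DL-astroid', 'D-comment', 'satellite']
to_match = ['astroid 3-cyclone', 'D-comment', 'D-astroid']

def best_match(item, candidates, threshold=0.9):
    # Hypothetical helper: most similar candidate, or None if none exceeds threshold
    scored = [(SequenceMatcher(None, item, c).ratio(), c) for c in candidates]
    score, candidate = max(scored)
    return candidate if score > threshold else None

replaced = []
mapping = defaultdict(list)  # target string -> original strings it replaced
for item in strlist:
    match = best_match(item, to_match)
    if match is None:
        replaced.append(item)  # nothing similar enough, keep the string as is
    else:
        replaced.append(match)
        mapping[match].append(item)

print(replaced)
print(dict(mapping))
```

Unlike the output above, this sketch keeps every dictionary value as a list, even when only one string was mapped, which tends to be easier to consume downstream.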