Most Likely Word Based on Max Levenshtein Distance
Question
I have a list of words:
lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']
I also have a pandas DataFrame:
df = pd.DataFrame({'input': ['dog', 'kat', 'leon', 'moues'], 'suggested_class': ['a', 'a', 'a', 'a']})
input suggested_class
dog a
kat a
leon a
moues a
I would like to populate the suggested_class column with the value from lst that is closest to the word in the input column, i.e. the one with the highest Levenshtein similarity score (smallest edit distance). I am using the fuzzywuzzy package to calculate that.
The expected output is:
input suggested_class
dog dog
kat cat
leon lion
moues mouse
I'm aware that one could implement something with the autocorrect package, like df.suggested_class = [autocorrect.spell(w) for w in df.input], but this would not work for my situation.
I've tried something like this (using from fuzzywuzzy import fuzz):
for word in lst:
    for n in range(0, len(df.input)):
        if fuzz.ratio(df.input.iloc[n], word) >= 70:
            df.suggested_class.iloc[n] = word
        else:
            df.suggested_class.iloc[n] = "unknown"
This only works for a fixed threshold. I've been able to capture the maximum score with:
max([fuzz.ratio(df.input.iloc[0], word) for word in lst])
but I am having trouble relating that score back to a word from lst, and subsequently populating suggested_class with that word.
Recommended answer
Since you mentioned fuzzywuzzy:
from fuzzywuzzy import process
# extractOne returns a (best_match, score) tuple; keep only the matched word
df['suggested_class'] = df['input'].apply(lambda x: process.extractOne(x, lst)[0])
df
Out[1365]:
input suggested_class
0 dog dog
1 kat cat
2 leon lion
3 moues mouse
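For readers who want to see what happens underneath, or who cannot install fuzzywuzzy, here is a minimal standard-library sketch of the same idea. It is not how fuzzywuzzy scores strings: it uses a plain edit-distance count rather than a 0-100 ratio, and the helper names levenshtein and closest_word, as well as the max_distance parameter, are made up for this illustration. The key point is the one the question was stuck on: pass a key function to min (or max) so you get back the word itself instead of only its score.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def closest_word(word, choices, max_distance=None):
    """Return the choice with the smallest edit distance to word.

    max_distance is a hypothetical optional cutoff: if even the best
    candidate is further away than this, return "unknown" instead.
    """
    best = min(choices, key=lambda c: levenshtein(word, c))
    if max_distance is not None and levenshtein(word, best) > max_distance:
        return "unknown"
    return best


lst = ['dog', 'cat', 'mate', 'mouse', 'zebra', 'lion']
inputs = ['dog', 'kat', 'leon', 'moues']
suggested = [closest_word(w, lst) for w in inputs]
print(suggested)
```

With the DataFrame from the question, the same helper could presumably be applied as df['suggested_class'] = df['input'].apply(lambda w: closest_word(w, lst)). The optional cutoff covers the "unknown" case from the original attempt without hard-coding a threshold into the loop.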