Is it possible to do a fuzzy match merge with Python pandas?
Problem description
I have two DataFrames that I want to merge based on a column. However, because of alternate spellings, differing numbers of spaces, and the absence or presence of diacritical marks, the keys do not match exactly. I would like to merge rows as long as their keys are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib's).
Say one DataFrame has the following data:
from pandas import DataFrame
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
       number
one         1
two         2
three       3
four        4
five        5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
      letter
one        a
too        b
three      c
fours      d
five       e
Then I want to get the resulting DataFrame:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
Recommended answer
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
      letter
one        a
two        b
three      c
four       d
five       e
In [31]: df1.join(df2)
Out[31]:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
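One caveat with the snippet above: get_close_matches returns an empty list when no candidate reaches its similarity cutoff (0.6 by default), so indexing with [0] raises an IndexError for such labels. A minimal sketch of a more defensive variant follows. The helper name closest_or_self and the extra 'zzz' row are hypothetical, added only to illustrate the no-match case.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"number": [1, 2, 3, 4, 5]},
                   index=["one", "two", "three", "four", "five"])
# 'zzz' is an illustrative label with no close match in df1.
df2 = pd.DataFrame({"letter": ["a", "b", "c", "d", "e", "f"]},
                   index=["one", "too", "three", "fours", "five", "zzz"])


def closest_or_self(label, candidates, cutoff=0.6):
    """Return the closest candidate, or the label unchanged if none is close enough."""
    matches = difflib.get_close_matches(label, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else label


candidates = list(df1.index)
df2.index = [closest_or_self(label, candidates) for label in df2.index]

# Left join on the index: the unmatched 'zzz' row simply does not appear.
joined = df1.join(df2)
print(joined)
```

Raising cutoff makes the matching stricter; lowering it accepts looser matches at the risk of wrong pairings.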
If these were columns, in the same vein you could apply it to the column and then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
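The column version above overwrites df2['name'] with the matched spelling. If the original spelling should be kept, one option is to store the match in a separate column and merge on that instead. The following is only a sketch under that assumption; the helper best_match and the column name matched_name are hypothetical, not part of the original answer.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame({"number": [1, 2, 3, 4, 5],
                    "name": ["one", "two", "three", "four", "five"]})
df2 = pd.DataFrame({"letter": ["a", "b", "c", "d", "e"],
                    "name": ["one", "too", "three", "fours", "five"]})


def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate, or None when nothing reaches the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = df1["name"].tolist()
# Keep df2's own spelling in 'name'; store the fuzzy match in a new column.
df2["matched_name"] = df2["name"].apply(lambda x: best_match(x, candidates))

# how='left' keeps every df1 row even if no df2 row matched it.
merged = df1.merge(df2, left_on="name", right_on="matched_name",
                   how="left", suffixes=("", "_df2"))
print(merged[["number", "name", "letter", "name_df2"]])
```

Here name_df2 holds the spelling as it appeared in df2, which makes it easier to audit which fuzzy pairings were made.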