How can I compare different rows of one column with the Levenshtein distance metric in pandas?
Question
I have a table like this:
id name
1 gfh
2 bob
3 boby
4 hgf
etc.
How can I use the Levenshtein metric to compare different rows of my 'name' column?
I already know that I can use this to compare columns:
L.distance('Hello, Word!', 'Hallo, World!')
But what about rows?
Recommended answer
Here is a way to do it with pandas and numpy:
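import pandas as pd
import Levenshtein as L  # assumed import: the module the question refers to as L
from io import StringIO
from numpy import triu, ones
t = """id name
1 gfh
2 bob
3 boby
4 hgf"""
# Parse the whitespace-separated sample table and use id as the index
df = pd.read_csv(StringIO(t), sep=r'\s+').set_index('id')
print(df)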
    name
id
1    gfh
2    bob
3   boby
4    hgf
Create a dataframe in which every cell holds a one-element list with a name, so that pairs can be built for measuring distance:
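# Repeat the list of names once per row, then wrap each name in its own list
dfs = pd.DataFrame([df.name.tolist()] * df.shape[0], index=df.index, columns=df.index)
dfs = dfs.applymap(lambda x: [x])  # on pandas 2.1+, DataFrame.map is the newer name for applymap
print(dfs)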
id      1      2       3      4
id
1   [gfh]  [bob]  [boby]  [hgf]
2   [gfh]  [bob]  [boby]  [hgf]
3   [gfh]  [bob]  [boby]  [hgf]
4   [gfh]  [bob]  [boby]  [hgf]
Combine the lists to form a matrix of all pairs, then set the diagonal and the upper-right triangle to NaN:
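# Adding the frame to its transpose concatenates the lists cell by cell
dfd = dfs + dfs.T
# Mask the diagonal and everything above it (self-pairs and duplicate pairs)
dfd = dfd.mask(triu(ones(dfd.shape)).astype(bool))
print(dfd)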
id            1            2            3    4
id
1           NaN          NaN          NaN  NaN
2    [gfh, bob]          NaN          NaN  NaN
3   [gfh, boby]  [bob, boby]          NaN  NaN
4    [gfh, hgf]   [bob, hgf]  [boby, hgf]  NaN
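To see which cells the mask hides, the boolean array can be inspected on its own. This is a minimal sketch that assumes a 4x4 shape to match the sample table; the variable name `mask` is only illustrative and does not appear in the answer.

```python
import numpy as np

# True on the diagonal and above it, False strictly below the diagonal.
# DataFrame.mask replaces the True positions with NaN, so only the
# lower triangle (each unordered pair exactly once) survives.
mask = np.triu(np.ones((4, 4))).astype(bool)
print(mask)
```

For a table with n rows this leaves n * (n - 1) / 2 pairs, six in this example.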
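Measure the distance of each pair with L.distance:
# Cells masked to NaN are not lists, so they are passed through unchanged
dfd.applymap(lambda x: L.distance(x[0], x[1]) if isinstance(x, list) else x)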
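The same pairwise comparison can also be sketched without building a matrix of lists. The version below is an illustration rather than the answer's method: it uses `itertools.combinations` to visit each pair of rows once, and a hand-written `levenshtein` function (a hypothetical stand-in for `L.distance`, so the snippet runs without the Levenshtein package). The `names` dict mirrors the sample table, keyed by id.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


# Sample data from the question, keyed by id
names = {1: "gfh", 2: "bob", 3: "boby", 4: "hgf"}

# One entry per unordered pair of ids
pairs = {
    (i, j): levenshtein(names[i], names[j])
    for i, j in combinations(names, 2)
}

for (i, j), d in pairs.items():
    print(i, j, names[i], names[j], d)
```

With the real package installed, `levenshtein` could presumably be swapped for `L.distance`, which is typically much faster on long strings.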