How can I compare different rows of one column using the Levenshtein distance metric in pandas?


Problem description

I have a table like this:

id name
1 gfh
2 bob
3 boby
4 hgf

How can I use the Levenshtein metric to compare different rows of my 'name' column?

I already know that I can use this to compare columns:

L.distance('Hello, Word!', 'Hallo, World!')

But what about rows?

Recommended answer

Here is one way to do it with pandas and numpy:

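import pandas as pd
from io import StringIO
from numpy import triu, ones
import Levenshtein as L  # assumption: L in the question is the python-Levenshtein module

t = """id name
1 gfh
2 bob
3 boby
4 hgf"""

# Parse the whitespace-separated sample table and index it by id
df = pd.read_csv(StringIO(t), sep=r'\s+').set_index('id')
print(df)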

        name
id      
1    gfh
2    bob
3   boby
4    hgf

Create a dataframe holding lists of strings whose distances will be measured:

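# Repeat the list of names once per row, so every row holds all names
dfs = pd.DataFrame([df.name.tolist()] * df.shape[0], index=df.index, columns=df.index)
# Wrap each name in a one-element list (on pandas >= 2.1, DataFrame.map replaces applymap)
dfs = dfs.applymap(lambda x: [x])
print(dfs)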

id      1      2       3      4
id                             
1   [gfh]  [bob]  [boby]  [hgf]
2   [gfh]  [bob]  [boby]  [hgf]
3   [gfh]  [bob]  [boby]  [hgf]
4   [gfh]  [bob]  [boby]  [hgf]

Combine the lists to form a matrix of all pairs and set the upper-right triangle, including the diagonal, to NaN:

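# Adding the frame to its transpose concatenates the lists cell by cell
dfd = dfs + dfs.T
# Mask the upper triangle and the diagonal so each pair appears only once
dfd = dfd.mask(triu(ones(dfd.shape)).astype(bool))
print(dfd)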

id            1            2            3    4
id                                            
1           NaN          NaN          NaN  NaN
2    [gfh, bob]          NaN          NaN  NaN
3   [gfh, boby]  [bob, boby]          NaN  NaN
4    [gfh, hgf]   [bob, hgf]  [boby, hgf]  NaN

Compute L.distance for each pair:

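# Masked cells hold NaN, which cannot be indexed, so they are passed through unchanged
dfd.applymap(lambda x: L.distance(x[0], x[1]) if isinstance(x, list) else x)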
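
As a cross-check on the matrix approach above, the same pairwise comparison can be sketched without the intermediate frames of lists. This is only an illustration under stated assumptions: the `levenshtein` helper is a hypothetical pure-Python stand-in for `L.distance` (a standard dynamic-programming edit distance), and the `names` dict stands in for the `name` column keyed by `id`.

```python
from itertools import combinations


def levenshtein(a, b):
    """Hypothetical stand-in for L.distance: classic dynamic-programming edit distance."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Stand-in for the 'name' column of the sample table, keyed by id
names = {1: "gfh", 2: "bob", 3: "boby", 4: "hgf"}

# Pair every id with each later id, so each unordered pair is measured once
pairs = {(i, j): levenshtein(names[i], names[j])
         for i, j in combinations(names, 2)}

for (i, j), dist in pairs.items():
    print(i, j, names[i], names[j], dist)
```

With a real dataframe, the dict could presumably be obtained from `df.name.to_dict()`, and `L.distance` could be swapped in for the helper if the python-Levenshtein package is installed.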
