String Distance Matrix in Python using pdist

Question

How can I calculate a Jaro-Winkler distance matrix of strings in Python?

I have a large array of hand-entered strings (names and record numbers) and I'm trying to find duplicates in the list, including duplicates that may have slight variations in spelling. A response to a similar question suggested using SciPy's pdist function with a custom distance function. I've tried to implement this solution with the jaro_winkler function from the Levenshtein package. The problem is that jaro_winkler requires string inputs, whereas pdist seems to require a 2D array as input.

Example:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler

# One string per row, a single column
fname = np.array(['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']).reshape(-1, 1)
dm = pdist(fname, jaro_winkler)  # this call raises the error shown below
dm = squareform(dm)

Expected output, something like this:

          Bob  Carl   Kristen  Calr  Doug
Bob       1.0   -        -       -     -
Carl      0.0   1.0      -       -     -
Kristen   0.0   0.46    1.0      -     -
Calr      0.0   0.93    0.46    1.0    -
Doug      0.53  0.0     0.0     0.0   1.0

Actual error:

jaro_winkler expected two Strings or two Unicodes

I'm assuming this is because jaro_winkler is receiving an ndarray instead of a string, and I'm not sure how to convert the function's input to a string in the context of pdist.

Does anyone have a suggestion for making this work? Thanks in advance!
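
The assumption about the ndarray can be checked without calling pdist at all. The following is a minimal sketch, assuming that pdist passes two rows of the input array to a callable metric; the variable name `row` is only illustrative. It shows that a row of the reshaped array is a one-element ndarray, while the element inside it is a string.

```python
import numpy as np

# Same layout as in the question: one string per row, a single column.
fname = np.array(['Bob', 'Carl', 'Kristen', 'Calr', 'Doug']).reshape(-1, 1)

# A callable metric is assumed to receive rows like this one.
row = fname[0]

print(type(row).__name__, row.shape)  # the row is an ndarray of shape (1,)
print(isinstance(row, str))           # False: the row itself is not a string
print(isinstance(row[0], str))        # True: numpy.str_ subclasses str
```

So a function that insists on plain strings needs to be handed `row[0]` rather than `row`.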

Recommended answer

You need to wrap the distance function so that it unpacks the string from each row, as demonstrated in the following example with the Levenshtein distance:

import numpy as np
from Levenshtein import distance
from scipy.spatial.distance import pdist, squareform

# my list of strings
strings = ["hello", "hallo", "choco"]

# prepare a 2-dimensional M x N array (M = 3 entries, N = 1 dimension)
transformed_strings = np.array(strings).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the Levenshtein distance function
distance_matrix = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))

# get the square matrix
print(squareform(distance_matrix))

Output (exact spacing depends on the NumPy version):
[[0. 1. 4.]
 [1. 0. 4.]
 [4. 4. 0.]]
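
To cross-check the matrix above without SciPy or the Levenshtein package, the same pairwise computation can be sketched with the standard library only. This is an illustration under stated assumptions, not the answer's method: `levenshtein` and `pairwise_matrix` are hypothetical helper names, and the edit distance is a plain dynamic-programming implementation written for this example.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_matrix(items, metric):
    """Full symmetric matrix of metric(a, b), with zeros on the diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair once and mirror the result.
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = float(metric(items[i], items[j]))
    return matrix


strings = ["hello", "hallo", "choco"]
matrix = pairwise_matrix(strings, levenshtein)
for row in matrix:
    print(row)
```

For the case in the question, the same wrapper shape should carry over to pdist, for example `lambda x, y: jaro_winkler(x[0], y[0])`. Note that jaro_winkler returns a similarity (1.0 for identical strings) rather than a distance, so `1 - jaro_winkler(x[0], y[0])` is probably the better fit if the result goes through squareform, which puts zeros on the diagonal.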
