Return Similarity Matrix From Two Variable-length Arrays of Strings (scipy option?)
Question
Say I have two arrays:
import numpy as np
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])
and I want to compute the similarity of the strings in arr2 to the strings in arr1.
arr1 is an array of correctly spelled words.
arr2 is an array of words not recognized in a dictionary of words.
I want to return a matrix which will then be turned into a pandas DataFrame.
My current solution (credit):
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from Levenshtein import ratio
arr3 = np.concatenate((arr1, arr2)).reshape(-1,1)
matrix = squareform(pdist(arr3, lambda x,y: ratio(x[0], y[0])))
df = pd.DataFrame(matrix, index=arr3.ravel(), columns=arr3.ravel())
Output:
            faucet   faucets      bath     parts  bathroom   faucett  \
faucet    0.000000  0.923077  0.400000  0.363636  0.285714  0.923077
faucets   0.923077  0.000000  0.363636  0.500000  0.266667  0.857143
bath      0.400000  0.363636  0.000000  0.444444  0.666667  0.363636
parts     0.363636  0.500000  0.444444  0.000000  0.307692  0.333333
bathroom  0.285714  0.266667  0.666667  0.307692  0.000000  0.266667
faucett   0.923077  0.857143  0.363636  0.333333  0.266667  0.000000
faucetd   0.923077  0.857143  0.363636  0.333333  0.266667  0.857143
bth       0.222222  0.200000  0.857143  0.250000  0.545455  0.200000
kichen    0.333333  0.307692  0.200000  0.000000  0.142857  0.307692

           faucetd       bth    kichen
faucet    0.923077  0.222222  0.333333
faucets   0.857143  0.200000  0.307692
bath      0.363636  0.857143  0.200000
parts     0.333333  0.250000  0.000000
bathroom  0.266667  0.545455  0.142857
faucett   0.857143  0.200000  0.307692
faucetd   0.000000  0.200000  0.307692
bth       0.200000  0.000000  0.222222
kichen    0.307692  0.222222  0.000000
The problem with this solution: I waste time computing pairwise distance ratios on words I already know are correctly spelled.
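To put a number on that overhead for the arrays above: pdist over the 9 concatenated words evaluates every unordered pair, while only the 5 x 4 pairs that cross between arr1 and arr2 are actually needed. A quick sketch of the arithmetic (the variable names are illustrative only):

```python
from math import comb

n_correct = 5   # len(arr1): correctly spelled words
n_unknown = 4   # len(arr2): unrecognized words

# pdist on the concatenated array compares every unordered pair once.
pairs_all = comb(n_correct + n_unknown, 2)

# Only pairs that cross between the two arrays are of interest.
pairs_needed = n_correct * n_unknown

print(pairs_all, pairs_needed)  # 36 20
```

The gap grows quickly when the dictionary of correct words is much larger than the list of unrecognized ones.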
What I'd like is to hand a function arr1 and arr2 (which can be different lengths!) and output a matrix (not necessarily square) with the ratios.
The result would look like this (without the computational overhead):
>>> df.drop(index=arr1, columns=arr2)
           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857
Recommended answer
I think you're looking for cdist:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from Levenshtein import ratio
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])
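# cdist expects 2-D inputs, so each word is reshaped into a one-element row;
# the callable receives two such rows and compares their single entries.
matrix = cdist(arr2.reshape(-1, 1), arr1.reshape(-1, 1), lambda x, y: ratio(x[0], y[0]))
# Rows are the unrecognized words (arr2), columns the correctly spelled ones (arr1).
df = pd.DataFrame(data=matrix, index=arr2, columns=arr1)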
Result:
           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857
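For comparison, the same rectangular matrix can be built without scipy by looping over the cross pairs directly. The sketch below uses only the standard library. The names lcs_length, lcs_ratio and similarity_matrix are hypothetical, and the formula 2 * LCS / (len(a) + len(b)) is an assumption: it happens to reproduce the values in the table above, but it is not a statement about how Levenshtein.ratio is implemented. Any other callable taking two strings can be passed as metric instead.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """Assumed stand-in for a similarity ratio: 2 * LCS / total length."""
    total = len(a) + len(b)
    return 2 * lcs_length(a, b) / total if total else 1.0


def similarity_matrix(rows, cols, metric=lcs_ratio):
    """Return a len(rows) x len(cols) nested list of metric(row, col)."""
    return [[metric(r, c) for c in cols] for r in rows]


arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']

# Only the len(arr2) * len(arr1) cross pairs are evaluated.
matrix = similarity_matrix(arr2, arr1)
for word, row in zip(arr2, matrix):
    print(f"{word:<8}", "  ".join(f"{value:.6f}" for value in row))
```

The nested list can be handed to pd.DataFrame(matrix, index=arr2, columns=arr1) in the same way as the cdist result.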