Return Similarity Matrix From Two Variable-length Arrays of Strings (scipy option?)


Question

Say I have two arrays:

import numpy as np
arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])

and I want to compute the similarity of the strings in arr2 to the strings in arr1.

arr1 is an array of correctly spelled words.

arr2 is an array of words that a dictionary does not recognize.

I want to return a matrix which will then be turned into a pandas DataFrame.

My current solution (credit):

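import pandas as pd
from scipy.spatial.distance import pdist, squareform
from Levenshtein import ratio

# Stack both arrays into one column so pdist can compare every pair of words
arr3 = np.concatenate((arr1, arr2)).reshape(-1, 1)
# squareform expands the condensed pairwise result into a full square matrix
matrix = squareform(pdist(arr3, lambda x, y: ratio(x[0], y[0])))
df = pd.DataFrame(matrix, index=arr3.ravel(), columns=arr3.ravel())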

Output:

            faucet   faucets      bath     parts  bathroom   faucett  \
faucet    0.000000  0.923077  0.400000  0.363636  0.285714  0.923077   
faucets   0.923077  0.000000  0.363636  0.500000  0.266667  0.857143   
bath      0.400000  0.363636  0.000000  0.444444  0.666667  0.363636   
parts     0.363636  0.500000  0.444444  0.000000  0.307692  0.333333   
bathroom  0.285714  0.266667  0.666667  0.307692  0.000000  0.266667   
faucett   0.923077  0.857143  0.363636  0.333333  0.266667  0.000000   
faucetd   0.923077  0.857143  0.363636  0.333333  0.266667  0.857143   
bth       0.222222  0.200000  0.857143  0.250000  0.545455  0.200000   
kichen    0.333333  0.307692  0.200000  0.000000  0.142857  0.307692   

           faucetd       bth    kichen  
faucet    0.923077  0.222222  0.333333  
faucets   0.857143  0.200000  0.307692  
bath      0.363636  0.857143  0.200000  
parts     0.333333  0.250000  0.000000  
bathroom  0.266667  0.545455  0.142857  
faucett   0.857143  0.200000  0.307692  
faucetd   0.000000  0.200000  0.307692  
bth       0.200000  0.000000  0.222222  
kichen    0.307692  0.222222  0.000000

The problem with this solution: I waste time computing pairwise ratios between words I already know are correctly spelled.

What I'd like is to hand a function arr1 and arr2 (which can be different lengths!) and get back a matrix of the ratios (not necessarily square).

The result would look like this (without the computational overhead):

>>> df.drop(index=arr1, columns=arr2)

           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857

Recommended answer

I think you're looking for cdist:

import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
from Levenshtein import ratio

arr1 = np.array(['faucet', 'faucets', 'bath', 'parts', 'bathroom'])
arr2 = np.array(['faucett', 'faucetd', 'bth', 'kichen'])

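# cdist compares each row of the first input with each row of the second,
# so only the arr2-vs-arr1 pairs are evaluated
matrix = cdist(arr2.reshape(-1, 1), arr1.reshape(-1, 1), lambda x, y: ratio(x[0], y[0]))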
df = pd.DataFrame(data=matrix, index=arr2, columns=arr1)

Result:

           faucet   faucets      bath     parts  bathroom
faucett  0.923077  0.857143  0.363636  0.333333  0.266667
faucetd  0.923077  0.857143  0.363636  0.333333  0.266667
bth      0.222222  0.200000  0.857143  0.250000  0.545455
kichen   0.333333  0.307692  0.200000  0.000000  0.142857
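
If you would rather not depend on how scipy treats string arrays, the same rectangular matrix can be built with a plain nested loop. The following is only a minimal sketch: similarity_matrix and difflib_ratio are hypothetical helper names (not part of scipy or Levenshtein), and difflib.SequenceMatcher from the standard library is used as a stand-in for Levenshtein.ratio so the example runs without extra packages. The two scoring functions can disagree for some word pairs.

```python
from difflib import SequenceMatcher


def similarity_matrix(rows, cols, score):
    """Return a len(rows) x len(cols) nested list of score(row, col) values."""
    return [[score(r, c) for c in cols] for r in rows]


def difflib_ratio(a, b):
    # Stand-in for Levenshtein.ratio (an assumption made for this sketch only);
    # both return a similarity between 0.0 and 1.0.
    return SequenceMatcher(None, a, b).ratio()


arr1 = ['faucet', 'faucets', 'bath', 'parts', 'bathroom']
arr2 = ['faucett', 'faucetd', 'bth', 'kichen']

# Rows are the unrecognized words, columns are the correctly spelled ones.
# This evaluates 4 * 5 = 20 pairs, versus the 9 * 8 / 2 = 36 unique pairs
# that pdist evaluates on the 9 concatenated words.
matrix = similarity_matrix(arr2, arr1, difflib_ratio)

print(len(matrix), len(matrix[0]))
```

To get the DataFrame from the question, the nested list can be passed to pd.DataFrame(matrix, index=arr2, columns=arr1), and score can be swapped for Levenshtein.ratio if that package is installed.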
