Numpy - constructing a matrix of Jaro (or Levenshtein) distances using numpy.fromfunction


Question

I am doing some text analysis right now, and as part of it I need a matrix of Jaro distances between all of the words in a specific list (that is, a pairwise distance matrix) like this one:

       │CHEESE CHORES GEESE  GLOVES
───────┼───────────────────────────
CHEESE │    0   0.222  0.177  0.444     
CHORES │0.222       0  0.422  0.333
GEESE  │0.177   0.422      0  0.300
GLOVES │0.444   0.333  0.300      0

So I tried to construct it using numpy.fromfunction. As I read the documentation and examples, it passes coordinates to the function, collects the results, and builds the matrix from them.

I tried the following:

import numpy as np
from jellyfish import jaro_distance

def distance(i, j):
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])

feature_dict = 'CHEESE CHORES GEESE GLOVES'.split()
distance_matrix = np.fromfunction(distance, shape=(len(feature_dict),len(feature_dict)))

Note: jaro_distance just accepts two strings and returns a float.

I get an error:

File "<pyshell#26>", line 4, in distance
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])
TypeError: only integer arrays with one element can be converted to an index

I added print(i) and print(j) at the beginning of the function and found that, instead of actual coordinates, something odd is passed:

[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 3.  3.  3.  3.]]
[[ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]
 [ 0.  1.  2.  3.]]

Why? The examples on the numpy site clearly show that just two integers are passed, nothing else.

I tried to reproduce their example exactly using a lambda function, but I get exactly the same error:

distance_matrix = np.fromfunction(lambda i, j: 1 - jaro_distance(feature_dict[i], feature_dict[j]), shape=(len(feature_dict),len(feature_dict)))

Any help is appreciated. I assume I have misunderstood something.

Recommended answer

As suggested by @xnx, I investigated the question and found that np.fromfunction does not pass coordinates one by one; it passes all of the indices at the same time. So if the shape of the array is (2,2), numpy will not call f(0,0), f(0,1), f(1,0), f(1,1), but will instead make a single call:

f([[0., 0.], [1., 1.]], [[0., 1.], [0., 1.]])
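
This is easy to confirm with a small probe. The sketch below is only an illustration: probe and calls are hypothetical names that are not part of the original code, and dtype=int is passed so that the index arrays hold integers (the default dtype is float, which is why the printout in the question shows 0., 1., 2., 3.).

```python
import numpy as np

calls = []  # records the arguments of every call made by np.fromfunction

def probe(i, j):
    # Store what was passed in, then return something computed from it.
    calls.append((i, j))
    return i * 10 + j

result = np.fromfunction(probe, (2, 2), dtype=int)

print(len(calls))   # number of times probe was called
print(calls[0][0])  # row indices, passed as one whole array
print(calls[0][1])  # column indices, passed as one whole array
print(result)
```

The function is called once, and each argument is a full index array. Arithmetic such as i * 10 + j works on arrays element by element, but list indexing such as feature_dict[i] does not, which is where the TypeError comes from.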

But it looks like my specific function can be vectorized, and it will then produce the needed results. The code to achieve this is below:

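from jellyfish import jaro_distance
import numpy as np

def distance(i, j):
    # i and j are single indices here, because np.vectorize calls this
    # function once per element of the index arrays.
    return 1 - jaro_distance(feature_dict[i], feature_dict[j])

feature_dict = 'CHEESE CHORES GEESE GLOVES'.split()

funcProxy = np.vectorize(distance)

# dtype=int makes fromfunction build integer index arrays (the default is
# float), so the values can be used to index the list.
distance_matrix = np.fromfunction(funcProxy, shape=(len(feature_dict),len(feature_dict)), dtype=int)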

It works fine.
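
For the Levenshtein case mentioned in the title, the same pattern can be sketched without jellyfish. The levenshtein function below is a plain dynamic-programming stand-in written for this illustration, not the jellyfish implementation, and words, distance and expected are illustrative names. The sketch also builds the matrix with a nested list comprehension to show that both routes give the same result.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for row, char_a in enumerate(a, start=1):
        current = [row]  # distance from a[:row] to ""
        for col, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[col] + 1,          # delete char_a
                               current[col - 1] + 1,       # insert char_b
                               previous[col - 1] + cost))  # substitute
        previous = current
    return previous[-1]

words = 'CHEESE CHORES GEESE GLOVES'.split()

def distance(i, j):
    # Receives one pair of integer indices per call, thanks to np.vectorize.
    return levenshtein(words[i], words[j])

n = len(words)
distance_matrix = np.fromfunction(np.vectorize(distance), (n, n), dtype=int)

# The same matrix built directly, without fromfunction.
expected = np.array([[levenshtein(a, b) for b in words] for a in words])

print(distance_matrix)
print(np.array_equal(distance_matrix, expected))
```

Note that np.vectorize is a convenience wrapper that still loops in Python, so it should not be expected to run faster than the list comprehension.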
