Calculate pairwise simhash "distances"


Question

I want to construct a pairwise distance matrix where the "distances" are the similarity scores between two strings, as implemented here. I was thinking of using scikit-learn's pairwise_distances function to do this, since I've used it before for other calculations and it is easy to parallelize.

Here is the relevant piece of code:

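from sklearn.metrics import pairwise_distances
# simhash is the class from the implementation referenced above


def hashdistance(str1, str2):
    # Hash both strings, then turn the similarity score into a distance
    hash1 = simhash(str1)
    hash2 = simhash(str2)

    distance = 1 - hash1.similarity(hash2)

    return distance


strings = [d['string'] for d in data]
# This call raises ValueError: could not convert string to float
distance_matrix = pairwise_distances(strings, metric=lambda u, v: hashdistance(u, v))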

strings looks like ['foo', 'bar', 'baz'].

When I try this, it throws ValueError: could not convert string to float. This might be a really dumb question, but I'm not sure why a conversion needs to happen here or why it throws that error: the anonymous function passed as metric takes strings and returns a float. Why do the inputs need to be floats, and how can I create this pairwise distance matrix based on simhash "distances"?

Recommended answer

According to the documentation, only metrics from scipy.spatial.distance are allowed, or one of the callables listed in:

In [26]: sklearn.metrics.pairwise.pairwise_distance_functions
Out[26]:
{'cityblock': <function sklearn.metrics.pairwise.manhattan_distances>,
 'euclidean': <function sklearn.metrics.pairwise.euclidean_distances>,
 'l1': <function sklearn.metrics.pairwise.manhattan_distances>,
 'l2': <function sklearn.metrics.pairwise.euclidean_distances>,
 'manhattan': <function sklearn.metrics.pairwise.manhattan_distances>}

One issue is that if metric is callable, sklearn.metrics.pairwise.check_pairwise_arrays tries to convert the input to float, hence your error. scipy.spatial.distance.pdist does something similar, so you're out of luck there too.
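The failure mode can be mimicked in plain Python. This is only an illustration of the kind of conversion that fails, not the actual validation code in scikit-learn or SciPy:

```python
strings = ["foo", "bar", "baz"]

try:
    # Illustration only: coerce each input to float, the way numeric
    # input validation would before the metric is ever called.
    values = [float(s) for s in strings]
except ValueError as exc:
    message = str(exc)

print(message)
```

The metric callable never gets a chance to run, because the conversion fails before any pair of inputs is compared.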

Even if you could pass a callable, it wouldn't scale very well, since the loop in pairwise_distances is pure Python. It looks like you'll have to write the loop yourself. I would suggest reading the source code of pdist and/or pairwise_distances for hints on how to do this.
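A minimal sketch of such a hand-written loop follows. It makes several assumptions: string_distance is a hypothetical stand-in built on difflib.SequenceMatcher so that the example runs without a simhash library, pairwise_string_distances is an illustrative name, and the distance is assumed to be symmetric with zero distance from a string to itself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_distance(a, b):
    """Hypothetical stand-in for hashdistance: 1 - similarity, in [0, 1]."""
    # Assumption: SequenceMatcher.ratio() replaces the simhash similarity here.
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def pairwise_string_distances(strings, distance):
    """Return a symmetric n x n distance matrix as a list of lists."""
    n = len(strings)
    matrix = [[0.0] * n for _ in range(n)]
    # Evaluate each unordered pair once, then mirror the result.
    for i, j in combinations(range(n), 2):
        d = distance(strings[i], strings[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


strings = ["foo", "bar", "baz"]
distance_matrix = pairwise_string_distances(strings, string_distance)
```

Passing hashdistance from the question in place of string_distance should give the simhash-based matrix, provided the similarity score is symmetric. Because the strings never pass through numeric input validation, no float conversion is attempted.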
