Python: String clustering with scikit-learn's dbscan, using Levenshtein distance as metric


Problem Description

I have been trying to cluster multiple datasets of URLs (around 1 million each) to find the original and the typo variants of each URL. I decided to use Levenshtein distance as the similarity metric, along with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters.

I am facing some problems using scikit-learn's implementation of DBSCAN.

The snippet below works on small datasets in the format I am using, but since it precomputes the entire distance matrix, it takes O(n^2) space and time, which is far too much for my large datasets. I have run it for many hours, but it just ends up consuming all the memory of my PC.

import numpy as np
import sklearn.cluster
import distance  # the `distance` package provides distance.levenshtein

lev_similarity = -1 * np.array([[distance.levenshtein(w1[0], w2[0]) for w1 in words] for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1)
dbscan.fit(lev_similarity)

So I figured I needed some way to compute the similarity on the fly, and hence tried this method:

dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric=distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

I realize this means that it is trying to convert the inputs to the similarity function into floats. But I don't want it to do that. As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the Levenshtein distance should do.
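
For instance, calling the metric directly on two strings works fine on its own (using the same `distance` package as in the snippets above):

import distance

distance.levenshtein("google.com", "gogle.com")   # returns 1, a plain number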

I am stuck at this point, as I do not know the implementation details of sklearn's DBSCAN well enough to see why it is trying to convert the input to floats, and I don't have any better idea of how to avoid the O(n^2) matrix computation.

Please let me know if there is any better or faster way to cluster these many strings that I may have overlooked.

Recommended Answer

From the scikit-learn FAQ, you can do this by making a custom metric:

from leven import levenshtein
import numpy as np
from sklearn.cluster import dbscan

data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)   # each "sample" is just an index into data
dbscan(X, metric=lev_metric, eps=5, min_samples=2)
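
Applied to the URL use case from the question, the same index trick would look something like this (a sketch only; `urls` stands in for your actual list, and eps/min_samples will need tuning for your data):

import numpy as np
from leven import levenshtein
from sklearn.cluster import dbscan

urls = ["http://example.com", "http://exmaple.com", "http://example.com/", "http://other.org"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])   # recover the indices packed into X
    return levenshtein(urls[i], urls[j])

X = np.arange(len(urls)).reshape(-1, 1)
core_samples, labels = dbscan(X, metric=lev_metric, eps=2, min_samples=2)
# labels[i] is the cluster id of urls[i]; -1 marks points treated as noise

Note that this avoids materializing the full O(n^2) distance matrix in memory, but the pairwise Levenshtein evaluations themselves can still be slow on a million URLs.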

