k-means using signature matrix generated from minhash


Problem description


I have used minhash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good: comparing the Jaccard distances of known similar documents (say, two articles about the same sports team or two articles about the same world event) gives correct readings.

My question is: does it make sense to use this signature matrix to perform k-means clustering?

I've tried using the signature vectors of documents and calculating the Euclidean distance of these vectors inside the iterative k-means algorithm, and I always get nonsense for my clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), yet in the end my two clusters are always just random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers similar hash values in the two signature matrices.
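A tiny illustration of the two comparisons, using made-up 128-value signatures (the numbers are synthetic, purely to show the difference in scale): the estimated Jaccard similarity is just the fraction of positions where two signatures agree, while the Euclidean distance is dominated by the arbitrary magnitudes of the hash integers.

import numpy as np

rng = np.random.default_rng(0)
sig_a = rng.integers(0, 2**32, size=128)              # stand-in for a 128-perm minhash signature
sig_b = sig_a.copy()
sig_b[::4] = rng.integers(0, 2**32, size=32)          # perturb a quarter of the slots

print(np.mean(sig_a == sig_b))                        # estimated Jaccard similarity, roughly 0.75
print(np.linalg.norm((sig_a - sig_b).astype(float)))  # Euclidean distance: enormous, driven by random hash magnitudes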

[Edited to highlight the question]

Solution

TL;DR

Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.

Some explanations

I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a features matrix X on which you run k-means.

I'm guessing you're doing something like:

# THIS CODE IS AN EXAMPLE OF THE WRONG APPROACH! DON'T IMPLEMENT!
import numpy as np
from datasketch import MinHash   # assuming the datasketch MinHash implementation
from nltk import ngrams          # assuming nltk's ngrams for character 3-grams
from sklearn.cluster import KMeans

# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)

# Minhash all the strings
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues

# Compute clusters
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)

This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) which appear in that sample of text... but they're not column-aligned, so the features are scattered into the wrong dimensions.

To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values, then create an [n_samples x n_unique_hashes] matrix (where n_unique_hashes is the number of unique features found), setting it to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix will be large and sparse. You could then cluster on that.
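A minimal sketch of that conversion, continuing from the minhash_values array above (the hash_to_column name is just illustrative):

import numpy as np
from scipy.sparse import lil_matrix

# Give every distinct hash value its own column
unique_hashes = np.unique(minhash_values)
hash_to_column = {h: i for i, h in enumerate(unique_hashes)}

# Binary matrix: 1 where a sample's signature contains that hash, 0 elsewhere
features = lil_matrix((minhash_values.shape[0], len(unique_hashes)), dtype='uint8')
for row, signature in enumerate(minhash_values):
    for h in signature:
        features[row, hash_to_column[h]] = 1

features = features.tocsr()   # the large, sparse feature matrix you could then cluster on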

Alternative way of text clustering

What an unbelievable hassle though! Fortunately, scikit-learn is there to help. It provides some very easy-to-use and scalable vectorisers, such as HashingVectorizer and TfidfVectorizer.

So your problem becomes easily solved:

# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

# Get your data
data = get_your_list_of_strings_to_cluster()

# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)

# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)

And there you go. From there:

  • Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
  • Try other clusterers (for example, I find HDBSCAN miles better than k-means - quicker, more robust, more accurate, less tuning); a rough sketch combining both ideas is below.
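A rough sketch combining both suggestions, assuming the separate hdbscan package is installed (the parameter values here are only illustrative starting points):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import hdbscan

data = get_your_list_of_strings_to_cluster()

# TF-IDF weighted features instead of raw hashed counts
text_features = TfidfVectorizer(analyzer="word", stop_words="english").fit_transform(data)

# Reduce the sparse matrix to a dense, lower-dimensional representation for HDBSCAN
reduced = TruncatedSVD(n_components=100).fit_transform(text_features)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
clusters = clusterer.fit_predict(reduced)   # label -1 marks points HDBSCAN treats as noise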

Hope this helps.

Tom
