scikit-learn: how to know which documents are in each cluster?

Views: 73 | Posted: 2020/4/26 10:19:32 | Tags: python, cluster-analysis, scikit-learn, k-means


Question

I am new to both Python and scikit-learn, so please bear with me.

I took the source code for the k-means clustering algorithm from a k-means clustering example.

I then modified it to run on my local data set by using the load_files function.

Although the algorithm terminates, it does not produce any output, such as which documents are clustered together.

I found that the km object has a "km.labels_" array, which lists the centroid id of each document.

It also has the centroid vectors in "km.cluster_centers_".

But which document is it? I have to map it to "dataset", which is a "Bunch" object.

If I print dataset.data[0], I get the data of the first file, which I think has been shuffled. But I just want to know the file name.

I am confused by questions like: is the document at dataset.data[0] assigned to the centroid listed at km.labels_[0]?

My basic problem is to find which files are clustered together. How can I find that?

Recommended answer

Forget about the Bunch object. It's just an implementation detail for loading the toy datasets that are bundled with scikit-learn.

In real life, with your real data, you just have to call directly:

km = KMeans(n_clusters).fit(my_document_features)

Then collect the cluster assignments from:

km.labels_

my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).

km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element of labels_ is the index of the cluster of the document described by the first row of the my_document_features feature matrix, the second element corresponds to the second row, and so on.
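To make the row-to-label correspondence concrete, here is a minimal sketch. The toy feature matrix, the choice of two clusters, and the n_init and random_state values are assumptions made purely for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy feature matrix: 6 "documents", 2 features each.
# Rows 0-2 sit near the origin, rows 3-5 sit near (10, 10).
my_document_features = np.array([
    [0.0, 0.1],
    [0.2, 0.0],
    [0.1, 0.2],
    [10.0, 10.1],
    [10.2, 9.9],
    [9.9, 10.0],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# One label per row, in the same order as the rows of the feature matrix.
print(km.labels_.shape)           # (6,)
print(km.cluster_centers_.shape)  # (2, 2)

# labels_[i] is the cluster index of row i. The index values themselves
# are arbitrary; only "same label" versus "different label" is meaningful.
for row_index, cluster_id in enumerate(km.labels_):
    print(row_index, cluster_id)
```

Rows 0 to 2 should share one label and rows 3 to 5 the other; which group gets label 0 is not something to rely on.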

Typically you would build my_document_features with a TfidfVectorizer object:

my_document_features = TfidfVectorizer().fit_transform(my_text_documents)

Here my_text_documents would be a list of Python unicode strings if you read the documents directly (e.g. from a database, from the rows of a single CSV file, or from wherever you want). Alternatively, you can let the vectorizer read the files itself:

vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)

where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded in UTF-8).
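As a rough illustration of the input='filename' variant, the sketch below writes a few small files to a temporary directory and vectorizes them by path. The file names and contents are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical file names and contents, written to a temporary directory
# only so that the example can run on its own.
contents = {
    "cats.txt": "cats purr and cats sleep",
    "dogs.txt": "dogs bark and dogs run",
    "kittens.txt": "kittens purr and kittens sleep",
}

tmp_dir = tempfile.mkdtemp()
my_text_files = []
for name, text in contents.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    my_text_files.append(path)

# input="filename" makes the vectorizer open and read each path itself,
# decoding with its default encoding (UTF-8).
vec = TfidfVectorizer(input="filename")
my_document_features = vec.fit_transform(my_text_files)

# One row per file, in the same order as my_text_files.
print(my_document_features.shape[0] == len(my_text_files))
```

Because the rows follow the order of my_text_files, keeping that list around is all that is needed to get back from a row index to a file name.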

The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping with km.labels_ is direct: the i-th label belongs to the i-th entry of the list.
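One possible way to answer "which files are clustered together" is to pair each name with its label and group by label. This is a sketch under assumptions: the names in my_document_names, the four short texts, and the KMeans settings are all made up for illustration.

```python
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical document names and texts, kept in two parallel lists.
my_document_names = ["a.txt", "b.txt", "c.txt", "d.txt"]
my_text_documents = [
    "apple banana apple fruit",
    "banana fruit apple",
    "engine wheel car road",
    "car road wheel",
]

my_document_features = TfidfVectorizer().fit_transform(my_text_documents)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# labels_[i] belongs to my_document_names[i], so zip pairs them up directly.
clusters = defaultdict(list)
for name, label in zip(my_document_names, km.labels_):
    clusters[int(label)].append(name)

for label, names in sorted(clusters.items()):
    print(label, names)
```

The same loop works unchanged if the names are file paths passed to a vectorizer created with input='filename'.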

As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see that we use n_samples instead of n_documents to document the expected shapes of the arguments and attributes of all the estimators in the library.
