scikit-learn: how to know which documents are in each cluster?

Question

I am new to both Python and scikit-learn, so please bear with me.

I started from existing source code for the k-means clustering algorithm.

I then modified it to run on my local data set by using the load_files function.

Although the algorithm terminates, it does not produce any output, such as which documents are clustered together.

I found that the km object has a km.labels_ array, which lists the centroid id of each document.

It also has the centroid vectors in km.cluster_centers_.

But which document is it? I have to map it back to dataset, which is a Bunch object.

If I print dataset.data[0], I get the data of the first file, and I think the files are shuffled. But I just want to know the file name.

I am confused by questions such as: is the document at dataset.data[0] assigned to the centroid given by km.labels_[0]?

My basic problem is finding out which files are clustered together. How can I find that?

Recommended answer

Forget about the Bunch object. It is just an implementation detail used to load the toy datasets that are bundled with scikit-learn.

In real life, with your real data, you just have to call directly:

km = KMeans(n_clusters).fit(my_document_features)

Then collect the cluster assignments from:

km.labels_

my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).

km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element of labels_ is the index of the cluster assigned to the document described in the first row of the my_document_features feature matrix, the second element belongs to the second row, and so on.
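
Since the labels line up with the rows of the feature matrix, grouping documents by cluster is a matter of zipping the two together. The following is only a minimal sketch: the document names and the tiny feature matrix are made up for illustration, and the n_init and random_state values are arbitrary choices, not something prescribed above.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical document names and a made-up feature matrix:
# one row per document, in the same order as the names.
my_document_names = ["doc_a", "doc_b", "doc_c", "doc_d"]
my_document_features = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# km.labels_[i] is the cluster index of row i, hence of my_document_names[i].
documents_by_cluster = defaultdict(list)
for name, label in zip(my_document_names, km.labels_):
    documents_by_cluster[int(label)].append(name)

for label, names in sorted(documents_by_cluster.items()):
    print(label, names)
```

The cluster indices themselves are arbitrary; only which documents share an index is meaningful.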

Typically you would build my_document_features with a TfidfVectorizer object:

my_document_features = TfidfVectorizer().fit_transform(my_text_documents)

Here my_text_documents would be a list of Python unicode strings if you read the documents directly (e.g. from a database, from rows of a single CSV file, or from whatever source you want). Alternatively, you can let the vectorizer read the files itself:

vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)

where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded using UTF-8).

The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping with km.labels_ is direct.
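
To answer the original question of which files end up together, the file list can be paired with km.labels_ position by position. The sketch below is illustrative only: the file names and contents are invented and written to a temporary directory so that it runs on its own, and the KMeans parameters are arbitrary choices.

```python
import os
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Write a few made-up documents to a temporary directory so the sketch is
# self-contained; in practice my_text_files would list your own files.
tmp_dir = tempfile.mkdtemp()
contents = {
    "cats_1.txt": "cat kitten purr meow cat",
    "cats_2.txt": "kitten meow cat purr",
    "cars_1.txt": "engine wheel car road engine",
    "cars_2.txt": "car road wheel engine",
}
my_text_files = []
for file_name, text in contents.items():
    path = os.path.join(tmp_dir, file_name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    my_text_files.append(path)

vec = TfidfVectorizer(input="filename")
my_document_features = vec.fit_transform(my_text_files)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# Same order, same length: my_text_files[i] belongs to cluster km.labels_[i].
for path, label in zip(my_text_files, km.labels_):
    print(label, os.path.basename(path))
```

Keeping your own list of paths, rather than relying on a loader object, is what makes the file name lookup trivial.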

As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see that we use n_samples instead of n_documents to document the expected shapes of the arguments and attributes of all the estimators in the library.
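
The same shape convention can be checked on data that has nothing to do with documents. This is a small sketch under assumptions: the samples are synthetic points generated with make_blobs, and the sizes and random_state are picked arbitrarily.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up numeric samples (not documents): X has shape (n_samples, n_features).
X, _ = make_blobs(n_samples=30, n_features=5, centers=3, random_state=0)
n_samples, n_features = X.shape

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_.shape)           # one label per sample
print(km.cluster_centers_.shape)  # one centroid per cluster, n_features wide
```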
