Document Clustering Basics

Problem description

So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild...

My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant "bag-of-words" type model? Do I then proceed to create vectors of word counts for each document? How do I compare these documents using something like K-means clustering?
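For concreteness, the pipeline being asked about could look like the following minimal sketch using scikit-learn; the docs list, the cluster count, and the parameter values are made-up stand-ins, not part of the original question.

# A minimal sketch of the pipeline described above, assuming scikit-learn.
# The `docs` list and the cluster count are illustrative stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

docs = [
    "clustering text documents by topic",
    "k-means groups similar documents together",
    "recipes for baking bread at home",
]

# Parse the words from each document and build one shared vocabulary
# (the "bag of words"); X holds one sparse word-count vector per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# K-means then compares those count vectors by distance and assigns each
# document to the nearest of k cluster centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))  # e.g. [0 0 1]: one cluster label per document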

Recommended answer

Try tf-idf for starters.
If you read Python, look at "Clustering text documents using MiniBatchKMeans" in scikit-learn:
"an example showing how scikit-learn can be used to cluster documents by topics using a bag-of-words approach".
Then feature_extraction/text.py in the source has very nice classes.
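A condensed sketch in the spirit of that scikit-learn example follows; it is not the example verbatim, and the newsgroup categories, parameter values, and cluster count here are arbitrary choices. It weights terms with TfidfVectorizer (one of the classes in feature_extraction/text.py) and clusters with MiniBatchKMeans.

# A condensed sketch of tf-idf + MiniBatchKMeans with scikit-learn.
# fetch_20newsgroups downloads a small text corpus on first use.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Two well-separated categories keep the run quick and the clusters clear.
data = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.sport.baseball"],
    remove=("headers", "footers", "quotes"),
)

# TfidfVectorizer turns raw text into tf-idf weighted vectors;
# stop words and very common/rare terms are filtered out.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.5, min_df=2)
X = vectorizer.fit_transform(data.data)

# MiniBatchKMeans clusters the tf-idf vectors in small batches,
# which scales better than plain KMeans on large corpora.
km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=42)
km.fit(X)

# Show the highest-weighted terms in each cluster centroid as a sanity check.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[-8:][::-1]
    print(f"cluster {i}:", ", ".join(terms[j] for j in top))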
