How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

Question

I'm trying to cluster some text documents using scikit-learn. I'm trying out both DBSCAN and MeanShift and want to determine which hyperparameters (e.g. bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I'm using (news articles).

I have some test data consisting of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how it can be applied in this case (or whether it can be at all): it needs the data to be split into training and test parts, whereas I want to run the evaluation on the entire dataset and compare the results to the pre-labeled data.

I have been trying to specify a scoring function that compares the estimator's labels to the true labels, but of course it doesn't work, because only a sample of the data has been clustered, not all of it.

What's an appropriate approach here?

Recommended answer

Have you considered implementing the search yourself?

It's not particularly hard to implement a for loop. Even if you want to optimize two parameters, it's still fairly easy.
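As a rough illustration of such a loop, here is a minimal sketch that fits DBSCAN on the whole dataset for every combination of two parameters and scores each result against the known labels. The helper name `grid_search_dbscan`, the candidate values, the synthetic `make_blobs` data, and the choice of the adjusted Rand index as the score are all assumptions made for the example, not something the answer prescribes.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def grid_search_dbscan(X, true_labels, eps_values, min_samples_values):
    """Hypothetical helper: try every (eps, min_samples) pair on ALL of X.

    Returns (best_score, best_params), where the score is the adjusted
    Rand index between the known labels and the DBSCAN labels.
    """
    best_score, best_params = float("-inf"), None
    for eps, min_samples in product(eps_values, min_samples_values):
        # No train/test split: cluster the full dataset every time.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        score = adjusted_rand_score(true_labels, labels)
        if score > best_score:
            best_score = score
            best_params = {"eps": eps, "min_samples": min_samples}
    return best_score, best_params


# Synthetic stand-in for the pre-labeled data (assumption: three compact,
# well-separated groups; real article vectors will look very different).
X, y = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

best_score, best_params = grid_search_dbscan(
    X, y, eps_values=[0.1, 0.5, 1.0, 2.0], min_samples_values=[3, 5, 10]
)
print(best_score, best_params)
```

For TF-IDF vectors one would probably pass something like `metric="cosine"` to `DBSCAN` and search `eps` over a range that makes sense for that metric. A MeanShift version would loop over `bandwidth` in the same way.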

For both DBSCAN and MeanShift, however, I advise first understanding your similarity measure. It makes more sense to choose the parameters based on an understanding of your measure than to optimize parameters to match some labels (which carries a high risk of overfitting).

In other words, at which distance are two articles supposed to be clustered?

If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function such that the actual similarity values are meaningful again. TF-IDF weights are standard on text, but mostly in a retrieval context; they may work much worse in a clustering context.
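One way to get a feel for the distance question is to look at the pairwise distances directly, using the pre-labeled data to understand the measure instead of as a tuning target. The sketch below is only an illustration: the four toy documents, their labels, and the choice of cosine distance on TF-IDF vectors are assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# Toy stand-in corpus (hypothetical); labels mark which articles belong together.
docs = [
    "stock market shares fall as investors worry",
    "investors sell shares as stock market drops",
    "football team wins league title match",
    "league match ends as football team wins",
]
labels = np.array([0, 0, 1, 1])

# TF-IDF vectors and the full matrix of cosine distances between documents.
X = TfidfVectorizer().fit_transform(docs)
D = pairwise_distances(X, metric="cosine")

# Look at each unordered pair once (upper triangle, diagonal excluded).
rows, cols = np.triu_indices(len(docs), k=1)
same_label = labels[rows] == labels[cols]
within = D[rows, cols][same_label]
between = D[rows, cols][~same_label]

print("within-cluster distances: ", np.round(within, 3))
print("between-cluster distances:", np.round(between, 3))
```

If the two groups of distances overlap heavily, or the within-cluster distances differ a lot from one cluster to the next, a single `eps` or `bandwidth` is unlikely to separate the articles well, whatever value a search picks.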

Also beware that MeanShift (like k-means) needs to recompute coordinates. On text data this may yield undesired results, where the updated coordinates actually get worse instead of better.
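One possible reason averaged coordinates can get worse on text, offered here as an interpretation rather than something the answer states, is that the mean of several sparse TF-IDF vectors is much denser than any single document, so it stops resembling a real article. A small sketch with three made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical documents that share no vocabulary.
docs = [
    "central bank raises interest rates",
    "football club signs new striker",
    "storm floods coastal towns overnight",
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Number of non-zero terms in each document vector.
nonzeros_per_doc = (X != 0).sum(axis=1)

# A recomputed "coordinate" in the style of MeanShift or k-means: the mean vector.
centroid = X.mean(axis=0)
nonzeros_in_centroid = int((centroid != 0).sum())

print(nonzeros_per_doc, nonzeros_in_centroid)
```

The averaged vector carries a little weight on every term from every document, which is not a property any individual article has.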
