What are noisy samples in Scikit's DBSCAN clustering algorithm?


Problem description

If I apply Scikit's DBSCAN (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) on a similarity matrix, I get a series of labels back. Some of these labels are -1. The documentation calls them noisy samples.

What are these? Do they all belong to a single cluster, or do they each belong to their own cluster since they're noisy?

Thank you

Solution

These are not part of any cluster. They all share the label -1, but that label does not denote a cluster of its own: it simply marks points that do not belong to any cluster and can be "ignored" to some extent.

Remember, DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise." DBSCAN checks whether a point has enough neighbors within a specified radius (in scikit-learn, the min_samples and eps parameters) to decide whether it belongs in a cluster.

But what happens to the points that do not meet the criteria for falling into any of the main clusters? What if a point does not have enough neighbors within the specified radius, and is not within that radius of a point that does, to be considered part of a cluster? These are the points that are given the label -1 and are considered noise.
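As a rough illustration of the labeling described above, here is a minimal sketch. The toy data and the eps and min_samples values are made up for this example and are not taken from the question. Two isolated points, far from each other, both come back labeled -1, which shows that the label marks "no cluster" rather than a shared cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumed for illustration): two tight groups plus two isolated
# points that are far from the groups and far from each other.
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
    [9.0, 0.0],                           # isolated point
    [0.0, 9.0],                           # another isolated point
])

# eps is the neighborhood radius; min_samples counts the point itself.
# Both values are arbitrary choices that suit this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# The dense groups get labels 0 and 1; both isolated points get -1.
print(labels)
```

With real data, suitable eps and min_samples values depend on the scale and density of the points, so the numbers above should not be reused as-is.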

So what?

Well, if you are analyzing data points and are only interested in the general clusters, you can reduce the size of the data by cutting out the noise. Or, if you are using cluster analysis to classify data, in some cases it is possible to discard the noise as outliers.

In anomaly detection, points that do not fit into any category are also significant, as they can represent a problem or rare event.
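Both uses of the noise label can be sketched with a boolean mask: drop the -1 points when only the clusters matter, or keep just those points when looking for anomalies. The data, parameter values, and variable names below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumed for illustration): two dense groups and two stray points.
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [9.0, 0.0],
    [0.0, 9.0],
])

# fit_predict returns one label per row of X; -1 marks noise.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

noise_mask = labels == -1

# Keep only the clustered points when the noise is not of interest.
clustered = X[~noise_mask]

# Keep only the noise when looking for outliers or rare events.
outliers = X[noise_mask]

# Count real clusters by ignoring the -1 label.
n_clusters = np.unique(labels[~noise_mask]).size

print(n_clusters, len(clustered), len(outliers))
```

Whether the -1 points are safe to discard or worth investigating depends on the application, as the answer notes.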
