Choosing eps and minPts for DBSCAN (R)?

Problem description

I've been searching for an answer for this question for quite a while, so I'm hoping someone can help me. I'm using dbscan from the fpc library in R. For example, I am looking at the USArrests data set and am using dbscan on it as follows:

library(fpc)
# DBSCAN with neighbourhood radius eps = 20; MinPts is left at its default
ds <- dbscan(USArrests, eps = 20)

In this case I chose eps purely by trial and error. However, I am wondering if there is a function or code available to automate the choice of the best eps/minPts. I know some books recommend plotting the sorted distances to the kth nearest neighbour. That is, the x-axis represents "Points sorted according to distance to kth nearest neighbour" and the y-axis represents the "kth nearest neighbour distance".

This type of plot is useful for helping choose an appropriate value for eps and minPts. I hope I have provided enough information for someone to be able to help me out. I wanted to post a picture of what I mean, but I'm still a newbie so I can't post an image just yet.

Recommended answer

There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.
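One way to see how minPts changes the result is to hold eps fixed and count clusters and noise points for a few values. The following is only a sketch: it assumes the fpc package from the question is installed, reuses eps = 20 from the question, and the grid of MinPts values is an arbitrary choice for illustration.

```r
library(fpc)  # same package as in the question

# Count clusters and noise points for several MinPts values at a fixed eps.
# eps = 20 comes from the question; the MinPts grid is an arbitrary choice.
min_pts_grid <- c(2, 3, 5, 8)
sweep <- t(sapply(min_pts_grid, function(m) {
  fit <- fpc::dbscan(USArrests, eps = 20, MinPts = m)
  c(MinPts   = m,
    clusters = length(setdiff(unique(fit$cluster), 0)),  # label 0 marks noise
    noise    = sum(fit$cluster == 0))
}))
sweep
```

Reading the table row by row shows how many points move between clusters and noise as MinPts grows. Which row is "right" still depends on what you want to find.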

For epsilon, there are various aspects. It again boils down to choosing whatever works on this data set and this minPts and this distance function and this normalization. You can try to do a knn distance histogram and choose a "knee" there, but there might be no visible one, or multiple.
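The k-nearest-neighbour distance plot described in the question can be sketched in base R as follows. The helper name knn_dist and the choice k = 4 are assumptions made for illustration, and the full distance matrix is only practical for small data sets such as USArrests. The dbscan package also offers a ready-made kNNdistplot function for the same purpose.

```r
# k-th nearest-neighbour distance for every row of a numeric data set.
# knn_dist is a hypothetical helper, not part of fpc.
knn_dist <- function(x, k) {
  d <- as.matrix(dist(x))  # full pairwise Euclidean distances
  # Sort each row; position 1 is the point itself (distance 0),
  # so position k + 1 is the k-th nearest neighbour.
  apply(d, 1, function(row) sort(row)[k + 1])
}

k <- 4  # assumed choice; pick it in line with your minPts
kd <- sort(knn_dist(USArrests, k))
plot(kd, type = "l",
     xlab = "Points sorted by distance to k-th nearest neighbour",
     ylab = paste0(k, "-NN distance"))
```

A candidate eps is the y-value where the curve bends sharply upward. As noted above, the bend may be absent or there may be several, so the plot narrows the search rather than deciding it.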

OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.

Naively, one can imagine OPTICS as doing all values of Epsilon at the same time, and putting the results in a cluster hierarchy.

The first thing you need to check however - pretty much independent of whatever clustering algorithm you are going to use - is to make sure you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
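For the USArrests example, a quick check of the column spreads illustrates the normalization point. This is a minimal sketch using base R only. Standardising with scale() is one common choice and is an assumption here, not necessarily the right normalization for every data set.

```r
# Column spread on the raw scale: the variable with the largest standard
# deviation dominates the Euclidean distance.
col_sd <- apply(USArrests, 2, sd)
round(col_sd, 1)

# Standardise each column to mean 0 and standard deviation 1.
scaled <- scale(USArrests)

# Compare the range of pairwise distances before and after scaling.
# An eps tuned on the raw data (such as 20) does not carry over.
summary(as.vector(dist(USArrests)))
summary(as.vector(dist(scaled)))
```

After any change of normalization or distance function, eps has to be chosen again, for example by redrawing the k-nearest-neighbour distance plot on the transformed data.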
