Parameter estimation in DBSCAN

Views: 292 | Published: 2020/10/3 2:06:16 | Tags: data-mining, cluster-analysis, dbscan

Question

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried k-means clustering, but it did not work well: the clusters overlapped heavily across the classes I was looking for, probably because the classes are non-globular and k-means uses random initialisation.

I am now trying DBSCAN, but I have trouble understanding the epsilon and minPts parameters of this clustering algorithm. Can I use random values, or do I need to compute them? In particular, how do I compute epsilon if I need to?

Recommended answer

Use your domain knowledge to choose the parameters. Epsilon is a radius: two points count as neighbors when the distance between them is at most epsilon. You can think of it as a minimum cluster size, in the sense of spatial extent rather than point count.

Obviously, random values won't work very well. As a heuristic, you can look at a k-distance plot (each point's distance to its k-th nearest neighbor, sorted) and look for a knee in the curve, but that is not automatic either.
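A minimal sketch of the k-distance heuristic, in plain Python. The toy points, the function name k_distances, and the choice k=3 are assumptions made for illustration; in practice k is usually tied to minPts, though conventions differ on whether the point itself is counted.

```python
import math

def k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Toy data (made up): two tight groups of four points plus one far outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (20.0, 20.0)]

curve = k_distances(points, k=3)
for rank, d in enumerate(curve):
    print(rank, round(d, 3))
```

Plotting the printed values against their rank gives the k-distance plot. A candidate epsilon sits around the point where the curve drops from the large values of the noise points to the flat part made up of the clustered points; picking that point still takes judgment.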

Either way, the first thing to do is to choose a good distance function for your data, and to perform appropriate normalization.
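To illustrate why normalization matters for data like the question's, here is a small sketch. The preposition list, the counts, and the choice of relative frequencies with Euclidean distance are all assumptions for illustration, not something the answer prescribes; other distances (cosine, for instance) may suit such data better.

```python
import math

def to_relative_frequencies(counts):
    """Turn raw co-occurrence counts into a distribution that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

# Hypothetical counts of each noun with the prepositions (by, with, on, in).
noun_counts = {
    "hammer": [2, 40, 1, 3],
    "knife":  [1, 20, 0, 2],
    "monday": [0, 1, 60, 5],
}

normalized = {noun: to_relative_frequencies(c) for noun, c in noun_counts.items()}

# Raw counts: "hammer" and "knife" look far apart only because one is more frequent.
raw_hammer_knife = math.dist(noun_counts["hammer"], noun_counts["knife"])

# Normalized: the two instrument-like nouns are close, the time-like noun is far.
norm_hammer_knife = math.dist(normalized["hammer"], normalized["knife"])
norm_hammer_monday = math.dist(normalized["hammer"], normalized["monday"])

print(round(raw_hammer_knife, 3))
print(round(norm_hammer_knife, 3))
print(round(norm_hammer_monday, 3))
```

Without the normalization step, epsilon would mostly measure how frequent a noun is instead of how it distributes over prepositions.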

As for minPts, it again depends on your data and needs. One user may want a very different value than another. And of course minPts and epsilon are coupled: if you double epsilon, you will need to multiply minPts by roughly 2^d, where d is the dimensionality of the data (for Euclidean distance, because that is how the volume of a hypersphere grows).
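The 2^d rule can be checked numerically. The sketch below assumes roughly uniform density inside the neighborhood, so that the expected number of neighbors is proportional to the ball's volume; the helper names are hypothetical.

```python
import math

def ball_volume(radius, dim):
    """Volume of a dim-dimensional Euclidean ball of the given radius."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim

def scaled_min_pts(min_pts, eps_old, eps_new, dim):
    """Rescale minPts in proportion to the neighborhood volume.

    Assumes roughly uniform density, so this is a rule of thumb only.
    """
    ratio = ball_volume(eps_new, dim) / ball_volume(eps_old, dim)
    return max(1, round(min_pts * ratio))

# Doubling epsilon from 0.5 to 1.0 in several dimensionalities.
for dim in (1, 2, 3, 5):
    print(dim, scaled_min_pts(4, 0.5, 1.0, dim))
```

The constant factor of the volume formula cancels in the ratio, which leaves (eps_new / eps_old)^d. In high-dimensional data this grows very quickly, which is one reason small changes in epsilon can change the result drastically.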

If you want lots of small, finely detailed clusters, choose a low minPts. If you want larger and fewer clusters (and more noise), use a larger minPts. If you don't want any clusters at all, choose minPts larger than your data set size...
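The effect of minPts can be seen with a deliberately minimal DBSCAN sketch in plain Python. This is an illustrative O(n^2) implementation, not a library's code; the toy points and epsilon value are made up, and a neighborhood is assumed to include the point itself.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point, -1 meaning noise."""
    n = len(points)
    # Neighborhoods include the point itself (assumed minPts convention).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
        cluster += 1
    return labels

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {NOISE}), labels.count(NOISE)

# Toy data (made up): a group of four, a group of two, and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0),
          (20.0, 20.0)]

for min_pts in (2, 4, 8):
    print(min_pts, summarize(dbscan(points, eps=0.2, min_pts=min_pts)))
```

With a low minPts the small group of two still forms its own cluster; with a higher value it is absorbed into the noise; and once minPts exceeds the number of points, no point can be a core point, so everything is noise.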
