Clustering with DBSCAN: How to train a model if you don't set the number of clusters in advance?


Question

I am using the built-in iris dataset from sklearn for clustering. With KMeans I set the number of clusters in advance, but DBSCAN has no such parameter. How do I train a model if the number of clusters is not set in advance?

I tried:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline

from sklearn.cluster import DBSCAN,MeanShift
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix

iris = load_iris()
X = iris.data
y = iris.target

dbscan = DBSCAN(eps=0.3,min_samples=10)

dbscan.fit(X,y)

I am stuck at this point.

Solution

DBSCAN is a clustering algorithm and, as such, it does not employ the labels y. It is true that you can use its fit method as .fit(X, y) but, according to the docs:

y: Ignored

Not used, present here for API consistency by convention.
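
A minimal sketch to check this, assuming a recent scikit-learn install: fitting with and without y should produce identical cluster labels. The variable names below are illustrative and not part of the original post.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Same parameters as in the question; y is passed only in the second call.
labels_without_y = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_
labels_with_y = DBSCAN(eps=0.3, min_samples=10).fit(X, y).labels_

# True if passing y had no effect on the clustering.
same_labels = np.array_equal(labels_without_y, labels_with_y)
print(same_labels)
```

If the two label arrays match, the y argument played no role in the fit, which is what the documentation describes.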

The other characteristic of DBSCAN is that, in contrast to algorithms such as KMeans, it does not take the number of clusters as an input; instead, it estimates their number itself from the density of the data, as controlled by the eps and min_samples parameters.

Having clarified that, let's adapt the documentation demo with the iris data:

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, labels_true = load_iris(return_X_y=True) 
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.5,min_samples=5) # default parameter values
db.fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

Result:

Estimated number of clusters: 2
Estimated number of noise points: 17
Homogeneity: 0.560
Completeness: 0.657
V-measure: 0.604
Adjusted Rand Index: 0.521
Adjusted Mutual Information: 0.599
Silhouette Coefficient: 0.486

Let's plot them:

# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

That's it.

As with all clustering algorithms, the usual notions of supervised learning, such as a train/test split, prediction on unseen data, and cross-validation, do not apply here. Such unsupervised methods can be useful in an initial exploratory data analysis (EDA) to give a general idea about the data. As you may have noticed already, though, the findings of such an analysis are not necessarily useful for supervised problems: despite the existence of 3 labels in the iris dataset, the algorithm uncovered only 2 clusters.
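
To illustrate the point about prediction, here is a small sketch, assuming scikit-learn's DBSCAN estimator as used above. As far as I know, it offers fit_predict, which labels the same samples it was fitted on, but no separate predict method for new samples; the hasattr check is just one way to see that.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.5, min_samples=5)

# fit_predict clusters X and returns the labels of those same samples
# (the same values that end up in db.labels_).
labels = db.fit_predict(X)

# Check whether the estimator exposes a predict method for unseen samples.
has_predict = hasattr(db, "predict")
print(labels.shape, has_predict)
```

So there is no fitted model to apply to a held-out test set in the supervised sense; the output of the algorithm is simply the label assigned to each sample in X.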

The number of clusters found (2 here) may of course change depending on the model parameters eps and min_samples, so experiment with them.
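
One possible way to run such an experiment is sketched below. The grid of eps values is an arbitrary assumption chosen for illustration, and min_samples is kept at 5; the loop reports how many clusters and noise points DBSCAN finds for each setting on the standardized iris data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Hypothetical grid of eps values; min_samples stays at 5 throughout.
eps_values = [0.3, 0.5, 0.7, 0.9]

results = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Number of clusters, ignoring the noise label (-1) if present.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results.append((eps, n_clusters, n_noise))
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise points")
```

With min_samples fixed, a larger eps can only turn noise points into cluster members, so the noise count should not increase along the grid; the number of clusters, however, can go either up or down.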
