Plot specific points in DBSCAN in sklearn in python


Question

I have a set of documents, and I create a feature matrix from them. Then I calculate the cosine similarity between the documents and pass the resulting cosine distance matrix to the DBSCAN algorithm. My code is as follows.

import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN

# Initialize some documents
doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
doc5 = {'Science':0.2, 'Weather':0.7, 'Art':0.8, 'Sports':0.9}
doc6 = {'Science':0.2, 'Weather':0.8, 'Art':0.8, 'Sports':1.0}
collection = [doc1, doc2, doc3, doc4, doc5, doc6]
df = pd.DataFrame(collection)
# Fill missing values with zeros
df.fillna(0, inplace=True)
# Get Feature Vectors
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
feature_matrix = df.to_numpy()
print(feature_matrix.tolist())

# Get cosine distance between pairs
sims = pairwise_distances(feature_matrix, metric='cosine')

# Fit DBSCAN
db = DBSCAN(min_samples=1, metric='precomputed').fit(sims)

Now I plot the clusters as shown in the DBSCAN demo of sklearn, except that instead of X I pass sims, which is my cosine distance matrix.

labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
#print(labels)

# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = sims[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = sims[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

  1. My first question: is it correct to use sims in place of X? In the sklearn demo X holds coordinate values, whereas sims holds cosine distance values.
  2. My second question: is it possible to color given points red? For example, I want the point that represents [0.8, 0.0, 0.0, 0.0, 0.2, 0.9, 0.7] in feature_matrix to be red.

Answer

First a comment about terminology:

There are two types of matrices that measure the closeness of objects in a data set:

  • Distance matrix describes pairwise distances between objects in a data set.

  • Similarity matrix describes pairwise similarities between objects in a data set.

In general, when two objects are close to each other, their distance is small but their similarity is large, so the distance matrix and the similarity matrix are in some sense opposites of each other. For example, for the cosine metric the relation between the distance matrix D and the similarity matrix S can be written as D = 1 - S.

As the sims array in the above example contains pairwise distances, it might be more appropriate to call it a dists array.
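The relation D = 1 - S can be checked numerically. The following is a minimal sketch, assuming sklearn's cosine_similarity and pairwise_distances; the small features array is made up for illustration and is not the data from the question.

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative feature vectors (hypothetical values, one row per object)
features = np.array([[0.8, 0.1, 0.0],
                     [0.7, 0.2, 0.1],
                     [0.0, 0.9, 0.4]])

S = cosine_similarity(features)                     # similarity: 1 on the diagonal
D = pairwise_distances(features, metric='cosine')   # distance: 0 on the diagonal

# The two matrices agree up to floating point rounding
print(np.allclose(D, 1 - S))

The first two rows point in almost the same direction, so their entry in S is close to 1 and their entry in D is close to 0.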


My first question: is it correct to use sims in place of X? In the sklearn demo X holds coordinate values, whereas sims holds cosine distance values.

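No. If you are plotting your data on a 2-dimensional plane, the plotting function needs a 2-dimensional coordinate array as input. A distance matrix will not suffice: sims[:, 0] and sims[:, 1] hold each document's cosine distance to doc1 and doc2, not x and y coordinates.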

If your data has more than two dimensions, you can obtain a 2-dimensional representation of it via a dimensionality reduction technique. Sklearn contains many useful dimensionality reduction algorithms in the sklearn.manifold and sklearn.decomposition modules. The choice of algorithm usually depends on the nature of the data and might need some experimentation.

In sklearn, most dimensionality reduction methods accept the feature (or coordinate) vectors as input. Some also accept a distance or a similarity matrix (check the documentation; a good hint is that the keyword precomputed is mentioned somewhere). Be careful not to use a similarity matrix where a distance matrix is required, and vice versa.
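As an illustration of the distinction, here is a hedged sketch in which DBSCAN receives a precomputed distance matrix while KernelPCA receives a precomputed similarity matrix. KernelPCA is only one possible choice: cosine similarity is a valid kernel because it is the inner product of normalized vectors. The features array and eps=0.3 are assumptions made for this example, not values from the question.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative feature vectors (hypothetical values, one row per document)
features = np.array([[0.8, 0.05, 0.15, 0.1, 0.0],
                     [0.8, 0.10, 0.05, 0.0, 0.1],
                     [0.0, 0.00, 0.10, 0.1, 0.8],
                     [0.1, 0.00, 0.00, 0.1, 0.7],
                     [0.2, 0.00, 0.00, 0.9, 0.8]])

S = cosine_similarity(features)     # similarity matrix
D = np.clip(1.0 - S, 0.0, None)     # distance matrix; clip tiny negatives from rounding

# DBSCAN with metric='precomputed' expects distances (eps=0.3 is an arbitrary choice)
labels = DBSCAN(eps=0.3, min_samples=1, metric='precomputed').fit(D).labels_

# KernelPCA with kernel='precomputed' expects a kernel, i.e. a similarity matrix
coords = KernelPCA(n_components=2, kernel='precomputed').fit_transform(S)

print(labels)
print(coords.shape)

The resulting coords array has one (x, y) row per document and can be passed to the plotting code in place of X, while the cluster labels still come from the cosine distances.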


My second question: is it possible to color given points red? For example, I want the point that represents [0.8, 0.0, 0.0, 0.0, 0.2, 0.9, 0.7] in feature_matrix to be red.

Question 2 is a bit different and mostly deals with matplotlib.

I assume one knows beforehand which points will be painted red. There is an array called red_points in the code below that should contain the indices of the red points. So if for example doc2 and doc5 should be painted red, one would set red_points = [1, 4] (indices start at zero).
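If only the feature vector is known, its row index can be looked up first. This is a minimal sketch under one loud assumption: the columns are sorted alphabetically (Art, History, News, Politics, Science, Sports, Weather), which is the layout the vector in the question corresponds to. Recent pandas versions keep the dict insertion order instead, hence the explicit sort.

import numpy as np
import pandas as pd

doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
doc5 = {'Science':0.2, 'Weather':0.7, 'Art':0.8, 'Sports':0.9}
doc6 = {'Science':0.2, 'Weather':0.8, 'Art':0.8, 'Sports':1.0}
collection = [doc1, doc2, doc3, doc4, doc5, doc6]

df = pd.DataFrame(collection).fillna(0)
# Assumption: alphabetical column order, so rows line up with the vector in the question
df = df[sorted(df.columns)]
feature_matrix = df.to_numpy()

# The vector whose point should be painted red
target = np.array([0.8, 0.0, 0.0, 0.0, 0.2, 0.9, 0.7])

# A row matches when every component is (numerically) equal to the target
matches = np.all(np.isclose(feature_matrix, target), axis=1)
red_points = np.flatnonzero(matches).tolist()
print(red_points)  # [4], i.e. doc5

np.isclose is used rather than == so that small floating point differences do not prevent a match.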

To visualize the clusters, dimensionality reduction is done with principal component analysis (PCA), which is one of the most straightforward methods for such a task. Note that I do not compute the distance matrix at all but apply both DBSCAN and PCA directly to feature_matrix, so DBSCAN here uses its default Euclidean metric rather than the cosine distance.

import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN

# Initialize some documents
doc1 = {'Science':0.8, 'History':0.05, 'Politics':0.15, 'Sports':0.1}
doc2 = {'News':0.2, 'Art':0.8, 'Politics':0.1, 'Sports':0.1}
doc3 = {'Science':0.8, 'History':0.1, 'Politics':0.05, 'News':0.1}
doc4 = {'Science':0.1, 'Weather':0.2, 'Art':0.7, 'Sports':0.1}
doc5 = {'Science':0.2, 'Weather':0.7, 'Art':0.8, 'Sports':0.9}
doc6 = {'Science':0.2, 'Weather':0.8, 'Art':0.8, 'Sports':1.0}
collection = [doc1, doc2, doc3, doc4, doc5, doc6]
df = pd.DataFrame(collection)
# Fill missing values with zeros
df.fillna(0, inplace=True)
# Get Feature Vectors
# (DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
feature_matrix = df.to_numpy()

# Fit DBSCAN
db = DBSCAN(min_samples=1).fit(feature_matrix)

# Work on a copy so that the -2 marker set below does not overwrite db.labels_
labels = db.labels_.copy()
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Plot result
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce the feature matrix to two dimensions with PCA
X = PCA(n_components=2).fit_transform(feature_matrix)

# Select which points will be painted red
red_points = [1, 4]
for i in red_points:
    labels[i] = -2

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    if k == -2:
        # Red for selected points
        col = [1, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

The left figure is for the case where red_points is empty, the right figure for red_points = [1, 4].
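The code above marks the red points by replacing their cluster label with -2. An alternative, sketched here with a hypothetical helper named plot_with_highlight, keeps the labels untouched and selects the red points with a boolean mask instead. It assumes X is an (n, 2) coordinate array and labels is the array of cluster labels.

import numpy as np
import matplotlib.pyplot as plt

def plot_with_highlight(X, labels, red_points, ax=None):
    """Scatter 2-D points colored by cluster label, drawing red_points in red."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    if ax is None:
        _, ax = plt.subplots()

    # Boolean mask that is True only for the selected indices
    highlight = np.zeros(len(labels), dtype=bool)
    highlight[np.asarray(red_points, dtype=int)] = True

    # Regular points: the color encodes the cluster label
    ax.scatter(X[~highlight, 0], X[~highlight, 1], c=labels[~highlight],
               cmap='Spectral', edgecolors='k', s=150)
    # Selected points: fixed red; the labels array is not modified
    ax.scatter(X[highlight, 0], X[highlight, 1], c='red', edgecolors='k', s=150)
    return ax, highlight

With the variables from the code above, plot_with_highlight(X, db.labels_, [1, 4]) followed by plt.show() would draw doc2 and doc5 in red while every point keeps its cluster label.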
