Plotting the decision boundary for high-dimensional data

Question

I am building a model for a binary classification problem in which each data point has 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well.

I would like to plot the decision boundary of the model. How can I do that?

To get a sense of the data, I am plotting it in 2D using TSNE. I reduce the dimensionality of the data in two steps: from 300 to 50, then from 50 to 2 (this is a common recommendation). Below is the code snippet for this:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD

# X_train is the (n_samples, 300) training matrix
X_Train_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_train)
X_Train_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_Train_reduced)

# (omitted) convert the embedded points into two DataFrames,
# df_train_neg and df_train_pos, depending on the label

# plot the negative points and positive points
plt.scatter(df_train_neg.val1, df_train_neg.val2, marker='o', c='red')
plt.scatter(df_train_pos.val1, df_train_pos.val2, marker='x', c='green')

I get a decent graph.

Is there a way to add a decision boundary to this plot that represents the actual decision boundary of my model in the 300-dimensional space?

Recommended answer

One way is to impose a Voronoi tessellation on your 2D plot, i.e. color each region of the background according to the 2D data point nearest to it, using a different color for each predicted class label. See the paper by Migut et al., 2015.

This is a lot easier than it sounds if you use a meshgrid and scikit-learn's KNeighborsClassifier. Below is an end-to-end example with the Iris dataset; replace the first few lines with your own data and model:

import numpy as np, matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression

# replace the below by your data and model
iris = load_iris()
X,y = iris.data, iris.target
X_Train_embedded = TSNE(n_components=2).fit_transform(X)
print(X_Train_embedded.shape)
model = LogisticRegression().fit(X,y)
y_predicted = model.predict(X)
# replace the above by your data and model

# create meshgrid
resolution = 100 # 100x100 background pixels
X2d_xmin, X2d_xmax = np.min(X_Train_embedded[:,0]), np.max(X_Train_embedded[:,0])
X2d_ymin, X2d_ymax = np.min(X_Train_embedded[:,1]), np.max(X_Train_embedded[:,1])
xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))

# approximate Voronoi tessellation on a resolution x resolution grid using 1-NN
background_model = KNeighborsClassifier(n_neighbors=1).fit(X_Train_embedded, y_predicted) 
voronoiBackground = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
voronoiBackground = voronoiBackground.reshape((resolution, resolution))

#plot
plt.contourf(xx, yy, voronoiBackground)
plt.scatter(X_Train_embedded[:,0], X_Train_embedded[:,1], c=y)
plt.show()
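For the setup in the question (a binary PassiveAggressiveClassifier on 300 features, reduced with TruncatedSVD and then TSNE), the same idea might look roughly like the sketch below. The data is a synthetic stand-in generated with make_classification, and the helper name voronoi_background is made up for illustration; neither comes from the question or the answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for the asker's data (200 points, 300 features, 2 classes).
X_train, y_train = make_classification(
    n_samples=200, n_features=300, n_informative=20, random_state=0
)

# The model whose boundary we want to visualise, trained in the full 300-D space.
model = PassiveAggressiveClassifier(random_state=0).fit(X_train, y_train)
y_predicted = model.predict(X_train)

# Same two-step reduction as in the question: 300 -> 50 -> 2.
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(X_train)
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_reduced)


def voronoi_background(points_2d, labels, resolution=100):
    """Hypothetical helper: label every grid cell with the predicted class
    of the nearest embedded point (a 1-NN approximation of a Voronoi diagram)."""
    x_min, x_max = points_2d[:, 0].min(), points_2d[:, 0].max()
    y_min, y_max = points_2d[:, 1].min(), points_2d[:, 1].max()
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution),
    )
    knn = KNeighborsClassifier(n_neighbors=1).fit(points_2d, labels)
    grid = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, grid


xx, yy, background = voronoi_background(X_embedded, y_predicted)
print(background.shape, np.unique(background))
```

Passing xx, yy and background to plt.contourf and then scattering X_embedded on top, as in the Iris example, should give the colored background. The background is fitted on the model's predictions rather than the true labels, so it reflects what the 300-dimensional model decides for each training point.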

Note that this does not plot your decision boundary precisely; it only estimates roughly where the boundary should lie. In regions with few data points especially, the true boundary can deviate from it. The plot draws a line between two data points that belong to different classes and places that line midway between them. A decision boundary is indeed guaranteed to lie somewhere between those two points, but it does not have to be in the middle.
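To make the "in the middle" behaviour concrete, here is a minimal NumPy-only sketch with two made-up embedded points that have different predicted labels. The points, the labels and the helper nearest_label are all hypothetical; the helper is a brute-force stand-in for KNeighborsClassifier(n_neighbors=1).

```python
import numpy as np

# Assumption: two hypothetical embedded points with different predicted labels.
points_2d = np.array([[0.0, 0.0], [4.0, 0.0]])
predicted = np.array([0, 1])


def nearest_label(queries, points, labels):
    """Return, for each query, the label of its closest point (brute-force 1-NN)."""
    # Pairwise squared distances, shape (n_queries, n_points).
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return labels[d2.argmin(axis=1)]


# Sample the segment between the two points and find where the label flips.
xs = np.linspace(0.0, 4.0, 401)
queries = np.c_[xs, np.zeros_like(xs)]
background = nearest_label(queries, points_2d, predicted)
flip_x = xs[np.argmax(background == 1)]  # first sample coloured as class 1
print(flip_x)
```

The flip lands at about x = 2, halfway between the points, wherever the model's real boundary crosses that segment. The background is therefore most trustworthy where points of different predicted classes lie close together.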

There are also some experimental approaches that approximate the true decision boundary more closely, e.g. this one on GitHub.
