Optimal Feature Selection Technique after PCA?


Problem Description


I'm implementing a classification task with a binary outcome using RandomForestClassifier, and I know the importance of data preprocessing for improving the accuracy score. In particular, my dataset contains more than 100 features and almost 4000 instances, and I want to apply a dimensionality reduction technique to avoid overfitting, since there is a high amount of noise in the data.

For these tasks I usually use a classical feature selection method (filters, wrappers, feature importances), but I recently read about combining Principal Component Analysis (PCA) as a first step and then performing feature selection on the transformed dataset.

My question is the following: is there a specific feature selection method that I should use after performing PCA on my data? In particular, what I want to understand is whether applying PCA to my data makes some particular feature selection techniques useless or less effective.
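
Concretely, the kind of pipeline I have in mind looks roughly like the sketch below; the choice of selector (SelectKBest with mutual_info_classif) and the component counts are only placeholders, not a method I'm committed to:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# PCA first, then feature selection on the principal components, then the classifier
pipe = Pipeline([
    ("pca", PCA(n_components=50)),                                   # placeholder component count
    ("select", SelectKBest(score_func=mutual_info_classif, k=20)),  # placeholder selector
    ("clf", RandomForestClassifier(random_state=42)),
])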

Solution

Let's begin with: when should we use PCA?

PCA is most useful when you are not sure which components of your data affect the accuracy.

Let's think about the face-recognition task. Can we tell which pixels are the most crucial at a glance?

For instance, the Olivetti faces dataset: 40 people, a dark homogeneous background, varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses).

So if we look at the correlations between the pixels:

from sklearn.datasets import fetch_olivetti_faces
from numpy import corrcoef
from numpy import zeros_like
from numpy import triu_indices_from
from matplotlib.pyplot import figure
from matplotlib.pyplot import get_cmap
from matplotlib.pyplot import plot
from matplotlib.pyplot import colorbar
from matplotlib.pyplot import subplots
from matplotlib.pyplot import suptitle
from matplotlib.pyplot import imshow
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import savefig
from matplotlib.image import imread
import seaborn


olivetti = fetch_olivetti_faces()

X = olivetti.images  # Train
y = olivetti.target  # Labels

X = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))  # flatten each 64x64 image into a 4096-dimensional vector

seaborn.set(font_scale=1.2)
seaborn.set_style("darkgrid")
# Mask the upper triangle of the symmetric correlation matrix
mask = zeros_like(corrcoef(X))
mask[triu_indices_from(mask)] = True
with seaborn.axes_style("white"):
    f, ax = subplots(figsize=(20, 15))
    ax = seaborn.heatmap(corrcoef(X), 
                         annot=True, 
                         mask=mask, 
                         vmax=1,
                         vmin=0,
                         square=True, 
                         cmap="YlGnBu",
                         annot_kws={"size": 1})
    
savefig('heatmap.png')

From the heatmap above, can you tell me which pixels are the most important for the classification?

However, if I ask you, "Could you please tell me the most important features for chronic kidney disease?"

You could tell me at a glance, because there are far fewer features to inspect.

Coming back to the face-recognition task: do we really need all the pixels for the classification?

No, we don't.

Above you can see that only 63 pixels are enough to recognize an image as a human face.

Please note that 63 pixels suffice to recognize something as a face, not for face recognition. You need more pixels to discriminate between faces.

So what we do is reduce the dimensionality. You might want to read more about the curse of dimensionality.

OK, so we decide to use PCA, since we don't need every pixel of the face image. We have to reduce the dimensionality.

To make it visually understandable, I'm using 2 dimensions.

def projection(obj, x, x_label, y_label, title, class_num=40, sample_num=10, dpi=300):
    # Transform the data with the fitted estimator and scatter-plot its first two components
    # (the global label array `y` is used for the point colors)
    x_obj = obj.transform(x)
    idx_range = class_num * sample_num
    fig = figure(figsize=(6, 3), dpi=dpi)
    ax = fig.add_subplot(1, 1, 1)
    c_map = get_cmap(name='jet', lut=class_num)
    scatter = ax.scatter(x_obj[:idx_range, 0], x_obj[:idx_range, 1], c=y[:idx_range],
                         s=10, cmap=c_map)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.set_title(title.format(class_num))
    colorbar(mappable=scatter)
    


from sklearn.decomposition import PCA

pca_obj = PCA(n_components=2).fit(X)
x_label = "First Principal Component"
y_label = "Second Principal Component"
title = "PCA Projection of {} people"
projection(obj=pca_obj, x=X, x_label=x_label, y_label=y_label, title=title)

As you can see, PCA with 2 components isn't sufficient to discriminate.

So how many components do you need?

def display_n_components(obj):
    # Scree plot: explained variance of each principal component
    figure(1, figsize=(6, 3), dpi=300)
    plot(obj.explained_variance_, linewidth=2)
    xlabel('Components')
    ylabel('Explained Variances')


pca_obj2 = PCA().fit(X)
display_n_components(pca_obj2)

You need 100 components for good discrimination.
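
If you prefer to pick that number programmatically instead of eyeballing the plot, one common heuristic (an assumption here, not part of the original answer) is to keep enough components to explain, say, 95% of the variance:

from numpy import cumsum, argmax

# Cumulative share of variance explained by the first k principal components
cum_var = cumsum(pca_obj2.explained_variance_ratio_)

# Smallest k that retains at least 95% of the variance (the 95% threshold is arbitrary)
n_components_95 = argmax(cum_var >= 0.95) + 1
print("Components needed for 95% variance:", n_components_95)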

Now we need to split the train and test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# X was already flattened to (n_samples, 4096) above, so X_train and X_test need no further reshaping

pca = PCA(n_components=100).fit(X)  # note: fitting on X_train alone would avoid using test images to learn the projection
X_pca_tr = pca.transform(X_train)
X_pca_te = pca.transform(X_test)

forest1 = RandomForestClassifier(random_state=42)
forest1.fit(X_pca_tr, y_train)
y_pred = forest1.predict(X_pca_te)
print("\nAccuracy:{:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred_)*100))

Running this prints the accuracy obtained with PCA.

You might wonder: does PCA improve the accuracy?

The answer is Yes.

Without PCA:
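
Here is a minimal sketch of that baseline, reusing the train/test split from above; treat it as an illustration rather than the exact code behind the reported number:

# Baseline: the same classifier on the raw flattened pixels, without PCA
forest2 = RandomForestClassifier(random_state=42)
forest2.fit(X_train, y_train)
y_pred_raw = forest2.predict(X_test)
print("\nAccuracy: {:,.2f}%".format(accuracy_score(y_true=y_test, y_pred=y_pred_raw) * 100))

Comparing the two printed accuracies shows the effect of the PCA step on this dataset.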
