Dimension of data before and after performing PCA


Question


I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.


After removing labels from the training data, I add each row in CSV into a list like this:

# csv here is a csv.reader over the training file, with the label column already removed
for row in csv:
    train_data.append(np.array(row, dtype=np.int64))


I do the same for the test data.
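For completeness, here is a self-contained sketch of that loading step; the file name, the header row, and the way the label column is split off are assumptions about the Kaggle CSV layout, not part of the original code:

import csv
import numpy as np

train_data, labels = [], []
with open('train.csv') as f:                         # assumed file name
    reader = csv.reader(f)
    next(reader)                                     # skip the header row
    for row in reader:
        labels.append(int(row[0]))                   # first column is the digit label
        train_data.append(np.array(row[1:], dtype=np.int64))

The test file is read the same way, except that it has no label column.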


I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):

import numpy as np
from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert the list of rows to a 2-D array (samples x features)
    train_data = np.asarray(train_data)

    # fit PCA on the training data only, then project both sets onto the same basis
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)


I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
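A minimal sketch of that classification step, assuming the labels list from the loading sketch above and scikit-learn's KNeighborsClassifier with its default settings:

from sklearn.neighbors import KNeighborsClassifier

X_train, X_test = preprocess(train_data, test_data, pca_components=100)

knn = KNeighborsClassifier()          # default n_neighbors=5; the actual setting is an assumption
knn.fit(X_train, labels)
predictions = knn.predict(X_test)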


Using this method I can get around 97% accuracy.


My question is about the dimensionality of the data before and after PCA is performed.

What are the dimensions of train_data and X_train?


How does the number of components influence the dimensionality of the output? Are they the same thing?

Answer


The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.


The pca_components parameter tells the algorithm how many of the best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
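If you want to see how much of the variance those 100 vectors actually account for, scikit-learn exposes it directly on the fitted PCA object; a short sketch, reusing train_data from above:

from sklearn import decomposition

pca = decomposition.PCA(n_components=100).fit(train_data)

# one entry per retained basis vector, ordered from most to least variance explained
print(pca.explained_variance_ratio_[:5])

# total fraction of the original variance kept by the 100 components
print(pca.explained_variance_ratio_.sum())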


The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the 100 best vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As Jaime correctly pointed out in the comments, this is equivalent to projecting the data onto the new basis.
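The "rotate and project" picture can be checked numerically: assuming the fitted pca object from the sketch above (and the default whiten=False), pca.transform is, up to floating-point error, the same as centering the data and multiplying by the chosen basis vectors:

import numpy as np

X_manual = (np.asarray(train_data) - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(train_data)))   # True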


For the 3D case, if you wanted to get a basis formed of the first 2 eigenvectors, then again, the 3D point cloud would first be rotated, so that the most variance would be parallel to the coordinate axes. Then, the axis where the variance is smallest is discarded, leaving you with 2D data.


So, to answer your question directly: yes, the number of desired PCA components is the dimensionality of the output data (after the transformation).
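Concretely, for the Kaggle digit-recognizer training set (the 42000 x 784 figure below is the usual size of that file and is shown only for illustration), reusing train_data and X_train from above:

import numpy as np

print(np.asarray(train_data).shape)   # e.g. (42000, 784): one row per image, one column per pixel
print(X_train.shape)                  # (42000, 100): same rows, pca_components columns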

