Normalize PCA with scikit-learn when data is split


Question

I have a follow-up to an earlier question about using PCA with scikit-learn.

I'm creating an emotion detection system, and what I do now is:

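  1. Split the data over all emotions (distribute the data over multiple subsets).
  2. Add all the data together (merge the multiple subsets into one set).
  3. Get the PCA parameters of the combined data (self.pca = RandomizedPCA(n_components=self.n_components, whiten=True).fit(self.data)).
  4. Per emotion (per subset), apply the PCA to the data of that emotion (subset).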
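
The four steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not the asker's actual code: the emotion names, the random feature data, and the variable names are hypothetical, and PCA with svd_solver="randomized" stands in for RandomizedPCA, which recent scikit-learn releases no longer provide.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Step 1: one subset of samples per emotion (hypothetical names, random data).
subsets = {
    "happy": rng.random((30, 10)),
    "sad": rng.random((25, 10)),
    "angry": rng.random((20, 10)),
}

# Step 2: stack all subsets into one array and scale each row to unit L2 norm.
combined = normalize(np.vstack(list(subsets.values())))

# Step 3: fit the PCA parameters once, on the combined data.
pca = PCA(n_components=5, whiten=True, svd_solver="randomized", random_state=0)
pca.fit(combined)

# Step 4: normalize each subset the same way and project it with the fitted PCA.
projected = {name: pca.transform(normalize(x)) for name, x in subsets.items()}

for name, z in projected.items():
    print(name, z.shape)
```

The PCA is fitted once on the combined data and only applied (never refitted) per subset, so every emotion ends up in the same projected space.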

I should do the normalization at two points: in step 2, normalize all the combined data, and in step 4, normalize the subsets.

I was wondering whether normalizing over all the data and normalizing over a subset give the same result. When I tried to simplify my example, as @BartoszKP suggested, I found that my understanding of how the normalization works was wrong. The normalization works the same way in both cases, so this is a valid way to do it, right? (See the code.)

from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
import numpy as np

data_1 = np.array(([52, 254], [4, 128]), dtype='f')
data_2 = np.array(([39, 213], [123, 7]), dtype='f')
data_combined = np.vstack((data_1, data_2))
#print(data_combined)
"""
Output
[[  52.  254.]
 [   4.  128.]
 [  39.  213.]
 [ 123.    7.]]
"""
# Normalize all data (each row is scaled to unit L2 norm)
data_norm = normalize(data_combined)
print(data_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]
 [ 0.18010448  0.98364753]
 [ 0.99838448  0.05681863]]
"""

# PCA with the randomized solver, which is what RandomizedPCA provided before
# it was removed from scikit-learn. n_components cannot exceed
# min(n_samples, n_features), which is 2 for this toy data.
pca = PCA(n_components=2, whiten=True, svd_solver='randomized')
pca.fit(data_norm)

# Normalize a subset of the data
data_1_norm = normalize(data_1)
print(data_1_norm)
"""
[[ 0.20056452  0.97968054]
 [ 0.03123475  0.99951208]]
"""
pca.transform(data_1_norm)

Recommended answer

Yes. As explained in the documentation, normalize scales each sample individually, independently of the others:

Normalization is the process of scaling individual samples to have unit norm.
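
To make "unit norm" concrete, here is a small sketch that scales each row by hand and compares the result with normalize. It reuses data_1 from the question; the names row_norms and manual are introduced only for illustration.

```python
import numpy as np
from sklearn.preprocessing import normalize

data_1 = np.array([[52, 254], [4, 128]], dtype="f")

# Divide each row by its own Euclidean (L2) length.
row_norms = np.linalg.norm(data_1, axis=1, keepdims=True)
manual = data_1 / row_norms

# normalize() defaults to norm="l2" and axis=1, i.e. the same per-row scaling.
matches = np.allclose(manual, normalize(data_1))
print(matches)

# After scaling, every row has length 1 (up to floating-point rounding).
print(np.linalg.norm(manual, axis=1))
```

Only values from the same row enter each division, which is why the other rows of the array have no influence on the result.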

And from the documentation of the Normalizer class:

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.

(Emphasis mine.)
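
A quick way to check this on the data from the question is sketched below: it normalizes the combined array and the subset separately and compares the matching rows. The variable names from_combined, from_subset, same, and same_with_class are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

data_1 = np.array([[52, 254], [4, 128]], dtype="f")
data_2 = np.array([[39, 213], [123, 7]], dtype="f")
data_combined = np.vstack((data_1, data_2))

# Normalize everything at once, then take the rows that came from data_1 ...
from_combined = normalize(data_combined)[: len(data_1)]
# ... and compare them with data_1 normalized on its own.
from_subset = normalize(data_1)
same = np.allclose(from_combined, from_subset)
print(same)

# Normalizer is the estimator form of the same operation. It is stateless,
# so fitting it on the combined data does not change how data_2 is scaled.
transformer = Normalizer(norm="l2").fit(data_combined)
same_with_class = np.allclose(transformer.transform(data_2), normalize(data_2))
print(same_with_class)
```

This equivalence is specific to per-sample scaling. It would not hold for column-wise scalers such as StandardScaler, whose statistics depend on which rows they are fitted on.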
