classification: PCA and logistic regression using sklearn
Question
I have a classification problem, ie I want to predict a binary target based on a collection of numerical features, using logistic regression, and after running a Principal Components Analysis (PCA).
I have 2 datasets: df_train and df_valid (training set and validation set respectively) as pandas data frames, containing the features and the target. As a first step, I have used the get_dummies pandas function to transform all the categorical variables into booleans. For example, I would have:
import numpy as np
import pandas as pd

n_train = 10
np.random.seed(0)
df_train = pd.DataFrame({"f1": np.random.random(n_train),
                         "f2": np.random.random(n_train),
                         "f3": np.random.randint(0, 2, n_train).astype(bool),
                         "target": np.random.randint(0, 2, n_train).astype(bool)})
In [36]: df_train
Out[36]:
f1 f2 f3 target
0 0.548814 0.791725 False False
1 0.715189 0.528895 True True
2 0.602763 0.568045 False True
3 0.544883 0.925597 True True
4 0.423655 0.071036 True True
5 0.645894 0.087129 True False
6 0.437587 0.020218 True True
7 0.891773 0.832620 True False
8 0.963663 0.778157 False False
9 0.383442 0.870012 True True
n_valid = 3
np.random.seed(1)
df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                         "f2": np.random.random(n_valid),
                         "f3": np.random.randint(0, 2, n_valid).astype(bool),
                         "target": np.random.randint(0, 2, n_valid).astype(bool)})
In [44]: df_valid
Out[44]:
f1 f2 f3 target
0 0.417022 0.302333 False False
1 0.720324 0.146756 True False
2 0.000114 0.092339 True True
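The get_dummies step isn't visible in the toy frames above, since f3 is already boolean. As a minimal sketch of what that step does, assuming a hypothetical categorical column named "color" (not part of the original data):

```python
import pandas as pd

# Hypothetical raw frame with a string-valued categorical column
raw = pd.DataFrame({"f1": [0.1, 0.2, 0.3],
                    "color": ["red", "blue", "red"]})

# get_dummies expands "color" into one boolean column per category
encoded = pd.get_dummies(raw, columns=["color"], dtype=bool)
print(encoded.columns.tolist())  # ['f1', 'color_blue', 'color_red']
```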
I would now like to apply PCA to reduce the dimensionality of my problem, then use LogisticRegression from sklearn to train and get predictions on my validation set, but I'm not sure the procedure I follow is correct. Here is what I do:
Step 1: PCA
The idea is that I need to transform both my training and validation set the same way with PCA. In other words, I cannot perform PCA on each set separately; otherwise, they would be projected onto different eigenvectors.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # assume we keep 2 components, but it doesn't matter
newdf_train = pca.fit_transform(df_train.drop("target", axis=1))
newdf_valid = pca.transform(df_valid.drop("target", axis=1))  # not sure if this is right
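One way to convince yourself that fit_transform on the training set and transform on the validation set really do share the same projection: PCA's transform is just centering by the training mean and projecting onto the fitted components. A self-contained sketch with toy frames standing in for df_train / df_valid (hypothetical values):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy frames standing in for the question's feature columns
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.random((10, 3)), columns=["f1", "f2", "f3"])
X_valid = pd.DataFrame(rng.random((3, 3)), columns=["f1", "f2", "f3"])

pca = PCA(n_components=2)
Z_train = pca.fit_transform(X_train)  # components learned from the training set only
Z_valid = pca.transform(X_valid)      # the validation set reuses those same components

# Redoing the projection by hand matches transform() exactly
manual = (X_valid - pca.mean_) @ pca.components_.T
print(np.allclose(Z_valid, manual))   # True
print(pca.explained_variance_ratio_)  # variance captured by each kept component
```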
Step 2: Logistic Regression
It's not necessary, but I prefer to keep things as dataframes:
features_train = pd.DataFrame(newdf_train)
features_valid = pd.DataFrame(newdf_valid)
Now I run the logistic regression:
from sklearn.linear_model import LogisticRegression
cls = LogisticRegression()
cls.fit(features_train, df_train["target"])
predictions = cls.predict(features_valid)
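To check how those predictions do against the held-out labels, scikit-learn's accuracy_score can be used. A self-contained version of the two steps above, rebuilt on the question's toy data (the exact accuracy value depends on the random seed, so it isn't asserted here):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Rebuild the toy train/valid frames from the question
np.random.seed(0)
n_train = 10
df_train = pd.DataFrame({"f1": np.random.random(n_train),
                         "f2": np.random.random(n_train),
                         "f3": np.random.randint(0, 2, n_train).astype(bool),
                         "target": np.random.randint(0, 2, n_train).astype(bool)})
np.random.seed(1)
n_valid = 3
df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                         "f2": np.random.random(n_valid),
                         "f3": np.random.randint(0, 2, n_valid).astype(bool),
                         "target": np.random.randint(0, 2, n_valid).astype(bool)})

# Same two steps as above: PCA fitted on train, logistic regression on top
pca = PCA(n_components=2)
features_train = pca.fit_transform(df_train.drop("target", axis=1))
features_valid = pca.transform(df_valid.drop("target", axis=1))

cls = LogisticRegression()
cls.fit(features_train, df_train["target"])
predictions = cls.predict(features_valid)

# Fraction of validation rows predicted correctly
acc = accuracy_score(df_valid["target"], predictions)
print(acc)
```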
I think step 2 is correct, but I have more doubts about step 1: is this the way I'm supposed to chain PCA and then a classifier?
Recommended answer
Use a Pipeline. It chains the two steps and guarantees that the PCA fitted on the training data is the one applied to the validation data:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(n_components=2)
cls = LogisticRegression()

pipe = Pipeline([('pca', pca), ('logistic', cls)])
# Fit on the raw features: the pipeline applies PCA internally,
# so pass the untransformed data, not the already-reduced features.
pipe.fit(df_train.drop("target", axis=1), df_train["target"])
predictions = pipe.predict(df_valid.drop("target", axis=1))
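A further benefit of the Pipeline (not in the original answer, shown here as an illustrative sketch on synthetic data) is that the number of PCA components can be tuned by cross-validation together with the classifier, addressing step parameters as "&lt;step name&gt;__&lt;parameter&gt;":

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the question's features and binary target
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipe = Pipeline([('pca', PCA()), ('logistic', LogisticRegression())])

# Pipeline step parameters are addressed as "<step name>__<parameter>"
param_grid = {'pca__n_components': [2, 3, 4]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # the n_components value that cross-validated best
```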