Scikit-Learn One-hot-encode before or after train/test split

Problem Description

I am looking at two scenarios for building a model using scikit-learn, and I cannot figure out why one of them returns a result so fundamentally different from the other. The only difference between the two cases (that I know of) is that in one case I one-hot-encode the categorical variables all at once (on the whole data set) and then split into training and test. In the second case I split into training and test first and then one-hot-encode both sets based on the training data.

The latter case is technically better for judging the generalization error of the process, but it returns a normalized Gini that is dramatically different (and bad: essentially no model) compared to the first case. I know the first case's Gini (~0.33) is in line with a model built on this data.

Why is the second case returning such a different Gini? FYI, the data set contains a mix of numeric and categorical variables.

Method 1 (one-hot encode the entire data set, then split). This returns: Validation Sample Score: 0.3454355044 (normalized gini).

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
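    # Sort the samples by predicted value (ties broken by original position), then sum the gap
    # between the cumulative share of actual target values (the Lorentz curve) and a random
    # ordering; normalized_gini below divides this by the gini of a perfect ranking.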
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)



if __name__ == '__main__':

    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)


    folds=train_test_split(range(len(y)),test_size=0.30, random_state=15) #30% test

    #First one hot and make a pandas df
    dat_dict=dat.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    dat= vectorizer.transform( dat_dict )
    dat=pd.DataFrame(dat)


    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))

Method 2 (split first, then one-hot encode). This returns: Validation Sample Score: 0.0055124452 (normalized gini).

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1],-x[2]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini

# Normalized Gini Scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)



if __name__ == '__main__':

    dat=pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv',sep=",")
    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)


    folds=train_test_split(range(len(y)),test_size=0.3, random_state=15) #30% test

    #first split
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]

    #One hot encode the training X and transform the test X
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X= vectorizer.transform( dat_dict )
    train_X=pd.DataFrame(train_X)

    dat_dict=test_X.T.to_dict().values()
    test_X= vectorizer.transform( dat_dict )
    test_X=pd.DataFrame(test_X)


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y,y_submission)))

Answer

While the previous comments correctly suggest it is best to map over your entire feature space first, in your case both the Train and Test sets contain all of the feature values in all of the columns.

If you compare vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in the mapping. Hence, it cannot be what is causing the problem.
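
For completeness, here is a minimal sketch of that vocabulary_ check on toy data (the column names are hypothetical): fit one DictVectorizer on the full data and another on the training rows only, then compare the learned mappings.

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

toy = pd.DataFrame({'cat': ['A', 'B', 'A', 'B'], 'num': [1.0, 2.0, 3.0, 4.0]})
train_rows = toy.iloc[[2, 1]]  # a shuffled training subset that still contains every category

v_full = DictVectorizer(sparse=False).fit(toy.to_dict(orient='records'))
v_train = DictVectorizer(sparse=False).fit(train_rows.to_dict(orient='records'))

# When every category appears in the training rows, the two mappings are identical,
# so the encoding itself is not what differs between Method 1 and Method 2.
print(v_full.vocabulary_ == v_train.vocabulary_)  # True for this toy data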

The reason Method 2 fails is that your dat_dict gets re-sorted by the original index when you execute this command:

dat_dict=train_X.T.to_dict().values()

In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order is re-sorted into the numerical order of the original index. This causes your Train and Test data to become completely de-correlated from y.
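
If you want to verify whether this re-ordering happens on your own pandas/Python version, one quick sanity check is to compare a numeric column before and after the round trip through to_dict(), using the vectorizer fitted in Method 2 (the column name 'num_col' below is hypothetical; substitute any numeric column from your data):

before = train_X['num_col'].values                          # row order of the shuffled split
after = vectorizer.transform(train_X.T.to_dict().values())  # same round trip as Method 2
col_idx = vectorizer.vocabulary_['num_col']                 # column position assigned to that feature
print((before == after[:, col_idx]).all())                  # False if the rows were re-sorted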

Method 1 doesn't suffer from this problem, because you shuffle the data after the mapping.

You can fix the issue by adding .reset_index(drop=True) both times you assign dat_dict in Method 2, e.g.,

dat_dict=train_X.reset_index(drop=True).T.to_dict().values()

This ensures the data order is preserved when converting to a dict.
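
As an aside (a sketch, not part of the original fix): another way to sidestep the ordering issue is to build the record dicts row by row with to_dict(orient='records'), which walks the DataFrame in its current row order, so no transpose and no reset_index() are needed.

vectorizer = DV(sparse=False)
train_X = pd.DataFrame(vectorizer.fit_transform(train_X.to_dict(orient='records')))
test_X = pd.DataFrame(vectorizer.transform(test_X.to_dict(orient='records')))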

When I add that bit of code, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)
