Same form of dataset has 2 different shapes
Problem description
I am quite new to machine learning and am still getting to grips with the techniques. I am trying to train models with the following classifiers, using a dataset that has 4 features plus the target feature/class (the truth value, 1 or 0).
Classifiers
- SGD Classifier
- Random Forest Classifier
- Linear Support Vector Classifier
- Gaussian Process Classifier
I am training the models on the following dataset (part of it is shown below).
Training set: train_sop_truth.csv
Subject,Predicate,Object,Computed,Truth
concept:sportsteam:hawks,concept:teamplaysincity,concept:city:atlanta,0.4255912602,1
concept:stadiumoreventvenue:honda_center,concept:stadiumlocatedincity,concept:city:anaheim,0.4276425838,1
concept:sportsteam:ducks,concept:teamplaysincity,concept:city:anaheim,0.4762486517,1
concept:sportsteam:n1985_chicago_bears,concept:teamplaysincity,concept:city:chicago,0.4106097221,1
concept:stadiumoreventvenue:philips_arena,concept:stadiumlocatedincity,concept:city:atlanta,0.4190083146,1
concept:stadiumoreventvenue:united_center,concept:stadiumlocatedincity,concept:city:chicago,0.4211134315,1
The test dataset is in a separate .csv file, test_sop_truth.csv.
Test set: test_sop_truth.csv
Subject,Predicate,Object,Computed,Truth
Nigel_Cole,isMarriedTo,Kate_Isitt,0.9350595474,1
Véra_Clouzot,isMarriedTo,Henri-Georges_Clouzot,0.4773990512,1
Norodom_Sihanouk,produced,The_Last_Days_of_Colonel_Savath,0.3942225575,1
Farouk_of_Egypt,isMarriedTo,Farida_of_Egypt,0.4276426733,1
I then checked the shape of the features for each dataset. Since I apply the same transformations to both, I expected to see the same number of features, but they differed.
Python code
import pandas as pd
import numpy as np
from termcolor import colored
from sklearn import model_selection  # provides train_test_split, used below
features = pd.read_csv('../Data/train_sop_truth.csv')
testFeatures = pd.read_csv('../Data/test_sop_truth.csv')
print(features.head(5))
print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)
print()
print(colored('\n DESCRIPTIVE STATISTICS\n','yellow'))
print(colored(features.describe(),'cyan'))
print()
print(colored(testFeatures.describe(),'cyan'))
features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)
features.iloc[:,5:].head(5)
testFeatures.iloc[:,5:].head(5)
labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])
features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)
feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)
features = np.array(features)
testFeatures = np.array(testFeatures)
train_samples = 100
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size=0.25, random_state = 42)
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
print(colored('\n TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)
print()
print(colored('\n TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)
Output
The shape of our features is: (1860, 5)
The shape of our Test features is: (1386, 5)
DESCRIPTIVE STATISTICS
Computed Truth
count 1860.000000 1860.000000
mean 0.443222 0.913441
std 0.110788 0.281264
min 0.000000 0.000000
25% 0.418164 1.000000
50% 0.427643 1.000000
75% 0.450023 1.000000
max 1.000000 1.000000
Computed Truth
count 1386.000000 1386.000000
mean 0.511809 0.992063
std 0.197954 0.088765
min 0.009042 0.000000
25% 0.418649 1.000000
50% 0.429140 1.000000
75% 0.515809 1.000000
max 1.702856 1.000000
TRAINING & TESTING SETS
Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)
TRAINING & TESTING SETS
Training Features Shape: (1039, 1790)
Training Labels Shape: (347, 1790)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)
What I do not understand is how the feature shape can differ, 1045 for features (the training set) and 1790 for testFeatures (the test set), even though both go through the same transformations and the two csv files have the same number and form of features.
Any suggestions or clarifications in this regard will be much appreciated.
Recommended answer
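get_dummies creates one indicator column for every distinct value of each categorical column (here Subject, Predicate and Object). The training and test files contain different values in those columns, so applying get_dummies to each file separately yields a different set of columns, and therefore a different number of columns (1045 versus 1790 in your output). Relative to the training set, the encoded test set will be missing some columns and will have extra ones. You need to align the test columns with the training columns after encoding: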
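To see the effect in isolation, here is a minimal sketch with two tiny, made-up DataFrames standing in for the two csv files (the values and the names train, test, train_encoded and test_encoded are invented for illustration and are not taken from the question's data):

```python
import pandas as pd

# Hypothetical miniature stand-ins for train_sop_truth.csv and test_sop_truth.csv
train = pd.DataFrame({
    "Subject": ["hawks", "ducks", "bears"],
    "Predicate": ["teamplaysincity"] * 3,
    "Computed": [0.42, 0.47, 0.41],
})
test = pd.DataFrame({
    "Subject": ["Nigel_Cole", "Farouk_of_Egypt"],
    "Predicate": ["isMarriedTo", "isMarriedTo"],
    "Computed": [0.93, 0.42],
})

# Each frame is encoded on its own, exactly as in the question
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# One indicator column per distinct string value, so the widths differ
print(train_encoded.shape)
print(test_encoded.shape)

# Columns that appear in only one of the two encoded frames
print(sorted(set(train_encoded.columns) ^ set(test_encoded.columns)))
```

Both frames start with the same three columns, yet the encoded widths differ because the distinct values differ. The helper functions in this answer deal with that by forcing the test frame onto the training columns.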
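def add_missing_dummy_columns(d, columns):
    # add every dummy column that exists in `columns` but not in d, filled with 0
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0

def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)
    # make sure we have all the columns we need
    assert set(columns) - set(d.columns) == set()
    # report columns that exist only in d; they are dropped below
    extra_cols = set(d.columns) - set(columns)
    if extra_cols:
        print("extra columns:", extra_cols)
    # keep only the training columns, in the training order
    d = d[columns]
    return d

# call this while both objects are still DataFrames, i.e. before np.array(...)
testFeatures = fix_columns(testFeatures, features.columns)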
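As an alternative to hand-written helpers, the same alignment can probably be done with pandas' DataFrame.reindex. This is only a sketch under assumptions: the two small DataFrames and the names train_encoded, test_encoded and test_aligned are made up for illustration, and dtype=int is passed to get_dummies so that the filled-in zeros match the type of the other indicator columns.

```python
import pandas as pd

# Hypothetical miniature data; the real frames come from the two csv files
train = pd.DataFrame({"Subject": ["hawks", "ducks"], "Computed": [0.42, 0.47]})
test = pd.DataFrame({"Subject": ["ducks", "Nigel_Cole"], "Computed": [0.93, 0.42]})

train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)

# Keep exactly the training columns, in training order:
#  - columns missing from the test frame are created and filled with 0
#  - columns that exist only in the test frame are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_encoded.columns))
print(test_aligned.shape, train_encoded.shape)
```

Dropping the test-only columns means that values never seen during training contribute no indicator at all, which is the same behaviour as fix_columns above. As with fix_columns, this has to happen while the data are still DataFrames, before they are converted with np.array.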