Same form of dataset has 2 different shapes

Problem description

I am quite new to machine learning and am still getting to grips with the techniques. I am trying to train a model with each of the classifiers below, using a dataset that has 4 features and a target class (the truth value, 1 or 0).

Classifiers

  • SGD Classifier
  • Random Forest Classifier
  • Linear Support Vector Classifier
  • Gaussian Process Classifier

I am training the model on the following dataset (only part of it is shown).

Training set: train_sop_truth.csv

Subject,Predicate,Object,Computed,Truth
concept:sportsteam:hawks,concept:teamplaysincity,concept:city:atlanta,0.4255912602,1
concept:stadiumoreventvenue:honda+AF8-center,concept:stadiumlocatedincity,concept:city:anaheim,0.4276425838,1
concept:sportsteam:ducks,concept:teamplaysincity,concept:city:anaheim,0.4762486517,1
concept:sportsteam:n1985+AF8-chicago+AF8-bears,concept:teamplaysincity,concept:city:chicago,0.4106097221,1
concept:stadiumoreventvenue:philips+AF8-arena,concept:stadiumlocatedincity,concept:city:atlanta,0.4190083146,1
concept:stadiumoreventvenue:united+AF8-center,concept:stadiumlocatedincity,concept:city:chicago,0.4211134315,1

The test dataset is in a separate .csv file, test_sop_truth.csv.

Test set: test_sop_truth.csv

Subject,Predicate,Object,Computed,Truth
Nigel_Cole,isMarriedTo,Kate_Isitt,0.9350595474,1
Véra_Clouzot,isMarriedTo,Henri-Georges_Clouzot,0.4773990512,1
Norodom_Sihanouk,produced,The_Last_Days_of_Colonel_Savath,0.3942225575,1
Farouk_of_Egypt,isMarriedTo,Farida_of_Egypt,0.4276426733,1

I then checked the shape of the features for each dataset. Since I apply the same transformations to both, I expected the same number of feature columns, but the numbers differ.

Python code

import pandas as pd
import numpy as np
from sklearn import model_selection  # needed for train_test_split below
from termcolor import colored

features = pd.read_csv('../Data/train_sop_truth.csv')
testFeatures = pd.read_csv('../Data/test_sop_truth.csv')
print(features.head(5))

print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)

print()
print(colored('\n     DESCRIPTIVE STATISTICS\n','yellow'))
print(colored(features.describe(),'cyan'))
print()
print(colored(testFeatures.describe(),'cyan'))


features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)

features.iloc[:,5:].head(5)
testFeatures.iloc[:,5:].head(5)

labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])


features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)

feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)

features = np.array(features)
testFeatures = np.array(testFeatures)

train_samples = 100


testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size=0.25, random_state = 42)

X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)

print(colored('\n    TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)

print()

print(colored('\n    TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)

Output

The shape of our features is: (1860, 5)

The shape of our Test features is: (1386, 5)


     DESCRIPTIVE STATISTICS

          Computed        Truth
count  1860.000000  1860.000000
mean      0.443222     0.913441
std       0.110788     0.281264
min       0.000000     0.000000
25%       0.418164     1.000000
50%       0.427643     1.000000
75%       0.450023     1.000000
max       1.000000     1.000000

          Computed        Truth
count  1386.000000  1386.000000
mean      0.511809     0.992063
std       0.197954     0.088765
min       0.009042     0.000000
25%       0.418649     1.000000
50%       0.429140     1.000000
75%       0.515809     1.000000
max       1.702856     1.000000

    TRAINING & TESTING SETS

Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)


    TRAINING & TESTING SETS

Training Features Shape: (1039, 1790)
Training Labels Shape: (347, 1790)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)

What I do not understand is how the number of feature columns can differ, 1045 for features (the training set) and 1790 for testFeatures (the test set), when both went through the same transformations and the two CSV files have the same columns in the same form.

Any suggestions or clarifications in this regard will be much appreciated.

Recommended answer

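pd.get_dummies creates one column for every distinct value of each categorical variable. When you apply it to the test dataset separately, the resulting columns depend on the values present in that file, so the test set can gain columns the training set does not have and lack columns the training set does have. The helper below adds the missing columns (filled with 0) and keeps only the training columns, in the training order.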
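To make the cause concrete, here is a minimal sketch with two made-up frames (the values are illustrative and not taken from the CSV files in the question). Both frames have the same two columns, yet encoding them separately yields different widths because the categories differ. The columns argument is passed explicitly only to keep the sketch independent of dtype inference.

```python
import pandas as pd

# Two tiny, hypothetical frames: same columns, different category values.
train = pd.DataFrame({
    "Predicate": ["teamplaysincity", "stadiumlocatedincity"],
    "Computed": [0.42, 0.43],
})
test = pd.DataFrame({
    "Predicate": ["isMarriedTo", "produced", "directed"],
    "Computed": [0.93, 0.39, 0.47],
})

# Each call sees only its own frame, so each builds its own set of dummy columns.
train_encoded = pd.get_dummies(train, columns=["Predicate"])
test_encoded = pd.get_dummies(test, columns=["Predicate"])

print(train_encoded.shape)  # 1 numeric column + 2 dummy columns
print(test_encoded.shape)   # 1 numeric column + 3 dummy columns

# Columns that appear in only one of the two encoded frames.
print(sorted(set(train_encoded.columns) ^ set(test_encoded.columns)))
```

The same effect, scaled up to hundreds of distinct Subject, Predicate and Object values, would explain the 1045 versus 1790 columns reported in the question.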

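def add_missing_dummy_columns(d, columns):
    # Add, as all-zero columns, every training column that d lacks.
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0

def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)

    # Make sure we have all the columns we need.
    assert set(columns) - set(d.columns) == set()

    # Report columns that exist only in d; they are dropped below.
    extra_cols = set(d.columns) - set(columns)
    if extra_cols:
        print("extra columns:", extra_cols)

    # Keep only the training columns, in the training order.
    return d[columns]

# Call this right after pd.get_dummies, while both objects are still
# DataFrames (before the np.array conversion in the question's code).
testFeatures = fix_columns(testFeatures, features.columns)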
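As an alternative sketch, not part of the original answer, DataFrame.reindex can do the same alignment in one call: columns missing from the test frame are created and filled with 0, and columns the training frame never had are dropped. The frames and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "chicago" occurs in both frames, "boston" only in the test frame.
train = pd.DataFrame({
    "Object": ["atlanta", "anaheim", "chicago"],
    "Computed": [0.43, 0.48, 0.41],
})
test = pd.DataFrame({
    "Object": ["chicago", "boston"],
    "Computed": [0.93, 0.39],
})

train_encoded = pd.get_dummies(train, columns=["Object"])
test_encoded = pd.get_dummies(test, columns=["Object"])

# Align the test columns to the training columns:
# missing dummy columns are added as 0, unseen ones (Object_boston) are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_encoded.columns))
print(test_aligned.shape)
```

Judging only from the rows shown in the question, the training and test files appear to share few categorical values, so after either kind of alignment most dummy columns in the test set would likely be all zeros. That is worth checking before relying on the model's test scores.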
