Same form of dataset has 2 different shapes

Views: 124 Posted: 2020/5/4 10:12:58 Tags: python machine-learning classification training-data

Problem description

I am quite new to Machine Learning and am just grasping the techniques. As such, I am trying to train a model on the following classifiers using a dataset that has 4 features and the target feature/class (the truth value 1 or 0).

Classifiers

SGD Classifier
Random Forest Classifier
Linear Support Vector Classifier
Gaussian Process Classifier

I am training the model on the following dataset [Part of the dataset is shown below].

Training set: train_sop_truth.csv

Subject,Predicate,Object,Computed,Truth
concept:sportsteam:hawks,concept:teamplaysincity,concept:city:atlanta,0.4255912602,1
concept:stadiumoreventvenue:honda+AF8-center,concept:stadiumlocatedincity,concept:city:anaheim,0.4276425838,1
concept:sportsteam:ducks,concept:teamplaysincity,concept:city:anaheim,0.4762486517,1
concept:sportsteam:n1985+AF8-chicago+AF8-bears,concept:teamplaysincity,concept:city:chicago,0.4106097221,1
concept:stadiumoreventvenue:philips+AF8-arena,concept:stadiumlocatedincity,concept:city:atlanta,0.4190083146,1
concept:stadiumoreventvenue:united+AF8-center,concept:stadiumlocatedincity,concept:city:chicago,0.4211134315,1

The test dataset is in another .csv file as test_sop_truth.csv.

Test set: test_sop_truth.csv

Subject,Predicate,Object,Computed,Truth
Nigel_Cole,isMarriedTo,Kate_Isitt,0.9350595474,1
Véra_Clouzot,isMarriedTo,Henri-Georges_Clouzot,0.4773990512,1
Norodom_Sihanouk,produced,The_Last_Days_of_Colonel_Savath,0.3942225575,1
Farouk_of_Egypt,isMarriedTo,Farida_of_Egypt,0.4276426733,1

Then I wanted to check the shape of the features for each and expected to see the same number of features as I am applying the same transformations to both datasets. But they differed.

Python code

import pandas as pd
import numpy as np
from sklearn import model_selection  # provides train_test_split, used further down
from termcolor import colored

features = pd.read_csv('../Data/train_sop_truth.csv')
testFeatures = pd.read_csv('../Data/test_sop_truth.csv')
print(features.head(5))

print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)

print()
print(colored('\n     DESCRIPTIVE STATISTICS\n','yellow'))
print(colored(features.describe(),'cyan'))
print()
print(colored(testFeatures.describe(),'cyan'))


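# One-hot encode the string columns (Subject, Predicate, Object) of each frame separately
features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)

# Peek at the encoded columns (only displayed in an interactive session)
features.iloc[:,5:].head(5)
testFeatures.iloc[:,5:].head(5)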

labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])


features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)

feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)

features = np.array(features)
testFeatures = np.array(testFeatures)

train_samples = 100


# Split the data loaded from test_sop_truth.csv
testX_train, testX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size=0.25, random_state = 42)

# Split the data loaded from train_sop_truth.csv
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)

# Note: in both groups below, the captions of the second and third lines are swapped
# relative to what they print: X_test holds the testing features, y_train the training labels.
print(colored('\n    TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)

print()

print(colored('\n    TRAINING & TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), testX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)

Output

The shape of our features is: (1860, 5)

The shape of our Test features is: (1386, 5)


     DESCRIPTIVE STATISTICS

          Computed        Truth
count  1860.000000  1860.000000
mean      0.443222     0.913441
std       0.110788     0.281264
min       0.000000     0.000000
25%       0.418164     1.000000
50%       0.427643     1.000000
75%       0.450023     1.000000
max       1.000000     1.000000

          Computed        Truth
count  1386.000000  1386.000000
mean      0.511809     0.992063
std       0.197954     0.088765
min       0.009042     0.000000
25%       0.418649     1.000000
50%       0.429140     1.000000
75%       0.515809     1.000000
max       1.702856     1.000000

    TRAINING & TESTING SETS

Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)


    TRAINING & TESTING SETS

Training Features Shape: (1039, 1790)
Training Labels Shape: (347, 1790)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)
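
As a cross-check on the row counts in this output, the sketch below reproduces the 75/25 split sizes by hand. It assumes that train_test_split rounds the test portion up when test_size is a fraction, which is consistent with the numbers printed above; split_sizes is a hypothetical helper, not a scikit-learn function. The second number in each 2-D shape is the column count after encoding and is not affected by the split.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Row counts taken from the shapes printed above.
print(split_sizes(1860))  # training file: (1395, 465)
print(split_sizes(1386))  # test file: (1039, 347)
```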

What I do not understand here is how the feature shapes can differ, with 1045 columns for features (the training set) and 1790 for testFeatures (the test set), even though both undergo the same transformations and the two CSV files have the same number and form of columns.

Any suggestions or clarifications in this regard will be much appreciated.
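
One likely explanation, sketched here on toy data rather than the real files: pd.get_dummies creates one column per distinct string value it finds in the frame it is given, so two frames with the same five CSV columns but different Subject/Predicate/Object values end up with different widths. The small frames below are hypothetical stand-ins for train_sop_truth.csv and test_sop_truth.csv. Reindexing the encoded test frame to the training columns is one common way to force matching shapes.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical rows, same five columns).
train = pd.DataFrame({
    "Subject": ["hawks", "ducks", "bears"],
    "Predicate": ["teamplaysincity"] * 3,
    "Object": ["atlanta", "anaheim", "chicago"],
    "Computed": [0.43, 0.48, 0.41],
    "Truth": [1, 1, 1],
})
test = pd.DataFrame({
    "Subject": ["Nigel_Cole", "Norodom_Sihanouk"],
    "Predicate": ["isMarriedTo", "produced"],
    "Object": ["Kate_Isitt", "The_Last_Days_of_Colonel_Savath"],
    "Computed": [0.94, 0.39],
    "Truth": [1, 1],
})

# Encoding each frame on its own: one dummy column per distinct string value
# in that frame, so the widths depend on the data, not on the CSV layout.
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(train_enc.shape, test_enc.shape)

# Align the test frame to the training columns: columns unseen in training
# are dropped, and training columns missing from the test frame are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.shape)
```

Note that in the samples shown in the question the two files appear to share no Subject, Predicate, or Object values at all, so after alignment every dummy column of the test frame would be 0 and only Computed would carry information. If that holds for the full files, one-hot encoding the raw entity names may not be a useful feature representation for this task.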
