Run a trained Machine Learning model on a different dataset
Problem description
I am new to machine learning and am trying to run a simple classification model, which I trained and saved using pickle, on another dataset of the same format. I have the following Python code.
Code
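import numpy as np
import pandas as pd
from sklearn import model_selection
from termcolor import colored  # assumption: colored() comes from the termcolor package

# Training set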
features = pd.read_csv('../Data/Train_sop_Computed.csv')
#Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')
print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)
features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)
features.iloc[:,5:].head(5)
testFeatures.iloc[:,5].head(5)
labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])
features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)
feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)
def add_missing_dummy_columns(d, columns):
    # add any one-hot column seen in training but absent here, filled with 0
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0

def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)
    # make sure we have all the columns we need
    assert (set(columns) - set(d.columns) == set())
    extra_cols = set(d.columns) - set(columns)
    if extra_cols: print("extra columns:", extra_cols)
    # keep only the training columns, in the training order
    d = d[columns]
    return d
testFeatures = fix_columns(testFeatures, features.columns)
features = np.array(features)
testFeatures = np.array(testFeatures)
train_samples = 100
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
print(colored('\n TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)
print(colored('\n TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)
from sklearn.metrics import precision_recall_fscore_support
import pickle
loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(textX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'),result_RFC)
loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(textX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'),result_SVC)
loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(textX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'),result_GPC)
loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(textX_test, testy_test)
print(colored('Stochastic Gradient Descent: ','magenta'),result_SGD)
I am able to get the results for the test set.
But the problem I am facing is that I need to run the model on the entire Test_sop_Computed.csv dataset, and it is only being run on the test subset that I've split off. I would sincerely appreciate any suggestions on how I can run the loaded models on the entire dataset. I believe I'm going wrong with the following line of code:
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
Both the train and test datasets have the columns Subject, Predicate, Object, Computed and Truth, with Truth being the class to predict. The testing dataset has the actual values for this Truth column; I drop it using testFeatures = testFeatures.drop('Truth', axis = 1) and intend to use the various loaded classifier models to predict Truth as 0 or 1 for the entire dataset and then get the predictions as an array.
I have done this so far, but I think that I am splitting my test dataset as well. Is there a way to pass the entire test dataset, even though it is in another file?
This test dataset is in the same format as the training set. I have checked the shapes of the two and I get the following.
Confirming features and shapes
Shape of the Train features is: (1860, 5)
Shape of the Test features is: (1386, 5)
TRAINING SET
Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)
TEST SETS
Training Features Shape: (1039, 1045)
Training Labels Shape: (347, 1045)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)
Any suggestions in this regard will be highly appreciated.
Recommended answer
Your question is a bit unclear, but as I understand it, you want to run your model on both testX_train and testX_test (spelled textX_test in your code), which are just testFeatures split into two subsets.
So either you can run your model on testX_train the same way you did for the held-out part, e.g.:
result_RFC_train = loaded_model_RFC.score(testX_train, testy_train)
or you can just remove the following line:
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
That way you don't split your data at all, and you run the model on the full dataset:
result_RFC_train = loaded_model_RFC.score(testFeatures, testlabels)
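The question also asks for the predictions as an array, which score() does not return. The following is a minimal, self-contained sketch of that step, not the asker's actual pipeline. It rests on several assumptions: the two small DataFrames are made-up stand-ins for the CSV files, the model is pickled to an in-memory bytes object instead of ../other/SOPmodel_RFC, and the column alignment uses DataFrame.reindex as a swapped-in alternative to the fix_columns helper from the question.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-ins for Train_sop_Computed.csv and Test_sop_Computed.csv.
train = pd.DataFrame({
    "Subject": ["a", "b", "a", "c", "b", "c"],
    "Computed": [0.9, 0.1, 0.8, 0.2, 0.3, 0.7],
    "Truth": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({
    "Subject": ["a", "d", "b", "c"],  # "d" never appears in the training data
    "Computed": [0.85, 0.15, 0.2, 0.75],
    "Truth": [1, 0, 0, 1],
})

# Training side: one-hot encode, fit, and pickle the model.
X_train = pd.get_dummies(train.drop("Truth", axis=1), dtype=float)
y_train = train["Truth"].to_numpy()
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train.to_numpy(), y_train)
blob = pickle.dumps(model)  # in-memory stand-in for the pickle file on disk

# Scoring side: encode the WHOLE test set and align it to the training columns.
# Dummy columns missing from the test set become 0; unseen ones are dropped.
X_full = pd.get_dummies(test.drop("Truth", axis=1), dtype=float)
X_full = X_full.reindex(columns=X_train.columns, fill_value=0.0)
y_full = test["Truth"].to_numpy()

loaded = pickle.loads(blob)
predictions = loaded.predict(X_full.to_numpy())     # one label per row, no split
accuracy = loaded.score(X_full.to_numpy(), y_full)  # mean accuracy on all rows

print(predictions.shape)
print(accuracy)
```

With the real files, the same two calls would presumably be loaded_model_RFC.predict(testFeatures) and loaded_model_RFC.score(testFeatures, testlabels), applied after fix_columns and with no train_test_split on the test data.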