Merging results from model.predict() with original pandas DataFrame?

Question

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

# np.matrix is deprecated and rejected by recent scikit-learn; use a plain ndarray
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

Recommended answer

Your y_hats array is only as long as the test data (20% of the rows) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test with the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
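
The validate-then-predict sequence described above can be sketched roughly as follows. This is a minimal sketch, not the answerer's exact code: accuracy_score is one possible metric, the name test_accuracy is hypothetical, and random_state is an assumption added only to make the run reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an assumption, fixed only so the sketch is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 1: validate on the held-out rows only
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 2: predict for every row; the result has one value per row of df,
# so the assignment no longer raises a length mismatch
df['y_hats'] = model.predict(X)
```

Note that with this approach the predictions for the training rows are in-sample, so they should not be used to judge the model.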

EDIT: per your comment, here is an updated version that returns the full dataset with the predictions appended only to the rows that were in the test dataset:

from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

# assign() returns a new DataFrame, which avoids a possible SettingWithCopyWarning
y_test = y_test.assign(preds=y_hats)

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
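
If the feature matrix has to be built before the split, as in the question, one possible alternative is to pass the row labels through train_test_split alongside X and y and then write the predictions back with df.loc. This is a sketch under stated assumptions rather than part of the original answer: idx_train and idx_test are hypothetical names, and random_state is fixed only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Feature matrix and target built from the whole frame, before splitting
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split accepts any number of equal-length arrays and splits
# them consistently, so the row labels travel with the rows of X and y.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# Start with NaN everywhere, then fill in only the test rows
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = y_hats
```

Rows that went into the training set keep np.nan in the y_hats column, which is the layout the question asks for. Because the column contains NaN, it is stored as float rather than int.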
