Merging results from model.predict() with original pandas DataFrame?

Question

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

Recommended answer

Your y_hats will only be as long as the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test against the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
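
For reference, here is a minimal end-to-end sketch of that validate-then-predict flow. It assumes a recent scikit-learn, where train_test_split lives in sklearn.model_selection. The use of accuracy_score as the validation metric and the random_state values are my own assumptions for illustration, not part of the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an assumption, added only to make the sketch reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Step 1: validate on the held-out rows only
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Step 2: predict for every row, so the result has the same length as df
df['y_hats'] = model.predict(X)
```

Keep in mind that the values in y_hats for rows the model was trained on are in-sample predictions, so they will tend to look better than test_accuracy suggests.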

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended on the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
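
If the matrices X and y have to be built before the split, as the question describes, one possible alternative is to split the row labels alongside them. This is a sketch rather than part of the original answer: it relies on train_test_split accepting any number of equal-length arrays and splitting them consistently. The names idx_train and idx_test and the random_state values are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

# Build the matrices first, as in the question
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# Split the row labels together with X and y so that each part
# remembers which rows of df it came from
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Training rows stay NaN; test rows receive their predictions
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

Because the column is initialised with np.nan, the predictions are stored as floats; rows used for training keep NaN, which matches what the question asks for.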
