Merging results from model.predict() with original pandas DataFrame?
Question
I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
X = np.asarray(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
But this raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting the dataframe into train and test first, and then building the feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Recommended answer
Your y_hats array will only be as long as the test data (20%), because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the model's predictions on X_test against the true values in y_test), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
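The length mismatch behind the ValueError can be reproduced in isolation. This is only a minimal sketch with made-up sizes: the 10-row frame, the two "test" predictions, and the names test_preds, full_preds and error_message are illustrative stand-ins, not the asker's data or model.

```python
import numpy as np
import pandas as pd

# 10 rows stand in for the full dataset (hypothetical sizes).
df = pd.DataFrame({"feature": np.arange(10)})

# 2 values stand in for predictions made on X_test only.
test_preds = np.array([1, 0])

# One value per row stands in for model.predict(X) on the full dataset.
full_preds = np.zeros(len(df), dtype=int)

# Assigning a bare array requires exactly one value per row,
# so the shorter array is rejected.
try:
    df["y_hats"] = test_preds
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# The full-length array fits, which is why predicting on X works.
df["y_hats"] = full_preds
```

A plain NumPy array carries no row labels, so pandas can only match it to the frame by position and length.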
EDIT: Per your comment, here is an updated version that returns the dataset with the predictions appended to the rows that were in the test dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
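# attach the predictions to the test labels, which still carry the original row index
y_test = y_test.copy()
y_test['preds'] = y_hats
# left-join on the index so that rows used for training get NaN in 'preds'
df_out = pd.merge(df, y_test[['preds']], how='left', left_index=True, right_index=True)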
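If the feature matrix has to be built as a plain array before splitting, as the question describes, one possible alternative is to pass the row labels through train_test_split alongside X and y, since it splits any number of equal-length arrays consistently. The sketch below is an assumption about how that could look, not part of the original answer; idx_train, idx_test and the fixed random_state are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df["class"] = data.target

# Build plain arrays first, as in the question.
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df["class"].to_numpy()

# Split the row labels together with X and y so the test rows stay identifiable.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_hats = model.predict(X_test)

# Start with NaN everywhere, then fill in only the rows that were in the test set.
df["y_hats"] = np.nan
df.loc[idx_test, "y_hats"] = y_hats
```

Rows used for training keep NaN in y_hats. When the split is done on DataFrames directly, as in the code above, assigning pd.Series(y_hats, index=X_test.index) to a new column aligns on the index in the same way.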