Merging results from model.predict() with original pandas DataFrame?
Question
I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.
from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df['class'] = data.target
# plain ndarray; NumPy no longer recommends np.matrix
X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = np.array(df['class'])
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
To merge these predictions back with the original df, I try this:
df['y_hats'] = y_hats
but this raises:
ValueError: Length of values does not match length of index
I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.
Recommended answer
Your y_hats will only have the length of the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the predictions on X_test against the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:
y_hats2 = model.predict(X)
df['y_hats'] = y_hats2
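As a rough sketch of the validate-then-predict flow described above, the whole script might look like the following. It assumes scikit-learn 0.20 or later (where train_test_split lives in sklearn.model_selection); the random_state=0 arguments and the accuracy_score check are additions for illustration and are not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an assumption added here so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# step 1: validate on the held-out rows only
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# step 2: predict on every row, so the result has one value per row of df
df['y_hats'] = model.predict(X)
```

Note that with this approach the y_hats values for rows that were in the training set are predictions on data the model has already seen, so they should not be used to judge model quality.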
EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended to the rows that were in the test dataset:
from sklearn.datasets import load_iris
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_iris()
# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)
# add outcome variable
df_class = pd.DataFrame(data = data.target)
# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# I've got my predictions now
y_hats = model.predict(X_test)
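# y_test is a DataFrame that still carries the original row index of df
y_test = y_test.copy()  # work on a copy rather than a slice of df_class
y_test['preds'] = y_hats
# left join on the index: rows that were in the training set get NaN in 'preds'
df_out = pd.merge(df, y_test[['preds']], how='left', left_index=True, right_index=True)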
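If the features must stay as plain matrices (as in the question, where the whole feature matrix is built before splitting), one possible alternative to the merge above is to pass the row labels to train_test_split as a third array, so they are shuffled and split together with X and y. This is only a sketch under assumptions: the names idx_train and idx_test and the random_state=0 arguments are introduced here for illustration, and scikit-learn 0.20 or later is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split accepts any number of equal-length arrays and splits
# them consistently, so the row labels travel with the features
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# start with NaN everywhere, then fill in only the test rows
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

Rows that went into the training set keep np.nan in y_hats, which matches what the question asks for, and no DataFrame-level split is needed.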