Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?
Question
I have a random forest model built with sklearn. The model is built in one file, and I have a second file where I use joblib to load the model and apply it to new data. The data has categorical fields that are converted via sklearn's preprocessing LabelEncoder.fit_transform. Once the prediction is made, I attempt to reverse this conversion with LabelEncoder.inverse_transform.
Here is the code:
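import pandas as pd
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # a single encoder instance, reused for every column
#transform the categorical rf inputs
df["method"] = le.fit_transform(df["method"])
df["vendor"] = le.fit_transform(df["vendor"])
df["type"] = le.fit_transform(df["type"])
df["name"] = le.fit_transform(df["name"])
df["address"] = le.fit_transform(df["address"])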
#designate inputs for rf model
inputs = ["amt","vendor","type","name","address","method"]
#load rf model and run it on new data
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
rf = joblib.load('rf.pkl')
predict = rf.predict(df[inputs])
#reverse LabelEncoder fit_transform
df["method"] = le.inverse_transform(df["method"])
df["vendor"] = le.inverse_transform(df["vendor"])
df["type"] = le.inverse_transform(df["type"])
df["name"] = le.inverse_transform(df["name"])
df["address"] = le.inverse_transform(df["address"])
#convert target to numeric to make it play nice with SQL Server
predict = pd.to_numeric(predict)
#add target field to df
df["prediction"] = predict
#write results to SQL Server table
import sqlalchemy
engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@UserDSN")
df.to_sql('TABLE_NAME', engine, schema='SCHEMANAME', if_exists='replace', index=False)
Without the inverse_transform step, the results are as expected: numeric codes in place of categorical values. With the inverse_transform step, the results are odd: every categorical field comes back with values from the "address" field.
So if 1600 Pennsylvania Avenue is encoded as the number 1, every categorical value encoded as 1, regardless of field, now comes back as 1600 Pennsylvania Avenue. Why is inverse_transform picking one column from which to reverse all fit_transform codes?
Recommended answer
I know this is an old question, but for everyone who likes convenience:
apply, coupled with a lambda, can transform multiple or all columns with ease:
df = df.apply(lambda col: le.fit_transform(col))
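One caveat is worth spelling out, because it is also the likely cause of the behaviour in the question: every call to fit_transform refits the encoder, so a single shared le only remembers the classes of the last column it saw. The sketch below illustrates this on a small made-up DataFrame (the column values are hypothetical, not taken from the question's data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the question's DataFrame
df = pd.DataFrame({
    "vendor": ["acme", "globex", "acme"],
    "address": [
        "1600 Pennsylvania Avenue",
        "10 Downing Street",
        "1600 Pennsylvania Avenue",
    ],
})

le = LabelEncoder()

vendor_codes = le.fit_transform(df["vendor"])
print(le.classes_)  # classes learned from "vendor"

address_codes = le.fit_transform(df["address"])
print(le.classes_)  # refitting replaced them with the classes from "address"

# Decoding the vendor codes with the refitted encoder returns address values
decoded = le.inverse_transform(vendor_codes)
print(decoded)
```

Because "address" was the last column fitted, inverse_transform maps every code, from any column, back to an address.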
I despise non-aliased, non-dynamic code like the following (you should too), unless it is really necessary:
df["method"] = le_method.fit_transform(df["method"])
df["vendor"] = le_vendor.fit_transform(df["vendor"])
df["type"] = le_type.fit_transform(df["type"])
df["name"] = le_name.fit_transform(df["name"])
df["address"] = le_address.fit_transform(df["address"])
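The separate le_method, le_vendor, ... encoders above are what make inverse_transform work per column, since each encoder keeps its own classes. A more dynamic way to get the same effect is to keep one encoder per column in a dictionary. This is a minimal sketch on made-up data; the encoders dictionary, the categorical list, and the sample values are illustrative names, not part of the original code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the question's DataFrame
df = pd.DataFrame({
    "amt": [10.0, 25.5, 7.2],
    "method": ["card", "cash", "card"],
    "vendor": ["acme", "globex", "initech"],
})
original = df.copy()

categorical = ["method", "vendor"]  # assumed list of categorical columns

# Fit one encoder per column and remember it under the column name
encoders = {}
for col in categorical:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# ... run the model on the encoded frame here ...

# Reverse each column with the encoder that was fitted on it
for col in categorical:
    df[col] = encoders[col].inverse_transform(df[col])

print(df)
```

If the model is trained in one file and applied in another, as in the question, it is probably safer to save the fitted encoders with joblib alongside the model and call transform (not fit_transform) on the new data, since refitting on new data can assign different codes than the ones the model was trained on.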