Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?

Problem description

I have a random forest model built with sklearn. The model is built in one file, and in a second file I use joblib to load the model and apply it to new data. The data has categorical fields that are converted with sklearn's preprocessing LabelEncoder.fit_transform. Once the prediction is made, I attempt to reverse this conversion with LabelEncoder.inverse_transform.

Here is the code:

 #transform the categorical rf inputs
 df["method"] = le.fit_transform(df["method"])
 df["vendor"] = le.fit_transform(df["vendor"])
 df["type"] = le.fit_transform(df["type"])
 df["name"] = le.fit_transform(df["name"])
 df["address"] = le.fit_transform(df["address"])

 #designate inputs for rf model
 inputs = ["amt","vendor","type","name","address","method"]

 #load rf model and run it on new data
 import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
 rf = joblib.load('rf.pkl')
 predict = rf.predict(df[inputs])

 #reverse LabelEncoder fit_transform
 df["method"] = le.inverse_transform(df["method"])
 df["vendor"] = le.inverse_transform(df["vendor"])
 df["type"] = le.inverse_transform(df["type"])
 df["name"] = le.inverse_transform(df["name"])
 df["address"] = le.inverse_transform(df["address"])

 #convert target to numeric to make it play nice with SQL Server
 predict = pd.to_numeric(predict)

 #add target field to df
 df["prediction"] = predict

 #write results to SQL Server table
 import sqlalchemy
 engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@UserDSN")
 df.to_sql('TABLE_NAME', engine, schema='SCHEMANAME', if_exists='replace', index=False)

Without the inverse_transform block, the results are as expected: numeric codes in place of categorical values. With the inverse_transform block, the results are odd: every categorical field comes back with the categorical values that belong to the "address" field.

So if 1600 Pennsylvania Avenue is encoded as the number 1, every categorical value encoded as 1 (regardless of field) now returns 1600 Pennsylvania Avenue. Why is inverse_transform picking one column from which to reverse all fit_transform codes?

Recommended answer

I know this is an old question, but for everyone who likes convenience:

apply, coupled with a lambda, can transform multiple or all columns with ease:

df = df.apply(lambda col: le.fit_transform(col))
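One caveat when a single le is reused like this: each fit_transform call refits the encoder, so afterwards it only remembers the classes of the last column it saw. That matches the behavior described in the question, where le was last fitted on "address". A minimal sketch with made-up toy data (the column values are assumptions for illustration, not from the question):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({
    "vendor": ["acme", "globex", "acme"],
    "address": [
        "1600 Pennsylvania Avenue",
        "221B Baker Street",
        "1600 Pennsylvania Avenue",
    ],
})

le = LabelEncoder()

# First fit: the encoder learns the vendor classes
vendor_codes = le.fit_transform(df["vendor"])
classes_after_vendor = list(le.classes_)

# Second fit on the same object: the vendor classes are replaced
address_codes = le.fit_transform(df["address"])
classes_after_address = list(le.classes_)

# Decoding vendor codes now yields addresses, because only the
# address classes are still stored on the encoder
decoded_vendor = list(le.inverse_transform(vendor_codes))

print(classes_after_vendor)
print(classes_after_address)
print(decoded_vendor)
```

Inspecting le.classes_ after each fit is a quick way to check which column an encoder currently "belongs" to.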

I despise repetitive, non-dynamic code like the following (you should too), unless it is really necessary:

 df["method"] = le_method.fit_transform(df["method"])
 df["vendor"] = le_vendor.fit_transform(df["vendor"])
 df["type"] = le_type.fit_transform(df["type"])
 df["name"] = le_name.fit_transform(df["name"])
 df["address"] = le_address.fit_transform(df["address"])
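Verbose as it is, giving each column its own encoder, as above, is what makes inverse_transform return the right values. The same idea can be kept dynamic by storing one encoder per column in a dictionary. This is only a sketch: the toy data, the categorical_cols list, and the encoders name are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({
    "vendor": ["acme", "globex", "acme"],
    "address": [
        "1600 Pennsylvania Avenue",
        "221B Baker Street",
        "1600 Pennsylvania Avenue",
    ],
})
original = df.copy()

categorical_cols = ["vendor", "address"]  # assumed list of columns to encode

# One fitted encoder per column, keyed by column name
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

encoded = df.copy()

# ... the model's predict call would go here ...

# Reverse each column with the encoder that was fitted on it
for col in categorical_cols:
    df[col] = encoders[col].inverse_transform(df[col])

print(encoded)
print(df)
```

If the model is trained in a separate script, as in the question, it may also be worth saving the fitted encoders next to the model (for example with joblib.dump) and calling transform rather than fit_transform on new data, so the codes match the ones seen during training.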
