Label encoding across multiple columns in scikit-learn

Question

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all my columns of data.

Throwing the entire DataFrame into LabelEncoder produces the error below. Please bear in mind that I'm using dummy data here; in reality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

Recommended answer

You can do this easily, though:

from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)
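
As a minimal sketch of what this one-liner does on the question's dummy data (the variable name `encoded` is illustrative and not part of the original answer): `DataFrame.apply` hands each column to `fit_transform` as a one-dimensional Series, which is the shape LabelEncoder expects. Each column gets its own integer codes, assigned in sorted order of that column's labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York'],
})

# apply() calls fit_transform once per column, so each column is
# refit and encoded independently of the others.
encoded = df.apply(LabelEncoder().fit_transform)

print(encoded)
```

Note that the single encoder instance is refit on every column, so the per-column mappings are not kept after the call. That is fine for a one-off encoding but not enough if you later need `transform` or `inverse_transform`.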

As of scikit-learn 0.20, the recommended way is

from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
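
A hedged sketch of the ColumnTransformer approach, assuming scikit-learn 0.20 or later. The choice of which columns to encode, the transformer name `'onehot'`, and the variable names are illustrative only. Keep in mind that OneHotEncoder produces one 0/1 indicator column per category rather than a single integer code per column, so the output shape differs from what LabelEncoder gives.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York'],
})

# One-hot encode only 'pets' and 'location'. Columns not listed are
# dropped by default (remainder='drop'). sparse_threshold=0 forces a
# dense NumPy array so the result is easy to inspect.
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), ['pets', 'location'])],
    sparse_threshold=0,
)
result = ct.fit_transform(df)

# 3 pet categories + 2 location categories = 5 indicator columns
print(result.shape)
print(ct.named_transformers_['onehot'].categories_)
```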

Since this answer is over a year old and has received many upvotes (including a bounty), I should probably extend it further.

For inverse_transform and transform, you need a small hack.

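from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
d = defaultdict(LabelEncoder)  # one LabelEncoder per column, created on first access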

With this, you keep each column's fitted LabelEncoder in a dictionary keyed by column name.

# Encode every column, fitting one encoder per column
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Invert the encoding
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Use the fitted encoders to encode future data
df.apply(lambda x: d[x.name].transform(x))
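
Putting the pieces together, here is a self-contained sketch of the round trip on the question's dummy data. The frame `new_df` and the variable names `encoded`, `decoded`, and `new_encoded` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                 'New_York'],
})

# One LabelEncoder per column, created the first time a column name is looked up
d = defaultdict(LabelEncoder)

# Fit and encode each column with its own encoder
encoded = df.apply(lambda col: d[col.name].fit_transform(col))

# Map the integer codes back to the original string labels
decoded = encoded.apply(lambda col: d[col.name].inverse_transform(col))

# Hypothetical new rows that contain only labels seen during fitting
new_df = pd.DataFrame({
    'pets': ['monkey', 'cat'],
    'owner': ['Ron', 'Brick'],
    'location': ['New_York', 'San_Diego'],
})
new_encoded = new_df.apply(lambda col: d[col.name].transform(col))

print(encoded)
print(decoded)
print(new_encoded)
```

One caveat: `transform` raises a ValueError if a column contains a label the corresponding encoder did not see during fitting, so new data must be restricted to known labels or handled separately.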
