Label encoding across multiple columns in scikit-learn


Problem description


I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the DataFrame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all my columns of data.

Passing the entire DataFrame to LabelEncoder produces the error below. Please bear in mind that I'm using dummy data here; in reality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

Traceback (most recent call last):
  File "", line 1, in
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (6, 3)

Any thoughts on how to get around this problem?

Solution

You can do this easily, though:

from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)
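To make the one-liner concrete, here is a minimal sketch on the dummy data from the question. The names `le` and `encoded` are just illustrative. DataFrame.apply hands each column to fit_transform in turn, so every column gets its own integer codes, assigned in sorted order of that column's labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Keep a reference to the encoder so we can inspect it afterwards.
le = LabelEncoder()

# apply() calls le.fit_transform once per column (each column is a Series).
encoded = df.apply(le.fit_transform)

print(encoded['pets'].tolist())      # [0, 1, 0, 2, 1, 1]  (cat=0, dog=1, monkey=2)
print(encoded['owner'].tolist())     # [1, 2, 0, 1, 3, 2]
print(encoded['location'].tolist())  # [1, 0, 0, 1, 1, 0]

# The single encoder is refit for every column, so it only remembers
# the classes of the last column it saw.
print(list(le.classes_))             # ['New_York', 'San_Diego']
```

Because the encoder is refit on each column, this form is fine for a one-off encoding but cannot decode the earlier columns or encode new data later.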

EDIT2:

As of scikit-learn 0.20, the recommended way is

OneHotEncoder().fit_transform(df)

because OneHotEncoder now supports string input. To apply OneHotEncoder to only certain columns, use ColumnTransformer.
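A rough sketch of both variants on the question's dummy data, assuming scikit-learn 0.20 or later. The step name "onehot" and the choice of columns are arbitrary picks for illustration. Note that one-hot encoding produces one output column per distinct category rather than a single integer code per input column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# One-hot encode every column: 3 pets + 4 owners + 2 locations = 9 columns.
onehot_all = OneHotEncoder().fit_transform(df)
print(onehot_all.shape)   # (6, 9)

# One-hot encode only 'pets' and 'location'. With the default
# remainder='drop', the 'owner' column is left out of the result.
ct = ColumnTransformer([("onehot", OneHotEncoder(), ["pets", "location"])])
onehot_some = ct.fit_transform(df)
print(onehot_some.shape)  # (6, 5)
```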

EDIT:

Since the original answer was written over a year ago and has received many upvotes (including a bounty), I should probably extend it further.

To support inverse_transform and transform, you need a small hack.

from collections import defaultdict
d = defaultdict(LabelEncoder)

With this, you retain each column's LabelEncoder in a dictionary keyed by column name.

# Encode the variables (fits one LabelEncoder per column)
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Invert the encoding
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Use the dictionary of fitted encoders to encode future data
df.apply(lambda x: d[x.name].transform(x))
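Put together, the dictionary approach might look like the sketch below. The names `encoders`, `encoded`, `decoded`, and `new_data` are made up for this example, and `new_data` is hypothetical data that only contains labels already seen during fitting; LabelEncoder.transform raises a ValueError for labels it has not seen.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# A missing key creates a fresh LabelEncoder, so each column name
# ends up with its own fitted encoder.
encoders = defaultdict(LabelEncoder)

# Fit and encode every column.
encoded = df.apply(lambda col: encoders[col.name].fit_transform(col))

# Decode back to the original strings using the stored encoders.
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))
print(decoded['pets'].tolist() == df['pets'].tolist())   # True

# Encode new rows with the already-fitted encoders (no refitting).
new_data = pd.DataFrame({
    'pets': ['dog', 'monkey'],
    'owner': ['Ron', 'Brick'],
    'location': ['New_York', 'San_Diego'],
})
new_encoded = new_data.apply(lambda col: encoders[col.name].transform(col))
print(new_encoded['pets'].tolist())      # [1, 2]
print(new_encoded['owner'].tolist())     # [2, 0]
print(new_encoded['location'].tolist())  # [0, 1]
```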

MOAR EDIT:

With Neuraxle's FlattenForEach step, you can also apply the same LabelEncoder to all of the flattened data at once:

FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)
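If you would rather not add a dependency, the same idea (one LabelEncoder shared by every column) can be sketched with plain NumPy reshaping. This is an illustration of the concept, not Neuraxle's implementation; the names `shared`, `codes`, and `encoded` are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

shared = LabelEncoder()

# Flatten all cells into one 1-D array, encode them with a single
# encoder, then restore the original (rows, columns) shape.
codes = shared.fit_transform(df.to_numpy().ravel()).reshape(df.shape)
encoded = pd.DataFrame(codes, index=df.index, columns=df.columns)

# One vocabulary covers every column (uppercase sorts before lowercase).
print(list(shared.classes_))
# ['Brick', 'Champ', 'New_York', 'Ron', 'San_Diego', 'Veronica',
#  'cat', 'dog', 'monkey']

# Decoding works for the whole frame with the same encoder.
decoded = shared.inverse_transform(codes.ravel()).reshape(df.shape)
```

With a shared vocabulary, the same string gets the same code no matter which column it appears in, at the cost of codes that are no longer contiguous within a single column.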

To use a separate encoder for each column, or to encode only some columns and leave the rest untouched, a ColumnTransformer gives you more control over column selection and over the encoder instances. Note that LabelEncoder is meant for 1-D target labels; for feature columns inside a ColumnTransformer, the corresponding encoder is OrdinalEncoder.
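A minimal sketch of the ColumnTransformer route follows. It assumes OrdinalEncoder is swapped in for LabelEncoder, since LabelEncoder only accepts a single 1-D array and does not fit the transformer interface that ColumnTransformer expects. The step name "ordinal" and the choice of columns are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# Encode 'pets' and 'location' as integer codes (stored as floats);
# pass 'owner' through unchanged. Passthrough columns are appended
# after the transformed ones in the output.
ct = ColumnTransformer(
    [("ordinal", OrdinalEncoder(), ["pets", "location"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)

print(out.shape)  # (6, 3)

# The fitted encoder keeps one sorted category list per encoded column.
fitted = ct.named_transformers_["ordinal"]
print([list(c) for c in fitted.categories_])
# [['cat', 'dog', 'monkey'], ['New_York', 'San_Diego']]
```

Unlike LabelEncoder, OrdinalEncoder tracks the categories of every column it was fitted on, so one instance can also inverse_transform the encoded columns later.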
