Sklearn Label编码多列 pandas 数据帧 [英] Sklearn Label Encoding multiple columns pandas dataframe

查看:177
本文介绍了Sklearn Label编码多列 pandas 数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试在其中编码一些包含分类数据的列()大熊猫数据框。完整的数据帧包含超过400列,因此我寻找一种无需对它们进行逐一编码即可对所有所需列进行编码的方法。我使用Scikit-learn LabelEncoder 对分类数据进行编码。

I try to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns so I look for a way to encode all desired columns without having to encode them one by one. I use Scikit-learn LabelEncoder to encode the categorical data.

数据帧的第一部分不必进行编码,但是我正在寻找一种方法,直接对所有包含分类日期的所需列进行编码,而无需拆分和连接数据框。

The first part of the dataframe does not have to be encoded, however I am looking for a method to encode all the desired columns containing categorical date directly without split and concatenate the dataframe.

为了演示我的问题,我首先尝试在数据框的一小部分上解决它。但是,在数据拟合和转换的最后部分卡住了,并得到了 ValueError:错误的输入形状(4,3)。我运行的代码是:

To demonstrate my question I first tried to solve it on a small part of the dataframe. However get stuck at the last part where the data is fitted and transformed and get a ValueError: bad input shape (4,3). The code as I ran:

# Create a simple dataframe resembling large dataframe
    data = pd.DataFrame({'A': [1, 2, 3, 4],
                         'B': ["Yes", "No", "Yes", "Yes"],
                         'C': ["Yes", "No", "No", "Yes"],
                         'D': ["No", "Yes", "No", "Yes"]})


# Import required module
from sklearn.preprocessing import LabelEncoder

# Create an object of the label encoder class
labelencoder = LabelEncoder()

# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:])   # First column does not need to be encoded

完整的错误报告:

labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):

  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)

  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))

ValueError: bad input shape (4, 3)

有人知道怎么做吗?

推荐答案

如以下代码所示,您可以通过应用 LabelEncoder 转到DataFrame。但是,请注意,我们无法获取所有列的类信息。

As the following code, you can encode the multiple columns by applying LabelEncoder to DataFrame. However, please note that we cannot obtain the classes information for all columns.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes

# LabelEncoder
le = LabelEncoder()

# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1

# Note: we cannot obtain the classes information for all columns.
print(le.classes_)
# ['No' 'Yes']

这篇关于Sklearn Label编码多列 pandas 数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆