Sklearn Label Encoding multiple columns pandas dataframe
Question
I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns, so I am looking for a way to encode all the desired columns without having to encode them one by one. I use the Scikit-learn LabelEncoder to encode the categorical data.
The first part of the dataframe does not have to be encoded. However, I am looking for a method to encode all the desired columns containing categorical data directly, without splitting and concatenating the dataframe.
To demonstrate my question, I first tried to solve it on a small part of the dataframe. However, I get stuck at the last step, where the data is fitted and transformed, and get a ValueError: bad input shape (4, 3). The code as I ran it:
# Import required modules
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create a simple dataframe resembling the large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})
# Create an object of the label encoder class
labelencoder = LabelEncoder()
# Apply the labelencoder object to the columns (this call raises the ValueError)
# Note: .ix was removed in pandas 1.0; data.iloc[:, 1:] is the equivalent selection today
labelencoder.fit_transform(data.ix[:, 1:])  # First column does not need to be encoded
The full error report:
labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):
  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (4, 3)
Does anyone know how to do this?
Recommended answer
As the following code shows, you can encode multiple columns by applying LabelEncoder to the DataFrame column by column with df.apply. Note, however, that the classes information is not retained for every column: the same encoder object is refit on each column in turn, so afterwards le.classes_ only reflects the last column it saw.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes
# LabelEncoder
le = LabelEncoder()
# apply "le.fit_transform"
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1
# Note: le was refit on each column in turn, so only the classes of the last column ('D') remain.
print(le.classes_)
# ['No' 'Yes']
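If the classes of every column are needed later (for example to call inverse_transform), one possible variation is to keep a separate LabelEncoder per column and to restrict the encoding to the columns that actually need it. The following is only a sketch under that assumption; the names cols_to_encode and encoders are hypothetical and not part of the original answer, and selecting "every column after the first" simply mirrors the small example in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

# Hypothetical selection: every column after the first one.
cols_to_encode = df.columns[1:]

# Keep one fitted encoder per column, keyed by column name,
# so the classes of each column stay available afterwards.
encoders = {}
df_encoded = df.copy()
for col in cols_to_encode:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df_encoded)

# Classes are now available per column, and codes can be mapped back.
print(encoders['B'].classes_)
print(encoders['D'].inverse_transform(df_encoded['D']))
```

Column A is left untouched here, which matches the requirement in the question that the first part of the dataframe does not have to be encoded.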