One-hot-encoding multiple columns in sklearn and naming columns
Question
I have the following code to one-hot-encode two columns.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)
phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)
What I'm wondering is how to do this in 4 lines while getting properly named columns in the output. That is, I can create a properly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting DataFrame's columns, it tells me that there is a mismatch between the shapes:
ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)
For background, phone and city each have 3 distinct values.
    city    phone
0  CityA   iPhone
1  CityB  Android
2  CityB   iPhone
3  CityA   iPhone
4  CityC  Android
Recommended answer
You are almost there. Like you said, you can pass all the columns you want to encode to fit_transform directly.
ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_
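For context, categories_ is a list with one array of labels per encoded column, which is why passing it straight to columns= produces the shape mismatch. The sketch below checks this on the five sample rows from the question (in which only two phone values happen to appear); the variable names n_label_arrays, n_labels and n_encoded_columns are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows copied from the question.
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[["phone", "city"]]).toarray()

# categories_ holds one array of labels per input column, not one flat list.
n_label_arrays = len(ohe.categories_)
# Total number of labels across all encoded columns.
n_labels = sum(len(labels) for labels in ohe.categories_)
# Number of columns in the encoded array; this matches n_labels, not n_label_arrays.
n_encoded_columns = feature_arr.shape[1]

print(n_label_arrays, n_labels, n_encoded_columns)
```

The encoded array has one column per label, so the labels have to be flattened into a single sequence before they can be used as column names.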
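Then you just need to flatten the per-column label arrays into a single array:
# np.array(feature_labels).ravel() gives the same result, but only when
# every encoded column has the same number of categories
feature_labels = np.concatenate(feature_labels)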
This lets you name the columns as you wanted:
features = pd.DataFrame(feature_arr, columns=feature_labels)