How to perform OneHotEncoding in Sklearn, getting value error


Question

I just started learning machine learning. While practicing one of the tasks, I am getting a ValueError, even though I followed the same steps as the instructor. Please help.

dff

     Country    Name
 0     AUS      Sri
 1     USA      Vignesh
 2     IND      Pechi
 3     USA      Raj

First, I performed label encoding:

X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])

out:
X
array([[0, 'Sri'],
       [2, 'Vignesh'],
       [1, 'Pechi'],
       [2, 'Raj']], dtype=object)

Then I performed one-hot encoding on the same X:

onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()

I am getting the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   1900         """
   1901         return _transform_selected(X, self._fit_transform,
-> 1902                                    self.categorical_features, copy=True)
   1903 
   1904     def _transform(self, X):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1695     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1696     """
-> 1697     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1698 
   1699     if isinstance(selected, six.string_types) and selected == "all":

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'Raj'

Please edit my question if anything is wrong. Thanks in advance!

Recommended answer

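You can now go directly to OneHotEncoder without using LabelEncoder first. As scikit-learn moves toward version 0.22, many might want to do things this way to avoid warnings and potential errors (see the docs and examples). The error in the question occurs because, as the traceback shows, the encoder tries to convert the entire array to float, including the Name column, so the string 'Raj' cannot be converted.

Example code 1, where the categories are specified explicitly: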

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

# Sorted unique values of each column, used as the explicit categories
countries = np.unique(X[:, 0])
names = np.unique(X[:, 1])

# One list of categories per input column
ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()

print(X)


Output for code example 1:

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]
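
To check which output column corresponds to which category, the fitted encoder can be inspected. The following is a minimal sketch, assuming scikit-learn 0.20 or later, where the `categories_` attribute and `inverse_transform` are available. The variable names `encoded`, `learned`, and `decoded` are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
X = np.array(data, dtype=object)

ohe = OneHotEncoder(categories="auto")
encoded = ohe.fit_transform(X).toarray()

# categories_ holds one array of categories per input column; the output
# columns follow this order: all countries first, then all names.
learned = [list(c) for c in ohe.categories_]
print(learned)

# inverse_transform maps the one-hot rows back to the original values.
decoded = ohe.inverse_transform(encoded)
print(decoded)
```

Here `learned` lists the three countries followed by the four names, which matches the column layout of the output above, and `decoded` reproduces the original rows.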


Example code 2, showing the 'auto' option for specifying the categories:

The first three columns encode the country names, the last four the personal names.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

# 'auto' lets the encoder determine the categories from the data
ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()

print(X)

Output for code example 2 (the same as for 1):

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]
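
To make the column layout concrete, here is a rough pure-Python sketch of what one-hot encoding computes for this data. It is only an illustration, not scikit-learn's implementation, and `one_hot_rows` is a hypothetical helper name. It assumes the categories of each column are its sorted unique values, as in the examples above.

```python
data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]


def one_hot_rows(rows):
    """Hypothetical helper: one-hot encode every column of `rows`."""
    n_cols = len(rows[0])
    # Sorted unique values per column play the role of the categories.
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            # One block of zeros per column, with a single 1.0 at the
            # position of this row's value within that column's categories.
            block = [0.0] * len(categories[j])
            block[categories[j].index(value)] = 1.0
            out.extend(block)
        encoded.append(out)
    return categories, encoded


categories, encoded = one_hot_rows(data)
for row in encoded:
    print(row)
```

Each row is the three-element country block followed by the four-element name block, which is why the first three columns encode the country and the last four the name.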


Example code 3, where only the first column is one-hot encoded:

Now, here's the unique part. What if you only need to one-hot encode a specific column of your data?

(Note: I've left the last column as strings for easier illustration. In reality, it makes more sense to do this when the last column is already numerical.)

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:, 0])

ohe = OneHotEncoder(categories=[countries])  # specify ONLY unique country names
tmp = ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

# Append the original Name column (not the sorted unique names),
# so that each name stays on the same row as its country
X = np.append(tmp, X[:, 1].reshape(-1, 1), axis=1)

print(X)

Output for code example 3:

[[1.0 0.0 0.0 'Sri']
 [0.0 0.0 1.0 'Vignesh']
 [0.0 1.0 0.0 'Pechi']
 [0.0 0.0 1.0 'Raj']]
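
As an alternative to encoding one column by hand and appending the rest, scikit-learn's `ColumnTransformer` can apply the encoder to selected columns only. The sketch below is not part of the original answer. It assumes scikit-learn 0.20 or later (where `sklearn.compose.ColumnTransformer` is available); the transformer label "country" and the variable names `ct` and `result` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
X = np.array(data, dtype=object)

ct = ColumnTransformer(
    # One-hot encode column 0 (Country) only.
    transformers=[("country", OneHotEncoder(categories="auto"), [0])],
    # Pass the remaining column (Name) through unchanged.
    remainder="passthrough",
    # Always return a dense array instead of a sparse matrix.
    sparse_threshold=0,
)
result = ct.fit_transform(X)

print(result)
```

The encoded country columns come first, followed by the untouched Name column, and every name stays on the same row as its country.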
