How can I one hot encode in Python?


Problem Description

I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without encoding?

I am trying to do the following for feature selection:

  1. I read the train file:

    num_rows_to_read = 10000
    train_small = pd.read_csv("../../dataset/train.csv",   nrows=num_rows_to_read)
    

  2. I change the type of the categorical features to 'category':

    non_categorial_features = ['orig_destination_distance',
                              'srch_adults_cnt',
                              'srch_children_cnt',
                              'srch_rm_cnt',
                              'cnt']
    
    for categorical_feature in list(train_small.columns):
        if categorical_feature not in non_categorial_features:
            train_small[categorical_feature] = train_small[categorical_feature].astype('category')
    

  3. I use one hot encoding:

    train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
    

The problem is that the third step often gets stuck, even though I am using a powerful machine.

Thus, without the one hot encoding I can't do any feature selection to determine the importance of the features.

What do you recommend?

Solution

Approach 1: You can use pandas' pd.get_dummies.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

Example 2:

The following will transform a given column into a one hot encoding. Use the prefix argument so that dummies coming from multiple columns keep distinguishable names (a short sketch follows this example).

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
   A  a  b  c
0  a  0  1  0
1  b  1  0  0
2  a  0  0  1
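
For instance, here is a minimal sketch of encoding several columns in one call with the prefix argument (reusing the toy frame from above; the exact dtype shown for the dummy columns varies by pandas version):

import pandas as pd

df = pd.DataFrame({
      'A':['a','b','a'],
      'B':['b','a','c']
    })

# Encode both columns in one call; prefix keeps the dummy names
# distinguishable, e.g. A_a vs. B_a
encoded = pd.get_dummies(df, columns=['A', 'B'], prefix=['A', 'B'])
encoded
Out[]: 
   A_a  A_b  B_a  B_b  B_c
0    1    0    0    1    0
1    0    1    1    0    0
2    1    0    0    0    1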

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform on some other data using the same instance. We also have handle_unknown to further control what the encoder does with unseen data.

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

Here is the link for this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
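
Note that the constructor arguments and attributes shown above (categorical_features, n_values, n_values_, feature_indices_) belong to older scikit-learn releases and have since been removed. Below is a minimal sketch of the same fit-then-transform workflow against the current API (assuming scikit-learn 0.23 or later; the color/size data is made up for illustration):

>>> from sklearn.preprocessing import OneHotEncoder
>>> # handle_unknown='ignore' makes transform emit an all-zero block for
>>> # categories never seen during fit, instead of raising an error
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> enc.fit([['red', 'S'], ['blue', 'M'], ['red', 'L']])
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['blue', 'red'], dtype=object), array(['L', 'M', 'S'], dtype=object)]
>>> # 'green' was not seen during fit, so its one-hot block is all zeros
>>> enc.transform([['green', 'M']]).toarray()
array([[0., 0., 0., 1., 0.]])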
