为分类数据创建假人 [英] Create dummies for categorical data

查看：82 发布时间：2020/5/24 4:04:08 python pandas

本文介绍了为分类数据创建假人的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试建立一个Binary分类器，我的大多数变量都是分类的.因此，我想将分类数据处理为虚拟变量. 我有以下数据集:

I'm trying to build a Binary classifier, most of my variables are categorical. Hence I want to process categorical data into dummy vars. I have the following dataset:

ruri                object
ruri_user           object
ruri_domain         object
from_user           object
from_domain         object
from_tag            object
to_user             object
contact_user        object
callid              object
content_type        object
user_agent          object
source_ip           object
source_port          int64
destination_port     int64
contact_ip          object
contact_port         int64
toll_fraud           int64

在16个功能中，我将仅选择10个功能

I will pick only few features 10 out of 16:

def select_features(self, data):
        """Selects the features that we'll use in the model. Drops unused features"""
        features = ['ruri', 
                    'ruri_user', 
                    'ruri_domain', 
                    'from_user', 
                    'from_domain', 
                    'from_tag', 
                    'to_user',
                    'contact_user', 
                    'callid', 
                    'content_type', 
                    'user_agent', 
                    'source_ip', 
                    'source_port',
                    'destination_port', 
                    'contact_ip', 
                    'contact_port']
        dropped_features = ['ruri', 'ruri_domain', 'callid', 'from_tag', 'content_type', 'from_user']
        target = ['toll_fraud']
        X = data[features].drop(dropped_features, axis=1)
        y = data[target]
        return X, y

我将数据集分为训练和测试数据.最初，这两个子集具有相同数量的特征，并且将我的特征转换为分类后，我的变量数量发生了变化，因此无法处理模型.

I split my dataset into training and test data. Initially both subsets have the same number of features, and after converting my features to categorical my number of variables change, hence is impossible to process model.

在create_dummies之前:

Before create_dummies:

1665 10
555 10

create_dummies之后:

After create_dummies:

1665 1564
555 765

我在这里创建假人:

def create_dummies(self, data, cat_vars, cat_types):
        """Processes categorical data into dummy vars."""

        cat_data = data[cat_vars].values
        for i in range(len(cat_vars)):
            bins = LabelBinarizer().fit_transform(cat_data[:, 0].astype(cat_types[i]))
            cat_data = np.delete(cat_data, 0, axis=1)
            cat_data = np.column_stack((cat_data, bins))
        return cat_data


def preproc(self):
        """Executes the full preprocessing pipeline."""

        # Import Data & Split.
        X_train_, y_train, X_valid_, y_valid = self.import_and_split_data()
        # Fill NAs.
        X_train, X_valid = self.fix_na(X_train_), self.fix_na(X_valid_)
        # Preproc Categorical Vars
        cat_vars = ['ruri_user',
                    'from_domain',
                    'to_user',
                    'contact_user',
                    'user_agent',
                    'source_ip',
                    'contact_ip']

        cat_types = ['str', 'str', 'str', 'str', 'str', 'str', 'str']
        print 'Before create_dummies'
        print X_train.shape[0], X_train.shape[1]
        print X_valid.shape[0], X_valid.shape[1]

        X_train_cat, X_valid_cat = self.create_dummies(X_train, cat_vars, cat_types), self.create_dummies(X_valid,
                                                                                                          cat_vars,
                                                                                                          cat_types)

        print 'After create_dummies'
        print X_train_cat.shape[0], X_train_cat.shape[1]
        print X_valid_cat.shape[0], X_valid_cat.shape[1]

        X_train, X_valid = X_train_cat, X_valid_cat
        print 'After assignment'
        print X_train.shape[0], X_train.shape[1]
        print X_valid.shape[0], X_valid.shape[1]

        return X_train.astype('float32'), y_train.values, X_valid.astype('float32'), y_valid.values

完整代码此处

数据集此处

来自此处

Original Code from here

为分类数据创建假人 [英] Create dummies for categorical data

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

为分类数据创建假人 [英] Create dummies for categorical data

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭