Create dummies for categorical data

Problem description

I'm trying to build a binary classifier, and most of my variables are categorical, so I want to convert the categorical data into dummy variables. My dataset has the following columns:

ruri                object
ruri_user           object
ruri_domain         object
from_user           object
from_domain         object
from_tag            object
to_user             object
contact_user        object
callid              object
content_type        object
user_agent          object
source_ip           object
source_port          int64
destination_port     int64
contact_ip          object
contact_port         int64
toll_fraud           int64

I will use only 10 of the 16 features:

def select_features(self, data):
        """Selects the features that we'll use in the model. Drops unused features"""
        features = ['ruri', 
                    'ruri_user', 
                    'ruri_domain', 
                    'from_user', 
                    'from_domain', 
                    'from_tag', 
                    'to_user',
                    'contact_user', 
                    'callid', 
                    'content_type', 
                    'user_agent', 
                    'source_ip', 
                    'source_port',
                    'destination_port', 
                    'contact_ip', 
                    'contact_port']
        dropped_features = ['ruri', 'ruri_domain', 'callid', 'from_tag', 'content_type', 'from_user']
        target = ['toll_fraud']
        X = data[features].drop(dropped_features, axis=1)
        y = data[target]
        return X, y

I split my dataset into training and test data. Initially both subsets have the same number of features, but after I convert the categorical features to dummies the number of columns differs between them, so the model cannot process the data.

Before create_dummies:

1665 10
555 10

After create_dummies:

1665 1564
555 765

This is where I create the dummies:

def create_dummies(self, data, cat_vars, cat_types):
        """Processes categorical data into dummy vars."""

        cat_data = data[cat_vars].values
        for i in range(len(cat_vars)):
            bins = LabelBinarizer().fit_transform(cat_data[:, 0].astype(cat_types[i]))
            cat_data = np.delete(cat_data, 0, axis=1)
            cat_data = np.column_stack((cat_data, bins))
        return cat_data


def preproc(self):
        """Executes the full preprocessing pipeline."""

        # Import Data & Split.
        X_train_, y_train, X_valid_, y_valid = self.import_and_split_data()
        # Fill NAs.
        X_train, X_valid = self.fix_na(X_train_), self.fix_na(X_valid_)
        # Preproc Categorical Vars
        cat_vars = ['ruri_user',
                    'from_domain',
                    'to_user',
                    'contact_user',
                    'user_agent',
                    'source_ip',
                    'contact_ip']

        cat_types = ['str', 'str', 'str', 'str', 'str', 'str', 'str']
        print('Before create_dummies')
        print(X_train.shape[0], X_train.shape[1])
        print(X_valid.shape[0], X_valid.shape[1])

        X_train_cat = self.create_dummies(X_train, cat_vars, cat_types)
        X_valid_cat = self.create_dummies(X_valid, cat_vars, cat_types)

        print('After create_dummies')
        print(X_train_cat.shape[0], X_train_cat.shape[1])
        print(X_valid_cat.shape[0], X_valid_cat.shape[1])

        X_train, X_valid = X_train_cat, X_valid_cat
        print('After assignment')
        print(X_train.shape[0], X_train.shape[1])
        print(X_valid.shape[0], X_valid.shape[1])

        return X_train.astype('float32'), y_train.values, X_valid.astype('float32'), y_valid.values

Full code here

Dataset here

Original Code from here

Recommended answer

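When you split your dataframe into training and test sets, some categories end up in the training set but not in the test set, and vice versa. Each subset is then encoded using only the categories it contains, which is why your training and test sets end up with different shapes.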
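
The effect is easy to reproduce on a small scale. The sketch below is not the asker's code: it uses pandas' `get_dummies` in place of the `LabelBinarizer` loop, and the two toy frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical column, already split into two subsets.
train = pd.DataFrame({"user_agent": ["a", "b", "c", "a"]})
valid = pd.DataFrame({"user_agent": ["a", "d"]})

# Encoding each subset on its own yields one column per category
# that happens to appear in that subset.
train_dummies = pd.get_dummies(train, columns=["user_agent"])
valid_dummies = pd.get_dummies(valid, columns=["user_agent"])

print(train_dummies.shape)  # (4, 3): categories a, b, c
print(valid_dummies.shape)  # (2, 2): categories a, d

# Columns present in one subset but not in the other.
mismatch = sorted(set(train_dummies.columns) ^ set(valid_dummies.columns))
print(mismatch)
```

Even where the column counts happen to match, the columns themselves may stand for different categories, so equal shapes alone would not make the two matrices comparable.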

As suggested in the comments, do all of the preprocessing before splitting into training and test sets. There is no need to preprocess the two sets separately.

That way every possible category is encoded first, and you can split afterwards.
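
A minimal sketch of that order of operations, under a few assumptions: the toy frame and its values are invented (only the column names are borrowed from the question), `get_dummies` stands in for the asker's `create_dummies`, and the split uses plain pandas sampling as a stand-in for whatever `import_and_split_data` does.

```python
import pandas as pd

# Hypothetical toy frame; column names come from the question, values are invented.
data = pd.DataFrame({
    "user_agent": ["a", "b", "c", "a", "d", "b", "c", "a"],
    "source_port": [5060, 5061, 5060, 5062, 5060, 5061, 5060, 5062],
    "toll_fraud": [0, 1, 0, 0, 1, 0, 1, 0],
})

# 1. Encode the categorical column on the full dataset,
#    so every category gets its own dummy column.
X = pd.get_dummies(data.drop(columns=["toll_fraud"]), columns=["user_agent"])
y = data["toll_fraud"]

# 2. Split afterwards; both subsets inherit the same columns.
X_train = X.sample(frac=0.75, random_state=0)
X_valid = X.drop(X_train.index)
y_train, y_valid = y.loc[X_train.index], y.loc[X_valid.index]

print(X_train.shape, X_valid.shape)
```

Because the split happens after encoding, both subsets share the same dummy columns in the same order, whichever rows land in each. scikit-learn's `train_test_split` would serve equally well for step 2.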
