Create dummies for categorical data
Problem description
I'm trying to build a binary classifier, and most of my variables are categorical, so I want to convert the categorical data into dummy variables. I have the following dataset:
ruri object
ruri_user object
ruri_domain object
from_user object
from_domain object
from_tag object
to_user object
contact_user object
callid object
content_type object
user_agent object
source_ip object
source_port int64
destination_port int64
contact_ip object
contact_port int64
toll_fraud int64
I will use only 10 of the 16 features:
def select_features(self, data):
    """Selects the features that we'll use in the model. Drops unused features"""
    features = ['ruri',
                'ruri_user',
                'ruri_domain',
                'from_user',
                'from_domain',
                'from_tag',
                'to_user',
                'contact_user',
                'callid',
                'content_type',
                'user_agent',
                'source_ip',
                'source_port',
                'destination_port',
                'contact_ip',
                'contact_port']
    dropped_features = ['ruri', 'ruri_domain', 'callid', 'from_tag', 'content_type', 'from_user']
    target = ['toll_fraud']
    X = data[features].drop(dropped_features, axis=1)
    y = data[target]
    return X, y
I split my dataset into training and test data. Initially both subsets have the same number of features, but after converting the categorical features to dummy variables the number of columns differs between the two subsets, so the model cannot be fitted.
Before create_dummies:
1665 10
555 10
After create_dummies:
1665 1564
555 765
This is where I create the dummies:
import numpy as np
from sklearn.preprocessing import LabelBinarizer

def create_dummies(self, data, cat_vars, cat_types):
    """Processes categorical data into dummy vars."""
    cat_data = data[cat_vars].values
    for i in range(len(cat_vars)):
        # Binarize the first remaining column, drop it, and append its dummy columns.
        bins = LabelBinarizer().fit_transform(cat_data[:, 0].astype(cat_types[i]))
        cat_data = np.delete(cat_data, 0, axis=1)
        cat_data = np.column_stack((cat_data, bins))
    return cat_data

def preproc(self):
    """Executes the full preprocessing pipeline."""
    # Import Data & Split.
    X_train_, y_train, X_valid_, y_valid = self.import_and_split_data()
    # Fill NAs.
    X_train, X_valid = self.fix_na(X_train_), self.fix_na(X_valid_)
    # Preproc Categorical Vars
    cat_vars = ['ruri_user',
                'from_domain',
                'to_user',
                'contact_user',
                'user_agent',
                'source_ip',
                'contact_ip']
    cat_types = ['str', 'str', 'str', 'str', 'str', 'str', 'str']
    print 'Before create_dummies'
    print X_train.shape[0], X_train.shape[1]
    print X_valid.shape[0], X_valid.shape[1]
    # Each subset is encoded on its own here, with a freshly fitted LabelBinarizer.
    X_train_cat = self.create_dummies(X_train, cat_vars, cat_types)
    X_valid_cat = self.create_dummies(X_valid, cat_vars, cat_types)
    print 'After create_dummies'
    print X_train_cat.shape[0], X_train_cat.shape[1]
    print X_valid_cat.shape[0], X_valid_cat.shape[1]
    X_train, X_valid = X_train_cat, X_valid_cat
    print 'After assignment'
    print X_train.shape[0], X_train.shape[1]
    print X_valid.shape[0], X_valid.shape[1]
    return X_train.astype('float32'), y_train.values, X_valid.astype('float32'), y_valid.values
Full code here
Dataset here
Original code from here
Recommended answer
When you split your dataframe into train and test sets, some categories end up in the train set but not in the test set. Each distinct category becomes its own dummy column, so encoding the two subsets separately produces a different number of columns for each. That is why you are getting different shapes for your train and test sets!
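The effect can be reproduced on a tiny example. This is only a sketch: the toy values are hypothetical, and it uses `pandas.get_dummies` rather than the `LabelBinarizer` loop from the question, but the cause is the same, because each subset is encoded using only the categories it happens to contain.

```python
import pandas as pd

# Hypothetical toy data: one categorical column, already split by hand.
train = pd.DataFrame({"user_agent": ["a", "b", "c", "a"]})
valid = pd.DataFrame({"user_agent": ["a", "d"]})

# Encoding each subset separately: one dummy column per category seen in that subset.
train_dummies = pd.get_dummies(train, columns=["user_agent"])
valid_dummies = pd.get_dummies(valid, columns=["user_agent"])

print(train_dummies.shape)  # (4, 3): categories a, b, c
print(valid_dummies.shape)  # (2, 2): categories a, d
```

Besides the differing column counts, the columns that do exist no longer line up: the second column means "b" in the train matrix and "d" in the validation matrix.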
As suggested in the comments, you need to do all of the preprocessing before splitting into train and test sets. There is no need to preprocess the train and test sets separately.
That way all possible categories get encoded, and then you can split.
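A minimal sketch of that order of operations, assuming the data fits in a pandas DataFrame. The toy values, the 33% validation fraction, and the use of `pandas.get_dummies` and `DataFrame.sample` are illustrative assumptions, not the code from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
data = pd.DataFrame({
    "user_agent": ["a", "b", "c", "a", "d", "b"],
    "source_port": [5060, 5061, 5060, 5062, 5060, 5061],
    "toll_fraud": [0, 1, 0, 0, 1, 0],
})

# 1. Encode the categorical columns once, on the full dataset.
X = pd.get_dummies(data.drop(columns=["toll_fraud"]), columns=["user_agent"])
y = data["toll_fraud"]

# 2. Split afterwards; both parts inherit the same dummy columns.
X_valid = X.sample(frac=0.33, random_state=0)
X_train = X.drop(X_valid.index)
y_train, y_valid = y.loc[X_train.index], y.loc[X_valid.index]

print(X_train.shape[1] == X_valid.shape[1])  # True
```

Because the dummy columns are created before the split, the train and validation matrices always have the same width and column order. In a scikit-learn workflow, `sklearn.model_selection.train_test_split` could take the place of the `sample`/`drop` step.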