Subsampling + classifying using scikit-learn

Problem description

I am using scikit-learn for a binary classification task, and I have:

Class 0: 200 observations
Class 1: 50 observations

Because the data is unbalanced, I want to take a random subsample of the majority class with the same number of observations as the minority class, and use the resulting dataset as input to the classifier. The process of subsampling and classifying can be repeated many times. I have the following code for the subsampling, written mainly with the help of Ami Tavory:

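import numpy as np
from sklearn.datasets import load_files

docs_train = load_files(rootdir, categories=categories, encoding='latin-1')

# load_files returns the documents as a list of strings; wrap them in an array so boolean masks work
X_train = np.array(docs_train.data, dtype=object)
y_train = docs_train.target

majority_x, majority_y = X_train[y_train == 0], y_train[y_train == 0]  # assuming that class 0 is the majority class
minority_x, minority_y = X_train[y_train == 1], y_train[y_train == 1]

# draw 50 random indices from the majority class (with replacement by default)
inds = np.random.choice(range(majority_x.shape[0]), 50)
majority_x = majority_x[inds]
majority_y = majority_y[inds]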

It works like a charm. However, once majority_x and majority_y have been processed, I want to replace the old set that represents class 0 in X_train, y_train with the new, smaller set, so that I can pass it to the classifier or the pipeline as follows:

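from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokens, binary=True)),
    ('classifier', SVC(C=1, kernel='linear')),
])

pipeline.fit(X_train, y_train)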

What I have done to solve this: since the resulting arrays are numpy arrays, and because I am new to this whole area and am trying very hard to learn, I tried to combine the two arrays with majority_x+minority_x to form the training data I want. That did not work; it gave errors that I am still trying to solve. But even if it did work, how can I keep the indices aligned so that majority_y and minority_y still match their samples?

Recommended answer

After processing majority_x and majority_y, you can merge your training sets with

X_train = np.concatenate((majority_x,minority_x))
y_train = np.concatenate((majority_y,minority_y))

Now X_train and y_train will first contain the chosen samples with y=0, followed by the samples with y=1.
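To make the merge concrete, here is a minimal sketch of the whole undersample-then-concatenate step. The documents and labels are hypothetical stand-ins (strings such as "doc 17"), not the asker's data, and the use of np.random.default_rng with a fixed seed is only an assumption to keep the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

# Hypothetical stand-in data: 200 documents of class 0 and 50 of class 1.
docs = np.array([f"doc {i}" for i in range(250)], dtype=object)
labels = np.array([0] * 200 + [1] * 50)

majority_x, majority_y = docs[labels == 0], labels[labels == 0]
minority_x, minority_y = docs[labels == 1], labels[labels == 1]

# Draw as many majority samples as there are minority samples.
inds = rng.choice(majority_x.shape[0], size=minority_x.shape[0], replace=False)
majority_x, majority_y = majority_x[inds], majority_y[inds]

# Concatenating x and y in the same order keeps each sample aligned with its label.
X_train = np.concatenate((majority_x, minority_x))
y_train = np.concatenate((majority_y, minority_y))

print(X_train.shape, y_train.shape)
```

Because both arrays are built in the same order, position i in X_train always corresponds to position i in y_train, so the pair can be handed to pipeline.fit as it is.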

An idea for your related question: choose the majority samples by creating a random permutation vector whose length equals the number of majority samples. Take the first 50 indices of that vector, then the next 50, and so on. Once you have gone through the whole vector, each sample will have been chosen exactly once. If you want more iterations, or the remaining part of the permutation vector is too short, you can fall back to random choice.
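The permutation idea could be sketched roughly as follows. The helper name balanced_index_batches is hypothetical, and the sizes (200 majority samples, batches of 50) are taken from the question; a leftover shorter than one batch is simply not yielded here, which is one possible way to handle it.

```python
import numpy as np


def balanced_index_batches(n_majority, batch_size, rng):
    """Yield disjoint index batches taken from one random permutation."""
    perm = rng.permutation(n_majority)
    # Only full batches are yielded; a shorter remainder is left to the caller.
    for start in range(0, n_majority - batch_size + 1, batch_size):
        yield perm[start:start + batch_size]


rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
batches = list(balanced_index_batches(200, 50, rng))

for inds in batches:
    # Each batch would index the majority class for one subsample-and-fit round,
    # e.g. majority_x[inds] and majority_y[inds].
    print(len(inds), len(np.unique(inds)))
```

With 200 majority samples and batches of 50, this gives four rounds in which every majority sample is used exactly once.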

As I mentioned in my comment, you might also want to pass replace=False to np.random.choice if you want to avoid drawing the same sample multiple times in one iteration.
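A small sketch of the difference, using the sizes from the question (200 majority samples, 50 drawn); the variable names are illustrative only.

```python
import numpy as np

n_majority, n_draw = 200, 50

# Default behaviour (replace=True): the same index can be drawn more than once.
with_repl = np.random.choice(n_majority, size=n_draw)

# replace=False: every drawn index is distinct.
without_repl = np.random.choice(n_majority, size=n_draw, replace=False)

print(len(np.unique(with_repl)))     # can be smaller than 50
print(len(np.unique(without_repl)))  # number of distinct indices without replacement
```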
