Imbalanced classes in multi-class classification problem


Question

I'm trying to use TensorFlow's DNNClassifier for a multi-class (softmax) classification problem with 4 classes. My dataset is imbalanced, with the following distribution:

  • Class 0: 14.8%
  • Class 1: 35.2%
  • Class 2: 27.8%
  • Class 3: 22.2%

How do I assign per-class weights through the DNNClassifier's weight_column? I know how to write the code, but I'm not sure what value each class should get.

Recommended answer

There are several ways to build weights for an imbalanced classification problem. One of the most common is to estimate sample weights directly from the class counts in the training set, which sklearn makes easy to compute. Its 'balanced' mode uses the values of y to set weights inversely proportional to the class frequencies.
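To make the 'balanced' mode concrete, here is a minimal sketch applied to the distribution from the question. The label vector is hypothetical (1000 samples with counts chosen to match the stated percentages). The manual formula n_samples / (n_classes * class_count) is how sklearn documents the 'balanced' heuristic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Hypothetical labels: 1000 samples matching the question's class distribution.
counts = np.array([148, 352, 278, 222])
y = np.repeat(np.arange(4), counts)

# 'balanced' heuristic computed by hand: n_samples / (n_classes * class_count).
manual_weights = len(y) / (len(counts) * counts)

# The same values from sklearn, per class and per sample.
class_weights = compute_class_weight(class_weight='balanced', classes=np.arange(4), y=y)
sample_weights = compute_sample_weight(class_weight='balanced', y=y)

for label, weight in enumerate(class_weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these weights every class contributes the same total weight (n_samples / n_classes), so the rarest class (class 0) gets the largest per-sample weight and the most frequent class (class 1) the smallest.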

The example below incorporates the compute_sample_weight method into fitting a DNNClassifier. For the label distribution, I used the one given in the question.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.utils.class_weight import compute_sample_weight

train_size = 1000
test_size = 200
columns = 30

## create train data
y_train = np.random.choice([0,1,2,3], train_size, p=[0.15, 0.35, 0.28, 0.22])
x_train = pd.DataFrame(np.random.uniform(0,1, (train_size,columns)).astype('float32'))
x_train.columns = [str(i) for i in range(columns)]

## create train weights
weight = compute_sample_weight(class_weight='balanced', y=y_train)
x_train['weight'] = weight.astype('float32')

## create test data
y_test = np.random.choice([0,1,2,3], test_size, p=[0.15, 0.35, 0.28, 0.22])
x_test = pd.DataFrame(np.random.uniform(0,1, (test_size,columns)).astype('float32'))
x_test.columns = [str(i) for i in range(columns)]

## create test weights
x_test['weight'] = np.ones(len(y_test)).astype('float32') ## set them all to 1

## utility functions to pass data to DNNClassifier
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_train), y_train))
    dataset = dataset.shuffle(1000).repeat().batch(10)
    return dataset

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_test), y_test))
    ## no shuffle/repeat for evaluation: a single ordered pass over the test set
    return dataset.batch(10)

## define DNNClassifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column(str(i), shape=[1]) for i in range(columns)],
    weight_column = tf.feature_column.numeric_column('weight'),
    hidden_units=[10],
    n_classes=4,
)

## train DNNClassifier
classifier.train(input_fn=train_input_fn, steps=100)

## make evaluation over the whole test set (test_size // 10 batches of 10 rows)
eval_results = classifier.evaluate(input_fn=eval_input_fn, steps=test_size // 10)

Because the weights are computed from the target, we have to set them to 1 in the test data, where the labels are treated as unknown.
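As a rough illustration of what the weight column does, the NumPy sketch below multiplies each example's cross-entropy loss by its weight. The data are random and purely hypothetical. Averaging over the number of examples is an assumption made here for simplicity; the exact reduction inside DNNClassifier may differ. The point is only that a weight of 1 for every row gives back the ordinary unweighted loss.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Hypothetical batch: 8 examples, 4 classes, random logits.
labels = rng.choice(4, size=8, p=[0.15, 0.35, 0.28, 0.22])
logits = rng.normal(size=(8, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Per-example cross-entropy: -log(probability assigned to the true class).
losses = -np.log(probs[np.arange(len(labels)), labels])

# Training-style weighting: each example's loss is scaled by its weight.
train_weights = compute_sample_weight(class_weight='balanced', y=labels)
weighted_loss = (train_weights * losses).sum() / len(labels)

# Test-style weighting: all ones, so the result is the plain mean loss.
unit_weights = np.ones(len(labels))
unit_loss = (unit_weights * losses).sum() / len(labels)

print(f"weighted: {weighted_loss:.4f}  unweighted: {unit_loss:.4f}")
```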
