Imbalanced classes in a multi-class classification problem
Question
I'm trying to use TensorFlow's DNNClassifier for my multi-class (softmax) classification problem with 4 different classes. I have an imbalanced dataset with the following distribution:
- Class 0: 14.8%
- Class 1: 35.2%
- Class 2: 27.8%
- Class 3: 22.2%
How do I assign the weights for the DNNClassifier's weight_column for each class? I know how to code this, but I am wondering what values I should give each class.
Recommended answer
There are several ways to build weights for an imbalanced classification problem. One of the most common is to derive the sample weights directly from the class counts in the training set, which sklearn computes easily. Its 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
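To make "inversely proportional to class frequencies" concrete for the distribution in the question, here is a minimal sketch. It assumes the formula that sklearn documents for 'balanced' mode, n_samples / (n_classes * count), which reduces to 1 / (n_classes * share) when only the class shares are known. The helper name balanced_class_weights is made up for this illustration.

```python
# Class shares taken from the question (fractions of the dataset).
class_share = {0: 0.148, 1: 0.352, 2: 0.278, 3: 0.222}


def balanced_class_weights(share):
    """Return one weight per class: 1 / (n_classes * share).

    This is the 'balanced' heuristic n_samples / (n_classes * count),
    rewritten in terms of class shares instead of raw counts.
    """
    n_classes = len(share)
    return {label: 1.0 / (n_classes * frac) for label, frac in share.items()}


class_weight = balanced_class_weights(class_share)
for label, w in class_weight.items():
    print(f"class {label}: weight {w:.3f}")
```

Under these assumptions the weights come out at roughly 1.69, 0.71, 0.90 and 1.13 for classes 0 to 3, so the rarest class (0) gets the largest weight and every class contributes the same total weight to the loss.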
The DNNClassifier example below incorporates the compute_sample_weight method into the fitting step. For the label distribution, I used the one given in the question.
import numpy as np
import pandas as pd
import tensorflow as tf  # needs a TensorFlow release that still ships tf.estimator
from sklearn.utils.class_weight import compute_sample_weight

train_size = 1000
test_size = 200
columns = 30

## create train data (random features, labels drawn with the question's distribution)
y_train = np.random.choice([0, 1, 2, 3], train_size, p=[0.15, 0.35, 0.28, 0.22])
x_train = pd.DataFrame(np.random.uniform(0, 1, (train_size, columns)).astype('float32'))
x_train.columns = [str(i) for i in range(columns)]

## create train weights: one weight per row, inversely proportional to its class frequency
weight = compute_sample_weight(class_weight='balanced', y=y_train)
x_train['weight'] = weight.astype('float32')

## create test data
y_test = np.random.choice([0, 1, 2, 3], test_size, p=[0.15, 0.35, 0.28, 0.22])
x_test = pd.DataFrame(np.random.uniform(0, 1, (test_size, columns)).astype('float32'))
x_test.columns = [str(i) for i in range(columns)]

## create test weights
x_test['weight'] = np.ones(len(y_test)).astype('float32')  ## set them all to 1

## utility functions to pass data to DNNClassifier
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_train), y_train))
    dataset = dataset.shuffle(1000).repeat().batch(10)
    return dataset

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_test), y_test))
    # no shuffle/repeat here, so evaluation makes exactly one pass over the test set
    return dataset.batch(10)

## define DNNClassifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column(str(i), shape=[1]) for i in range(columns)],
    weight_column=tf.feature_column.numeric_column('weight'),
    hidden_units=[10],
    n_classes=4,
)

## train DNNClassifier
classifier.train(input_fn=train_input_fn, steps=100)

## make evaluation on the whole test set (steps omitted: run until the input is exhausted)
eval_results = classifier.evaluate(input_fn=eval_input_fn)
Because the weights are built as a function of the target, we have to set them to 1 in the test data: at prediction time the labels are unknown, so no label-based weight can be computed.
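The train/test asymmetry can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names balanced_sample_weights and weight_column are hypothetical, and the label counts (16, 38, 30, 24) are made-up values chosen to reproduce the percentages in the question.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Per-sample weights n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[c] for c in labels]


def weight_column(n_rows, labels=None):
    """Label-based weights when labels exist, otherwise a constant 1.0 per row."""
    if labels is None:
        return [1.0] * n_rows
    return balanced_sample_weights(labels)


# Hypothetical training labels: 16, 38, 30 and 24 rows of classes 0..3.
y_train = [0] * 16 + [1] * 38 + [2] * 30 + [3] * 24

train_w = weight_column(len(y_train), y_train)  # depends on the labels
test_w = weight_column(5)                       # no labels, so all ones
```

With weights of this form, each class carries the same total weight in training, and the weights sum to the number of training rows, so the overall scale of the loss is unchanged.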