使用 sklearn 进行多类、多标签、序数分类 [英] Multi-class, multi-label, ordinal classification with sklearn

查看:61
本文介绍了使用 sklearn 进行多类、多标签、序数分类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道如何使用 sklearn 运行多类、多标签、序数分类.我想预测目标群体的排名,范围从某个位置最流行的人群 (1) 到最不流行的人群 (7).我似乎无法做到正确.你能帮我吗?

I was wondering how to run a multi-class, multi-label, ordinal classification with sklearn. I want to predict a ranking of target groups, ranging from the one that is most prevalant at a certain location (1) to the one that is least prevalent (7). I don't seem to be able to get it right. Could you please help me out?


# Random Forest Classification

# Import
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Import dataset
dataset = pd.read_excel('alle_probs_edit.v2.xlsx')
X = dataset.iloc[:,4:-1].values
Y = dataset.iloc[:,-1].values

# Split in Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42 )

# Scaling the features (alle Variablen auf eine gleiche Ebene), necessary depend on the choosen method
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Creat classifier
classifier =  RandomForestClassifier(criterion = 'entropy')

# Choose some parameter combinations to try
parameters = {'bootstrap': [True, False],
 'max_depth': [50],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 3, 4],
 'min_samples_split': [9, 10, 11, 12, 13],
 'n_estimators': [500,1000,1500]}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(classifier, parameters, scoring=acc_scorer, cv = 3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, Y_train)

# Set the classifier to the best combination of parameters
classifier = grid_obj.best_estimator_

# Fit the best algorithm to the data
classifier.fit(X_train, Y_train)

#Prediction the Test data
Y_pred = classifier.predict(X_test)

#Confusion Matrix
cm = pd.DataFrame(confusion_matrix(Y_test, Y_pred))

#Accuracy
accuracy1 = accuracy_score(Y_test, Y_pred)
print("Accuracy1: %.2f%%" % (accuracy1 * 100.0))

# k-Fold Class Validation
accuracy1 = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 10)
kfold = accuracy1.mean()
accuracy1.std()

推荐答案

这可能不是您正在寻找的准确答案,这篇文章概述了一种技术如下:

This may not be the precise answer you're looking for, this article outlines a technique as follows:

我们可以通过将 k 类序数回归问题转换为 k-1 二元分类问题来利用有序类值,我们将序数属性 A* 与序数值 V1, V2, V3, ... Vk 转换为 k-1 个二元属性,一个用于每个原始属性的前 k-1 个值.第 i 个二元属性表示测试 A* > Vi

We can take advantage of the ordered class value by transforming a k-class ordinal regression problem to a k-1 binary classification problem, we convert an ordinal attribute A* with ordinal value V1, V2, V3, … Vk into k-1 binary attributes, one for each of the original attribute’s first k − 1 values. The ith binary attribute represents the test A* > Vi

本质上,聚合多个二元分类器(预测目标> 1,目标> 2,目标> 3,目标> 4)能够预测目标是1、2、3、4还是5.作者创建了一个OrdinalClassifier 类,在 Python 字典中存储多个二元分类器.

Essentially, aggregate multiple binary classifiers (predict target > 1, target > 2, target > 3, target > 4) to be able to predict whether a target is 1, 2, 3, 4 or 5. The author creates an OrdinalClassifier class that stores multiple binary classifiers in a Python dictionary.

class OrdinalClassifier():

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {k:self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i,y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[y][:,1])
            elif y in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                 predicted.append(clfs_predict[y-1][:,1] - clfs_predict[y][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[y-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

该技术源自一种简单的序数分类方法

这篇关于使用 sklearn 进行多类、多标签、序数分类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆