Getting the accuracy for multi-label prediction in scikit-learn


Question


In a multilabel classification setting, sklearn.metrics.accuracy_score only computes the subset accuracy (3): i.e. the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
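A minimal NumPy sketch of the subset accuracy described above, using the example arrays from the answer further down. It only illustrates the definition (a sample is correct when its whole label row matches) and is not meant to reproduce scikit-learn's implementation; the variable names are illustrative.

```python
import numpy as np

y_true = np.array([[0, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 1]])

y_pred = np.array([[0, 1, 1],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 0, 0]])

# A sample counts as correct only if every label in its row matches.
exact_match = np.all(y_true == y_pred, axis=1)

# Fraction of samples whose predicted label set equals the true label set.
subset_accuracy = exact_match.mean()
print(subset_accuracy)  # 0.25: only the second sample matches exactly
```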

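This way of computing the accuracy is sometimes called, perhaps less ambiguously, the exact match ratio (1):

$$ \text{ExactMatchRatio} = \frac{1}{|D|} \sum_{i=1}^{|D|} I(x_i = y_i), $$

where \(|D|\) is the number of samples, \(y_i\) is the true label set of sample \(i\), \(x_i\) is the predicted label set, and \(I\) is the indicator function.

Is there any way to get the other typical way to compute the accuracy in scikit-learn, namely

$$ \text{Accuracy} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|y_i \cap x_i|}{|y_i \cup x_i|} $$

(as defined in (1) and (2), and less ambiguously referred to as the Hamming score (4), since it is closely related to the Hamming loss, or as label-based accuracy)?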
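To make the relation to the Hamming loss concrete, here is a small NumPy sketch on the example arrays from the answer below. It assumes binary indicator arrays; the variable names are illustrative. The Hamming loss counts wrong label slots, so its complement gives credit for labels that are correctly predicted as absent, whereas the intersection-over-union score asked about here does not.

```python
import numpy as np

y_true = np.array([[0, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1],
                   [0, 0, 1]])

y_pred = np.array([[0, 1, 1],
                   [0, 1, 1],
                   [0, 1, 0],
                   [0, 0, 0]])

# Hamming loss: fraction of individual label slots that are wrong
# (1 + 0 + 3 + 1 = 5 wrong slots out of 4 samples * 3 labels = 12).
hamming_loss = np.mean(y_true != y_pred)

# Its complement is the fraction of label slots predicted correctly,
# including labels correctly predicted as absent (7 out of 12).
slot_accuracy = 1.0 - hamming_loss

print(hamming_loss, slot_accuracy)
```

On these arrays the slot-wise accuracy is 7/12, while the intersection-over-union score computed in the answer is 0.375, so the two quantities are related but not interchangeable.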


(1) Sorower, Mohammad S. "A literature survey on algorithms for multi-label learning." Oregon State University, Corvallis (2010).

(2) Tsoumakas, Grigorios, and Ioannis Katakis. "Multi-label classification: An overview." Dept. of Informatics, Aristotle University of Thessaloniki, Greece (2006).

(3) Ghamrawi, Nadia, and Andrew McCallum. "Collective multi-label classification." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.

(4) Godbole, Shantanu, and Sunita Sarawagi. "Discriminative methods for multi-labeled classification." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2004. 22-30.

Solution

You can write your own version. Here is an example that does not take sample weights or normalization into account.

import numpy as np

y_true = np.array([[0,1,0],
                   [0,1,1],
                   [1,0,1],
                   [0,0,1]])

y_pred = np.array([[0,1,1],
                   [0,1,1],
                   [0,1,0],
                   [0,0,0]])

def hamming_score(y_true, y_pred, normalize=True, sample_weight=None):
    '''
    Compute the Hamming score (a.k.a. label-based accuracy) for the multi-label case
    http://stackoverflow.com/q/32239577/395857

    Note: normalize and sample_weight are accepted but ignored in this example.
    '''
    acc_list = []
    for i in range(y_true.shape[0]):
        # Indices of the labels that are set for sample i
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # No labels expected and none predicted: count as a perfect match
            tmp_a = 1
        else:
            # Size of the intersection over size of the union
            tmp_a = len(set_true.intersection(set_pred)) / \
                    float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

if __name__ == "__main__":
    print('Hamming score: {0}'.format(hamming_score(y_true, y_pred))) # 0.375 (= (0.5+1+0+0)/4)

    # For comparison's sake:
    import sklearn.metrics

    # Subset accuracy
    # 0.25 (= (0+1+0+0) / 4) --> 1 if the prediction for a sample fully matches the gold labels, 0 otherwise.
    print('Subset accuracy: {0}'.format(sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)))

    # Hamming loss (smaller is better)
    # $$ \text{HammingLoss}(x_i, y_i) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{xor(x_i, y_i)}{|L|}, $$
    # where
    #  - \(|D|\) is the number of samples
    #  - \(|L|\) is the number of labels
    #  - \(y_i\) is the ground truth
    #  - \(x_i\) is the prediction.
    # 0.416666666667 (= (1+0+3+1) / (3*4) )
    print('Hamming loss: {0}'.format(sklearn.metrics.hamming_loss(y_true, y_pred))) 

Outputs:

Hamming score: 0.375
Subset accuracy: 0.25
Hamming loss: 0.416666666667
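The loop above can also be written without iterating over samples. The following is a sketch under stated assumptions, not part of the original answer: the name `hamming_score_vectorized` is hypothetical, the inputs are assumed to be binary indicator arrays of the same shape, and `sample_weight` is handled with a plain weighted average, which is one reasonable reading of what weighting could mean here.

```python
import numpy as np

def hamming_score_vectorized(y_true, y_pred, sample_weight=None):
    """Average per-sample intersection-over-union of the label sets.

    Hypothetical helper: assumes binary indicator arrays of equal shape.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    # Per-sample sizes of the intersection and union of the label sets.
    intersection = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)

    # Samples with no true and no predicted labels score 1,
    # matching the special case in the loop version.
    scores = np.ones(len(union), dtype=float)
    nonempty = union > 0
    scores[nonempty] = intersection[nonempty] / union[nonempty]

    # With weights=None this is a plain mean over samples.
    return np.average(scores, weights=sample_weight)


y_true = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 1], [0, 1, 0], [0, 0, 0]])

print(hamming_score_vectorized(y_true, y_pred))  # 0.375, same as the loop version
```

Passing, for example, `sample_weight=[1, 1, 0, 0]` averages only the first two samples and gives 0.75.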
