scikit 0.14 multi label metrics

Problem description

I just installed scikit 0.14 so that I could explore the multi-label metrics improvements. I got some positive results with the Hamming loss metric and the classification report, but I was not able to get the confusion matrix to work. Also, with the classification report I was unable to pass the label array and have the label names printed in the report. The code is below. Am I doing something wrong, or is this still in development?

import numpy as np
import pandas as pd
import random

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

target_names = ['New York','London', 'DC']

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york",
                    "DC is the nations capital",
                    "DC the home of the beltway",
                    "president obama lives in Washington",
                    "The washington monument in is Washington DC"])

y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[1,0],[1,0],[2],[2],[2],[2]]


X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new ybrk. enjoy it here and london too',
                   'What city does the washington redskins live in?'])
y_test = [[0],[1],[0,1],[2]]                   

classifier = Pipeline([
                       ('vectorizer', CountVectorizer(stop_words='english',
                             ngram_range=(1,3),
                             max_df = 1.0,
                             min_df = 0.1,
                             analyzer='word')),
                       ('tfidf', TfidfTransformer()),
                       ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)

predicted = classifier.predict(X_test)

print predicted


for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))



from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import hamming_loss



hl = hamming_loss(y_test, predicted, target_names)
print " "
print " "
print "---------------------------------------------------------"
print "HAMMING LOSS"
print " "
print hl

print " "
print " "
print "---------------------------------------------------------"
print "CONFUSION MATRIX"
print " "
cm = confusion_matrix(y_test, predicted)   
print cm

print " "
print " "
print "---------------------------------------------------------"
print "CLASSIFICATION REPORT"
print " "
print classification_report(y_test, predicted)

Solution

Multiclass and multilabel metric capabilities seem to have been improved in version 0.14, published on August 14, 2013: scikit-learn.org/stable/whats_new.html

Also, issue 558 seems to address some of this as well and was probably included in 0.14, but I have not yet confirmed this: https://github.com/scikit-learn/scikit-learn/issues/558
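
To make the metrics in the question concrete, here is a minimal sketch that computes the Hamming loss and one 2x2 confusion table per label by hand, without scikit-learn. A single confusion matrix assumes each sample carries exactly one label, so for multi-label data a per-label table is the usual substitute. The helper names (to_indicator, hamming_loss_manual, per_label_confusion) are made up for illustration, and y_pred is a hypothetical prediction, not the actual output of the classifier in the question.

```python
# Illustrative sketch only: multi-label metrics computed by hand.
# y_pred is a HYPOTHETICAL prediction, not real classifier output.

target_names = ['New York', 'London', 'DC']

y_true = [[0], [1], [0, 1], [2]]  # y_test from the question
y_pred = [[0], [1], [1], [2]]     # assumed: misses 'New York' on sample 3


def to_indicator(label_lists, n_labels):
    """Turn lists of label indices into rows of 0/1 flags, one per label."""
    return [[1 if k in labels else 0 for k in range(n_labels)]
            for labels in label_lists]


def hamming_loss_manual(true_rows, pred_rows):
    """Fraction of (sample, label) cells where truth and prediction differ."""
    wrong = sum(t != p
                for t_row, p_row in zip(true_rows, pred_rows)
                for t, p in zip(t_row, p_row))
    return wrong / float(len(true_rows) * len(true_rows[0]))


def per_label_confusion(true_rows, pred_rows):
    """Return one table [[tn, fp], [fn, tp]] for each label column."""
    tables = []
    for k in range(len(true_rows[0])):
        tn = fp = fn = tp = 0
        for t_row, p_row in zip(true_rows, pred_rows):
            t, p = t_row[k], p_row[k]
            if t and p:
                tp += 1
            elif t and not p:
                fn += 1
            elif p and not t:
                fp += 1
            else:
                tn += 1
        tables.append([[tn, fp], [fn, tp]])
    return tables


true_rows = to_indicator(y_true, len(target_names))
pred_rows = to_indicator(y_pred, len(target_names))

loss = hamming_loss_manual(true_rows, pred_rows)
tables = per_label_confusion(true_rows, pred_rows)

print(loss)
for name, table in zip(target_names, tables):
    print(name, table)
```

With these assumed predictions, one of the twelve (sample, label) cells is wrong, so the loss is 1/12. As far as I know, later scikit-learn releases ship a multilabel_confusion_matrix function in sklearn.metrics that returns per-label tables in the same [[tn, fp], [fn, tp]] layout, and classification_report accepts the label names through its target_names keyword argument; whether either behaves this way in 0.14 is not confirmed here.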
