Output top 2 classes from a multiclass classification algorithm

Question

I am working on a multiclass classification problem for text, where I have many different classes (15+). I have trained a LinearSVC model (this model is just an example), but it outputs only the single class with the highest score. Is there a way to make the algorithm output the top two classes at the same time?

Sample code I am using:

from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Turn the raw text into n-gram counts
count_vect = CountVectorizer(max_df=.9, min_df=.002,
                             encoding='latin-1',
                             ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(df_upsampled['text'])
# Re-weight the counts with TF-IDF
tfidf_transformer = TfidfTransformer(sublinear_tf=True, norm='l2')
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, df_upsampled['reason'])
# X_test must be transformed with the same fitted count_vect and tfidf_transformer
y_pred = clf.predict(X_test)

Current output:

    source  user   time    text         reason
0   hi      neha    0      0:neha:hi       1
1   there   ram     1      1:ram:there     1
2   ball    neha    2      2:neha:ball     3
3   item    neha    3      3:neha:item     6
4   go there ram    4      4:ram:go there  7
5   kk       ram    5      5:ram:kk        1
6   hshs    neha    6      6:neha:hshs     2
7   ggsgs   neha    7      7:neha:ggsgs    15

Desired output:

    source  user   time    text         reason  reason2
0   hi      neha    0      0:neha:hi       1      2
1   there   ram     1      1:ram:there     1      6
2   ball    neha    2      2:neha:ball     3      7
3   item    neha    3      3:neha:item     6      4
4   go there ram    4      4:ram:go there  7      9
5   kk       ram    5      5:ram:kk        1      2
6   hshs    neha    6      6:neha:hshs     2      3
7   ggsgs   neha    7      7:neha:ggsgs    15     1

It is okay if I get the output in just one column, as I can split it into two columns.

Recommended answer

LinearSVC does not provide predict_proba, but it does provide decision_function, which gives the signed distance from the hyperplane. For a multiclass problem this is one score per class for each sample.

From the documentation:

decision_function(self, X):

Predict confidence scores for samples.

The confidence score for a sample is the signed distance of that sample to the hyperplane.
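
As a minimal sketch of what this means for a multiclass problem (the synthetic data and the variable names below are illustrative assumptions, not part of the question): with more than two classes, decision_function returns one score per class for each sample, and the class returned by predict is the one in the highest-scoring column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-class data, used only for illustration
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)

clf = LinearSVC(random_state=0).fit(X, y)

# One column of scores per class: shape is (n_samples, n_classes)
scores = clf.decision_function(X)
print(scores.shape)

# The highest-scoring column, mapped through classes_, matches predict()
best = clf.classes_[scores.argmax(axis=1)]
print(np.array_equal(best, clf.predict(X)))

# LinearSVC exposes no predict_proba method
print(hasattr(clf, "predict_proba"))
```

Because the scores are distances rather than probabilities, they are useful for ranking classes within a sample but should not be read as calibrated confidence values.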

Based on @warped's comments, we can use the decision_function output to find the top n predicted classes from the model.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
# Fit on the training split only, so the test rows stay unseen
clf.fit(X_train, y_train)
top_n_classes = 2
# Column indices of the top-n scores per row, highest score first
top_idx = clf.decision_function(
                    X_test).argsort()[:,-top_n_classes:][:,::-1]
# Map column indices back to the actual class labels
predictions = clf.classes_[top_idx]
pred_df = pd.DataFrame(predictions,
                       columns= [f'{i+1}_pred' for i in range(top_n_classes)])

df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)

df
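
The same idea can be sketched for the text setting in the question, producing reason and reason2 columns. The toy DataFrames, their contents, and the names train, test, and model are hypothetical stand-ins for the asker's data. The labels are deliberately not 0, 1, 2, which shows why the column indices are mapped through classes_.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df_upsampled from the question
train = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car", "slow car",
             "blue sky", "cloudy sky"],
    "reason": [1, 1, 3, 3, 7, 7],
})
test = pd.DataFrame({"text": ["red apple pie", "cloudy blue sky"]})

# Vectorizer, TF-IDF weighting and classifier chained in one pipeline,
# so the test text goes through the same fitted transformers
model = make_pipeline(CountVectorizer(),
                      TfidfTransformer(sublinear_tf=True, norm="l2"),
                      LinearSVC(random_state=0))
model.fit(train["text"], train["reason"])

scores = model.decision_function(test["text"])
# Indices of the two highest scores per row, best first
top2 = np.argsort(scores, axis=1)[:, ::-1][:, :2]
# Translate column indices into the original class labels
labels = model.classes_[top2]

test["reason"] = labels[:, 0]
test["reason2"] = labels[:, 1]
print(test)
```

Wrapping the vectorizer and classifier in one pipeline is a design choice made here for brevity. Calling transform on the fitted count_vect and tfidf_transformer before decision_function should work equally well.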
