Output top 2 classes from a multiclass classification algorithm
Problem description
I am working on a multiclass text classification problem with many different classes (15+). I have trained a LinearSVC model (the method is just an example), but it outputs only the single class with the highest probability. Is there a way to make the algorithm output two classes at the same time?
Sample code I am using:
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Build n-gram counts, then re-weight them with TF-IDF
count_vect = CountVectorizer(max_df=.9, min_df=.002,
                             encoding='latin-1',
                             ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(df_upsampled['text'])
tfidf_transformer = TfidfTransformer(sublinear_tf=True, norm='l2')
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = LinearSVC().fit(X_train_tfidf, df_upsampled['reason'])
# X_test is not defined in this snippet; it must be built with the same
# fitted transformers (transform, not fit_transform)
y_pred = clf.predict(X_test)
Current output:
   source    user  time  text            reason
0  hi        neha  0     0:neha:hi       1
1  there     ram   1     1:ram:there     1
2  ball      neha  2     2:neha:ball     3
3  item      neha  3     3:neha:item     6
4  go there  ram   4     4:ram:go there  7
5  kk        ram   5     5:ram:kk        1
6  hshs      neha  6     6:neha:hshs     2
7  ggsgs     neha  7     7:neha:ggsgs    15
Desired output:
   source    user  time  text            reason  reason2
0  hi        neha  0     0:neha:hi       1       2
1  there     ram   1     1:ram:there     1       6
2  ball      neha  2     2:neha:ball     3       7
3  item      neha  3     3:neha:item     6       4
4  go there  ram   4     4:ram:go there  7       9
5  kk        ram   5     5:ram:kk        1       2
6  hshs      neha  6     6:neha:hshs     2       3
7  ggsgs     neha  7     7:neha:ggsgs    15      1
It is okay if I get the output in just one column, as I can split it into two columns myself.
Recommended answer
LinearSVC does not provide predict_proba, but it does provide decision_function, which gives the signed distance from the hyperplane.
From the documentation:
decision_function(self, X):
Predict confidence scores for samples.
The confidence score for a sample is the signed distance of that sample to the hyperplane.
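To make the quoted description concrete, here is a minimal sketch on synthetic data (the dataset and its sizes are assumptions chosen only for illustration). It checks that a fitted LinearSVC has no predict_proba, that decision_function returns one score column per class in the multiclass case, and that predict corresponds to the column with the largest score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-class data, used only for illustration
X, y = make_classification(n_samples=200, n_informative=6,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
clf = LinearSVC(random_state=0).fit(X, y)

# LinearSVC exposes scores, not probabilities
has_proba = hasattr(clf, "predict_proba")

# One row per sample, one column per class (columns follow clf.classes_)
scores = clf.decision_function(X)

# The predicted label is the class whose column holds the largest score
best = clf.classes_[scores.argmax(axis=1)]

print(has_proba)
print(scores.shape)
print(np.array_equal(best, clf.predict(X)))
```

Because these scores are distances rather than probabilities, they are suited to ranking classes within a sample, not to being read as calibrated confidence values.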
Based on @warped's comments, we can use the decision_function output to find the top n predicted classes from the model.
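One way to pick the top n columns per row is argsort followed by slicing. The sketch below uses a small hand-made score matrix (the numbers are made up and stand in for a decision_function result) so that each step can be checked by eye.

```python
import numpy as np

# Made-up scores: 2 samples, 4 classes
scores = np.array([[0.1, 2.0, -1.0, 0.5],
                   [1.5, -0.3, 0.2, 1.7]])
top_n = 2

# argsort orders column indices from lowest to highest score per row
order = scores.argsort(axis=1)

# Keep the last top_n indices (the highest scores), then reverse them
# so that the best class comes first
top_idx = order[:, -top_n:][:, ::-1]

print(top_idx)
# [[1 3]
#  [3 0]]
```

The values in top_idx are column positions, so with a real classifier they still need to be mapped to labels through the classes_ attribute.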
# Full example on synthetic data
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
clf.fit(X_train, y_train)  # fit on the training split only

top_n_classes = 2
# Sort class scores per row, keep the n largest, best first
top_idx = clf.decision_function(
    X_test).argsort()[:, -top_n_classes:][:, ::-1]
# Column indices refer to clf.classes_, so map them back to labels
predictions = clf.classes_[top_idx]
pred_df = pd.DataFrame(predictions,
                       columns=[f'{i+1}_pred' for i in range(top_n_classes)])
df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)
df
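The same idea can be applied to text data to produce the reason and reason2 columns from the question. This is only a sketch under stated assumptions: the toy corpus, the label names, and the use of TfidfVectorizer in a pipeline are hypothetical stand-ins for the asker's df_upsampled data and vectorizer settings.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; 'reason' stands in for the asker's label column
train = pd.DataFrame({
    "text": ["hi there", "hello friend", "good morning",
             "kick the ball", "ball game today", "play football",
             "buy this item", "item on sale", "order the item"],
    "reason": ["greeting"] * 3 + ["sports"] * 3 + ["shopping"] * 3,
})
test = pd.DataFrame({"text": ["hello there", "play ball game"]})

# Vectorizer and classifier in one pipeline, so the test text is
# transformed with the vocabulary learned during fit
clf = make_pipeline(TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
                    LinearSVC(random_state=0))
clf.fit(train["text"], train["reason"])

# Rank classes per sample by score, best first, and keep the top two
scores = clf.decision_function(test["text"])
top2_idx = np.argsort(scores, axis=1)[:, ::-1][:, :2]

# Map column indices back to the actual label values
top2 = clf.classes_[top2_idx]
test["reason"] = top2[:, 0]
test["reason2"] = top2[:, 1]
print(test)
```

Mapping through classes_ matters here because the labels are strings; in the synthetic example the labels happen to coincide with the column indices, which hides this step.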