Effective classification of natural text in scikit-learn/Python

Question

I want my classification algorithm to assign one of a set of categories to raw natural-language text if and only if the prediction meets a certain accuracy threshold for that category (say 80%). Otherwise I want the classifier to put that particular raw text into an 'unclassified' category. How do I do this?

My sample dataset:

+----------------------+------------+
| Details              | Category   |
+----------------------+------------+
| Any raw text1        | cat1       |
+----------------------+------------+
| any raw text2        | cat1       |
+----------------------+------------+
| any raw text5        | cat2       |
+----------------------+------------+
| any raw text7        | cat1       |
+----------------------+------------+
| any raw text8        | cat2       |
+----------------------+------------+
| Any raw text4        | cat4       |
+----------------------+------------+
| any raw text5        | cat4       |
+----------------------+------------+
| any raw text6        | cat3       |
+----------------------+------------+

This would be my training data; I'll split this same data into a training set and a test set.

import time

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tab-separated file with a free-text column and a label column
data = pd.read_csv('mydata.xls.gold', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()

# Only 'Details' and 'Category' are loaded, so 'Category' is used as the label
# (the original referenced data.NUM_CATEGORY, which usecols does not load)
x_train, x_test, y_train, y_test = train_test_split(
    data['Details'], data['Category'], random_state=42)

vect = CountVectorizer(ngram_range=(1, 2))
# Fit the vocabulary on the training text and convert it into count vectors
X_train = vect.fit_transform(x_train.values.astype('U'))
# Convert the test text using the vocabulary learned from the training text
X_test = vect.transform(x_test.values.astype('U'))

start = time.perf_counter()  # time.clock() was removed in Python 3.8

mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)

print(time.perf_counter() - start)

# mnb.predict_proba(X_test)[0:10, 1]
print(accuracy_score(y_test, result))

How do I proceed? Is there a parameter that needs to be set on the classifier? Thanks in advance.

Recommended answer

You can take the predict_proba result and build a pandas DataFrame from it, labelling the columns with the classifier's classes_ attribute (predict_proba returns its columns in that order, which is not necessarily the order of target_list). Then use max and idxmax to find the highest probability and the corresponding category for each element of the test set. Once that is done, use boolean masking to set every prediction whose top probability is below the threshold to "unclassified".

import pandas as pd

# mnb is the fitted classifier from the question;
# the columns of predict_proba follow mnb.classes_
df = pd.DataFrame(mnb.predict_proba(X_test), columns=mnb.classes_)
res_df = pd.DataFrame()

res_df['max_prob'] = df.max(axis=1)         # highest class probability per row
res_df['max_prob_cat'] = df.idxmax(axis=1)  # column label of that probability

# Anything predicted with less than 80% probability becomes 'unclassified'
res_df.loc[res_df['max_prob'] < .8, 'max_prob_cat'] = 'unclassified'
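
Column order matters when the predict_proba output is labelled. scikit-learn documents that the columns follow the classifier's classes_ attribute, which holds the labels in sorted order, whereas Series.unique() returns labels in order of first appearance. The small check below is only a sketch: it uses the Category column from the question's sample table and uses np.unique as a stand-in for how classes_ is ordered (an assumption about the internals, made for illustration; no classifier is fitted here).

```python
import numpy as np
import pandas as pd

# Category column of the sample dataset from the question, in table order.
category = pd.Series(
    ["cat1", "cat1", "cat2", "cat1", "cat2", "cat4", "cat4", "cat3"]
)

# Order of first appearance, as produced by data['Category'].unique().
appearance_order = [str(c) for c in category.unique()]

# Sorted order, used here as a stand-in for the classifier's classes_.
sorted_order = [str(c) for c in np.unique(category.to_numpy())]

print(appearance_order)
print(sorted_order)
```

For this sample the two orders differ (cat4 appears before cat3 in the table), so labelling the probability columns with the appearance order would attach the wrong names to the last two columns.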

df looks like this (first rows shown):

              cat1          cat2          cat3          cat4
0     1.091685e-06  2.257549e-04  9.994661e-01  3.070665e-04
1     2.288312e-02  9.752170e-01  1.783878e-03  1.159706e-04
2     1.980685e-01  3.494765e-01  4.416871e-01  1.076788e-02
3     2.205478e-07  9.999601e-01  3.276864e-05  6.920325e-06
4     2.736805e-03  9.795997e-01  1.718200e-02  4.815429e-04

res_df will look like this:

      max_prob  max_prob_cat
0     0.999466          cat3
1     0.975217          cat2
2     0.441687  unclassified
3     0.999960          cat2
4     0.979600          cat2
5     0.999956          cat2
6     0.998864          cat3
7     0.996888          cat3
8     0.999422          cat1
9     0.994412          cat3
10    0.954508          cat2
11    0.999999          cat2
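
The same thresholding can also be done directly on the probability matrix with NumPy, which may be handy inside a reusable function. The sketch below is not part of the original answer: label_with_threshold and its parameter names are hypothetical, and the input is simply the first five rows of df shown above, with the columns assumed to be in the order cat1 to cat4.

```python
import numpy as np


def label_with_threshold(proba, classes, threshold=0.8, fallback="unclassified"):
    """Return (labels, top_prob); rows whose top probability is below
    `threshold` get the `fallback` label instead of a class name."""
    proba = np.asarray(proba, dtype=float)
    # dtype=object keeps the fallback string from being truncated to the
    # width of the class names in a fixed-width string array.
    classes = np.asarray(classes, dtype=object)
    top_idx = proba.argmax(axis=1)   # column index of the most likely class
    top_prob = proba.max(axis=1)     # probability of that class
    labels = classes[top_idx]        # fancy indexing returns a new array
    labels[top_prob < threshold] = fallback
    return labels, top_prob


# First five rows of the df shown above; columns assumed to be cat1..cat4.
classes = ["cat1", "cat2", "cat3", "cat4"]
proba = [
    [1.091685e-06, 2.257549e-04, 9.994661e-01, 3.070665e-04],
    [2.288312e-02, 9.752170e-01, 1.783878e-03, 1.159706e-04],
    [1.980685e-01, 3.494765e-01, 4.416871e-01, 1.076788e-02],
    [2.205478e-07, 9.999601e-01, 3.276864e-05, 6.920325e-06],
    [2.736805e-03, 9.795997e-01, 1.718200e-02, 4.815429e-04],
]

labels, top_prob = label_with_threshold(proba, classes)
# Share of rows that kept a real category instead of the fallback.
coverage = float(np.mean(labels != "unclassified"))

print(list(labels))
print(coverage)
```

For these five rows the labels match rows 0 to 4 of res_df, with only row 2 falling back to 'unclassified'. Tracking the coverage alongside the accuracy of the rows that keep a category is one way to judge whether 0.8 is a sensible threshold for a given dataset.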
