How to predict all classes in a multi-class sentiment analysis problem using SVM?


Problem description

I am building a sentiment analysis classifier with three classes/labels: positive, neutral and negative. The shape of my training data is (14640, 15), with the following class counts:

negative    9178
neutral     3099
positive    2363

I pre-processed the data to standardize it, then applied bag-of-words vectorization to the tweet text so it can be fed to the model, which gives a feature matrix of shape (14640, 1000). Because the labels Y are in text form, I applied LabelEncoder to turn them into a single array of integers, like this:

[1 2 1 ... 1 0 1]

This is how I split my dataset:

X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

out:(10248, 1000) (10248,)
(4392, 1000) (4392,)
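
As a side note on the split above, `stratify` keeps the class proportions the same in the training and test sets; it does not reweight or rebalance the classes. The following is a minimal sketch of that behaviour, assuming synthetic labels that only reproduce the class counts from the question (the placeholder feature matrix and the helper `class_shares` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts from the question.
# LabelEncoder sorts class names, so negative=0, neutral=1, positive=2.
y = np.repeat([0, 1, 2], [9178, 3099, 2363])
X = np.zeros((y.size, 1))  # placeholder features; only the labels matter here

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

def class_shares(labels):
    """Return the fraction of samples that falls in each class."""
    return np.bincount(labels, minlength=3) / labels.size

full_shares = class_shares(y)
train_shares = class_shares(y_train)
test_shares = class_shares(y_test)

# All three rows should be nearly identical: the imbalance is preserved.
print(full_shares)
print(train_shares)
print(test_shares)
```

Handling the imbalance itself is a separate step, for example the `class_weight='balanced'` option passed to the classifier.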

I passed stratify=Y so that the training and test sets keep the same class proportions as the full, imbalanced dataset. For the classifier, I used an SVM:

svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train) 
prediction = svc.predict_proba(X_test) 
prediction_int = prediction[:,1] >= 0.3 
prediction_int = prediction_int.astype(np.int) 
print(prediction_int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))

out:[0 0 0 ... 1 0 0]
Precision score:  [0.74185137 0.50075529 0.        ]
Accuracy Score:  0.6691712204007286
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

@desertnaut helped me a lot in pinning down the actual problem: the classifier is unable to predict the third class. As you can see, the printed prediction_int does not contain any label 2, and the predictions are nowhere near the actual labels. I am worried that I made a mistake in the classification step. I originally wrote this classifier for binary classification and assumed I would not need to change it for multi-class classification. Can anyone help me solve this?

Recommended answer

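The problem is the way you use predict_proba: thresholding a single probability column is a binary-classification technique. In a multi-class problem, predict_proba returns one probability per class for every sample, so comparing only column 1 against 0.3 can produce nothing but 0 or 1, and label 2 is never predicted.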

You cannot use this line:

prediction_int = prediction[:,1] >= 0.3 

For further information, see this similar post: Multiclass Classification and probability prediction
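
To illustrate the point about predict_proba, here is a minimal sketch on a small synthetic 3-class dataset (an assumption standing in for the bag-of-words features, which are not available here). It shows the shape of the probability matrix, why thresholding one column cannot produce a third label, and how the most probable class can be read off through `classes_`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic 3-class problem (illustrative stand-in for the real data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)

svc = SVC(kernel="linear", C=1, probability=True, random_state=0).fit(X, y)

proba = svc.predict_proba(X)  # one column per class: shape (n_samples, 3)
print(proba.shape)

# Thresholding a single column is a boolean test, so it can only give 0 or 1.
binary_style = (proba[:, 1] >= 0.3).astype(int)
print(np.unique(binary_style))

# Picking the most probable column and mapping it through classes_
# can yield any of the three labels.
from_proba = svc.classes_[np.argmax(proba, axis=1)]
print(np.unique(from_proba))
```

Note that for `SVC` the probabilities come from a separate calibration step, so the argmax of predict_proba may occasionally disagree with the output of `predict`.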

Update

I got it working by replacing all of the prediction code with this single line:

pred = svc.predict(X_test)  

As he said, I had previously been using my binary-classification prediction logic. With predict, the classifier can now assign all 3 labels, so my precision and recall are computed properly.
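
The fix can be sketched end to end as follows. This is an illustrative example on synthetic, imbalanced 3-class data (the dataset and its class weights are assumptions, not the original tweets); it uses `predict` directly and reports per-class precision and recall:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced 3-class data, roughly mimicking the question's ratios.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3,
    weights=[0.63, 0.21, 0.16], random_state=0,
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# probability=True is not needed when only predict() is used.
svc = SVC(kernel="linear", C=1, class_weight="balanced").fit(X_train, Y_train)
pred = svc.predict(X_test)  # labels come straight from the set {0, 1, 2}

# average=None returns one score per class instead of a single number.
precision = precision_score(Y_test, pred, average=None, zero_division=0)
recall = recall_score(Y_test, pred, average=None, zero_division=0)
accuracy = accuracy_score(Y_test, pred)

print("Precision per class:", precision)
print("Recall per class:", recall)
print("Accuracy:", accuracy)
```

Dropping `probability=True` also avoids the extra internal cross-validation that `SVC` runs to calibrate probabilities, which can shorten training noticeably on larger datasets.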
