Multiclass Classification and Probability Prediction

Question

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

fi = "df.csv"
# Read the CSV file into a DataFrame
data = pd.read_csv(fi, sep=",")

# split the data into training and test data
train, test = train_test_split(data, test_size=0.6, random_state=0)
# initialise Gaussian Naive Bayes
naive_b = GaussianNB()

# the first 127 columns are the features, the 128th is the class label
train_features = train.iloc[:, 0:127]
train_label = train.iloc[:, 127]

test_features = test.iloc[:, 0:127]
test_label = test.iloc[:, 127]

naive_b.fit(train_features, train_label)
test_data = pd.concat([test_features, test_label], axis=1)
test_data["p_malw"] = naive_b.predict_proba(test_features)

print("test_data\n", test_data["p_malw"])
print("Accuracy:", naive_b.score(test_features, test_label))

I have written this code to accept input from a csv file with 128 columns where 127 columns are features and the 128th column is the class label.

I want to predict the probability that a sample belongs to each class (there are 5 classes, 1-5), print the result in the form of a matrix, and determine the class of each sample from the prediction. predict_proba() is not giving the desired output. Please suggest the required changes.

Recommended answer

GaussianNB.predict_proba returns, for each sample, the probability of every class in the model. In your case, it should return a result with five columns and as many rows as your test data. You can verify which column corresponds to which class using naive_b.classes_. So it is not clear why you say this is not the desired output. Perhaps your problem comes from the fact that you are assigning the output of predict_proba to a single data frame column. Try:

pred_prob = naive_b.predict_proba(test_features)

instead of

test_data["p_malw"] = naive_b.predict_proba(test_features)

and verify its shape using pred_prob.shape. The second dimension should be 5.
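As a minimal sketch of the "matrix" output asked for in the question, the probabilities can be wrapped in a DataFrame whose columns are taken from naive_b.classes_, and the most probable class can be read off with argmax. The synthetic data, the 4-feature layout, and the names prob_table and predicted are assumptions made for illustration; they stand in for the real df.csv, which is not available here.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data (assumption): 5 classes labelled 1-5,
# 40 samples per class, 4 features whose mean depends on the class.
rng = np.random.RandomState(0)
labels = np.repeat(np.arange(1, 6), 40)
features = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(200, 4))

naive_b = GaussianNB().fit(features, labels)

# Take every 20th sample so that all five classes are represented
sample = features[::20]
pred_prob = naive_b.predict_proba(sample)  # one row per sample, one column per class

# Label each probability column with the class it refers to
prob_table = pd.DataFrame(pred_prob, columns=naive_b.classes_)

# The most probable class for each sample
predicted = naive_b.classes_[pred_prob.argmax(axis=1)]

print(prob_table.round(3))
print("predicted classes:", predicted)
```

Each row of prob_table sums to 1, and picking the column with the largest probability should give the same labels as naive_b.predict on the same samples.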

If you want the predicted label for each sample, you can use the predict method, followed by a confusion matrix to see how many labels have been predicted correctly.

from sklearn.metrics import confusion_matrix

naive_b.fit(train_features, train_label)

pred_label = naive_b.predict(test_features)

confusion_m = confusion_matrix(test_label, pred_label)
confusion_m
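To make the reading of the confusion matrix concrete, here is a small sketch on hand-written labels (the arrays true_label and pred_label are invented for illustration and are not output from the model above). In scikit-learn's confusion_matrix, rows are the true classes and columns are the predicted classes, so the diagonal counts the correctly predicted samples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for ten samples, classes 1-5
true_label = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
pred_label = np.array([1, 2, 3, 4, 5, 1, 3, 3, 4, 4])

# Fix the row/column order explicitly so index 0 is class 1, index 4 is class 5
class_order = [1, 2, 3, 4, 5]
confusion_m = confusion_matrix(true_label, pred_label, labels=class_order)

# Diagonal entries are correct predictions; everything else is a mistake
correct = np.trace(confusion_m)
accuracy = correct / confusion_m.sum()

print(confusion_m)
print("correct:", correct, "accuracy:", accuracy)
```

Here 8 of the 10 samples lie on the diagonal; the two off-diagonal entries show one class-2 sample predicted as class 3 and one class-5 sample predicted as class 4.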

Here is some useful reading.

sklearn GaussianNB - http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict_proba

sklearn confusion_matrix - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
