Weka ignoring unlabeled data
Problem description
I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, hence I am working with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody please guide me on how to solve this? Someone has already asked this question here before, but no appropriate solution was provided. Here is a sample test file:
@relation referents
@attribute feature1 NUMERIC
@attribute feature2 NUMERIC
@attribute feature3 NUMERIC
@attribute feature4 NUMERIC
@attribute class{1 -1}
@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?
Recommended answer
The problem is that when you specify a training set with -t train.arff and a test set with -T test.arff, the mode of operation is to calculate the performance of the model on the test set. But you can't calculate performance of any kind without knowing the actual class. Without the actual class, how would you know whether a prediction is right or wrong?
I used the data you gave as both train.arff and test.arff, with arbitrary class labels assigned by me for training. The relevant output lines are:
=== Error on training data ===

Correctly Classified Instances           4               80      %
Incorrectly Classified Instances         1               20      %
Kappa statistic                          0.6154
Mean absolute error                      0.2429
Root mean squared error                  0.4016
Relative absolute error                 50.0043 %
Root relative squared error             81.8358 %
Total Number of Instances                5

=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = 1
 0 2 | b = -1
and
=== Error on test data ===

Total Number of Instances                0
Ignored Class Unknown Instances          5

=== Confusion Matrix ===

 a b   <-- classified as
 0 0 | a = 1
 0 0 | b = -1
Weka can give you those statistics for the training set because it knows both the actual class labels and the predicted ones (obtained by applying the model to the training set). For the test set, it can't report anything about performance, because it doesn't know the true class labels.
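To make the reasoning above concrete, here is a minimal sketch in plain Java (no Weka dependency) of an evaluation loop that skips instances whose actual class is missing. It is only an illustration, not Weka's actual implementation: the class name UnlabeledEvaluationSketch, the Summary type, and the evaluate method are all hypothetical names made up for this example. The predicted labels are the ones from my run on your five instances.

```java
public class UnlabeledEvaluationSketch {

    /** Marker for a missing class label, like '?' in an ARFF file. */
    public static final String MISSING = "?";

    /** Counts that an evaluation summary would report (hypothetical type). */
    public static final class Summary {
        public int correct;
        public int incorrect;
        public int ignored;

        /** Only instances with a known class count towards the total. */
        public int total() {
            return correct + incorrect;
        }
    }

    /**
     * Compares actual and predicted labels. An instance whose actual label
     * is missing cannot be scored as right or wrong, so it is only counted
     * as ignored.
     */
    public static Summary evaluate(String[] actual, String[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        Summary s = new Summary();
        for (int i = 0; i < actual.length; i++) {
            if (MISSING.equals(actual[i])) {
                s.ignored++;
            } else if (actual[i].equals(predicted[i])) {
                s.correct++;
            } else {
                s.incorrect++;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Actual labels of the unlabeled test file: all missing.
        String[] actual = {"?", "?", "?", "?", "?"};
        // Labels the model predicted for those five instances.
        String[] predicted = {"1", "1", "-1", "-1", "-1"};

        Summary s = evaluate(actual, predicted);
        System.out.println("Total Number of Instances        " + s.total());
        System.out.println("Ignored Class Unknown Instances  " + s.ignored);

        // The predictions themselves are still available, one per instance.
        for (int i = 0; i < predicted.length; i++) {
            System.out.println((i + 1) + "  actual=" + actual[i]
                    + "  predicted=" + predicted[i]);
        }
    }
}
```

With all five actual labels missing, the sketch counts zero scored instances and five ignored ones, which mirrors the test-data summary above, while the per-instance predictions remain available.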
What you probably want to do is:
java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4
which in my case would give you:
=== Predictions on test data ===

 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1          (1,7,1,0)
     2        1:?        1:1       1          (1,5,1,0)
     3        1:?       2:-1       0.786      (-1,1,1,0)
     4        1:?       2:-1       0.861      (1,1,1,1)
     5        1:?       2:-1       0.861      (-1,1,1,1)
So you can get the predictions, but you can't get a performance measure, because your test data is unlabeled.