Weka ignoring unlabeled data

Problem description

I am working on an NLP classification project using the Naive Bayes classifier in Weka. I intend to use semi-supervised machine learning, and therefore need to work with unlabeled data. When I test the model obtained from my labeled training data on an independent set of unlabeled test data, Weka ignores all the unlabeled instances. Can anybody guide me on how to solve this? Someone has asked this question here before, but no appropriate solution was provided. Here is a sample test file:

@relation referents
@attribute feature1      NUMERIC
@attribute feature2      NUMERIC
@attribute feature3      NUMERIC
@attribute feature4      NUMERIC
@attribute class         {1, -1}
@data
1, 7, 1, 0, ?
1, 5, 1, 0, ?
-1, 1, 1, 0, ?
1, 1, 1, 1, ?
-1, 1, 1, 1, ?

Recommended answer

The problem is that when you specify a training set with -t train.arff and a test set with -T test.arff, the mode of operation is to calculate the performance of the model on the test set. But you can't calculate a performance of any kind without knowing the actual class. Without the actual class, how would you know whether your prediction is right or wrong?

I used the data you gave as both train.arff and test.arff, with arbitrary class labels assigned by me for training. The relevant output lines are:

=== Error on training data ===

Correctly Classified Instances           4               80      %
Incorrectly Classified Instances         1               20      %
Kappa statistic                          0.6154
Mean absolute error                      0.2429
Root mean squared error                  0.4016
Relative absolute error                 50.0043 %
Root relative squared error             81.8358 %
Total Number of Instances                5     


=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = 1
 0 2 | b = -1

=== Error on test data ===

Total Number of Instances                0     
Ignored Class Unknown Instances                  5     


=== Confusion Matrix ===

 a b   <-- classified as
 0 0 | a = 1
 0 0 | b = -1

Weka can give you those statistics for the training set because it knows both the actual class labels and the predicted ones (obtained by applying the model to the training set). For the test set, it can't report anything about performance because it doesn't know the true class labels.
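The reasoning above can be illustrated with a small Python sketch. This is not Weka's implementation; the function name `evaluate`, the `?` marker for a missing label, and the label lists are all made up for illustration (the training labels are simply chosen so that 4 of 5 predictions come out correct). It scores only instances whose true class is known and counts the rest as ignored, which mirrors the "Ignored Class Unknown Instances" line in the output.

```python
def evaluate(actual, predicted, missing="?"):
    """Score predictions, skipping instances whose true class is unknown."""
    correct = incorrect = ignored = 0
    for truth, guess in zip(actual, predicted):
        if truth == missing:
            ignored += 1      # nothing to compare the prediction against
        elif truth == guess:
            correct += 1
        else:
            incorrect += 1
    return {
        "correct": correct,
        "incorrect": incorrect,
        "ignored": ignored,
        "total": correct + incorrect,  # only scored instances are counted
    }


# Hypothetical predictions and labels, for illustration only.
predicted = ["1", "1", "-1", "-1", "-1"]
train_summary = evaluate(["1", "1", "-1", "1", "-1"], predicted)
test_summary = evaluate(["?"] * 5, predicted)

print(train_summary)
print(test_summary)
```

With labeled data every instance is scored; with only `?` labels the scored total drops to zero and all five instances are ignored, even though a prediction exists for each of them.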

What you probably want to do is:

java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t train.arff -T test.arff -p 1-4

which in my case would give you:

=== Predictions on test data ===

 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1 (1,7,1,0)
     2        1:?        1:1       1 (1,5,1,0)
     3        1:?       2:-1       0.786 (-1,1,1,0)
     4        1:?       2:-1       0.861 (1,1,1,1)
     5        1:?       2:-1       0.861 (-1,1,1,1)

So you can get the predictions, but you can't get a performance measure, because your test data is unlabeled.
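Since the stated goal is semi-supervised learning, the predictions usually have to be read back by a program. The following Python sketch parses prediction rows in the layout shown above. The function name `parse_predictions`, the regular expression, and the dictionary keys are my own assumptions, and the exact layout of the prediction output may differ between Weka versions, so treat this as a starting point rather than a reference parser.

```python
import re

# One row: inst#, actual (index:label), predicted (index:label),
# an optional "+" error flag, the prediction confidence, and
# optionally the attribute values in parentheses.
ROW = re.compile(
    r"^\s*(\d+)\s+\d+:(\S+)\s+\d+:(\S+)\s+(\+\s+)?([\d.]+)(?:\s+\((.*)\))?\s*$"
)


def parse_predictions(text):
    """Return one dict per prediction row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue
        inst, actual, predicted, error, confidence, attrs = match.groups()
        rows.append({
            "inst": int(inst),
            "actual": actual,            # "?" when the class is unknown
            "predicted": predicted,
            "error": error is not None,  # True only if the "+" flag is present
            "confidence": float(confidence),
            "attributes": attrs.split(",") if attrs else [],
        })
    return rows


# Two rows copied from the output above, plus the header line.
sample = """\
 inst#     actual  predicted error prediction (feature1,feature2,feature3,feature4)
     1        1:?        1:1       1 (1,7,1,0)
     3        1:?       2:-1       0.786 (-1,1,1,0)
"""

rows = parse_predictions(sample)
for row in rows:
    print(row["inst"], row["predicted"], row["confidence"])
```

From here, one possible next step (again an assumption, not something the answer prescribes) is to keep only rows above a confidence threshold, for example `[r for r in rows if r["confidence"] >= 0.9]`, and add those instances with their predicted labels to the training set.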
