Predicting text data labels in test data set with Weka?

Question

I am using the Weka GUI to train an SVM classifier (using LibSVM) on a dataset. The data in the .arff file is:

@relation Expandtext

@attribute message string 
@attribute Class {positive, negative, objective}

@data

I turn it into a bag of words with StringToWordVector, run the SVM, and get a decent classification rate. Now I have test data whose labels I do not know and want to predict. Its header information is the same, but the class of every instance is given as a question mark (?), i.e.

'Musical awareness: Great Big Beautiful Tomorrow has an ending\u002c Now is the time does not', ?

Again I pre-processed it with StringToWordVector; the class is in the same position as in the training data.

I go to the "Classify" tab, load my trained SVM model, select "Supplied test set", load the test data, then right-click the model and choose "Re-evaluate model on current test set", but it gives me an error saying that the test and training sets are not compatible. I am not sure why.

Am I going about labeling the test data the wrong way? What am I doing wrong?

Recommended answer

For almost any machine learning algorithm, the training data and the test data need to have the same format. That means both must have the same features (attributes, in Weka terms) in the same format, including the class.

The problem is probably that you pre-process the training set and the test set independently, so the StringToWordVector filter creates different features for each set. Hence the model, trained on the training set, is incompatible with the test set.
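To make the mismatch concrete, here is a minimal plain-Java sketch that does not use Weka at all. The class `VocabularyMismatch`, its helper methods, and the toy sentences are all hypothetical stand-ins for what a bag-of-words filter does: a vocabulary built from the test documents alone differs from one built from the training documents, whereas mapping the test document onto the training vocabulary keeps the attribute layout the model expects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not how Weka implements StringToWordVector.
class VocabularyMismatch {
    // Builds a sorted vocabulary: one "attribute" per distinct lowercase word.
    static List<String> buildVocabulary(List<String> docs) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : docs) {
            for (String w : doc.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return new ArrayList<>(words);
    }

    // Maps a document onto a fixed vocabulary as word-presence flags.
    // Words that are not in the vocabulary are ignored.
    static int[] vectorize(String doc, List<String> vocabulary) {
        int[] vector = new int[vocabulary.size()];
        for (String w : doc.toLowerCase().split("\\W+")) {
            int idx = vocabulary.indexOf(w);
            if (idx >= 0) {
                vector[idx] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList("great movie", "terrible movie");
        List<String> test = Arrays.asList("great ending");

        // Built independently, the two attribute lists do not match.
        List<String> trainVocab = buildVocabulary(train);
        List<String> testVocab = buildVocabulary(test);
        System.out.println("train attributes: " + trainVocab);
        System.out.println("test attributes:  " + testVocab);

        // Reusing the training vocabulary keeps the test vector in the training layout.
        int[] vector = vectorize(test.get(0), trainVocab);
        System.out.println("test doc in train space: " + Arrays.toString(vector));
    }
}
```

In this toy setup the word "ending" never becomes an attribute, because the training data did not contain it. That is the price of keeping both sets in the same attribute space.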

What you want to do instead is initialize the filter on the training set and then apply it to both the training set and the test set.

The question "Weka: ReplaceMissingValues for a test file" deals with this issue, but I'll repeat the relevant part here:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Filter filter = new StringToWordVector();  // could be any filter
filter.setInputFormat(train);  // initialize the filter once, with the training set's format
Instances newTrain = Filter.useFilter(train, filter);  // builds the dictionary from train and returns the filtered training set
Instances newTest = Filter.useFilter(test, filter);    // reuses that dictionary, so newTest gets the same attributes as newTrain

Now you can train the SVM and apply the resulting model to the test data.

If training and testing have to happen in separate runs or programs, it should be possible to serialize the initialized filter together with the model. When you load (deserialize) the model, you can also load the filter and apply it to the test data. They should be compatible now.
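The save-and-reload idea can be sketched with nothing but JDK serialization. Everything in this example is a hypothetical stand-in: `FittedVocabulary` plays the role of an initialized filter, and the byte array plays the role of a file on disk. In Weka itself the same step would rely on the filter being serializable, for instance through `weka.core.SerializationHelper`, but the sketch below assumes nothing about Weka.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; the names below are not part of Weka.
class FilterRoundTrip {
    // Stand-in for a filter that has been initialized on the training set.
    static class FittedVocabulary implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<String> words;

        FittedVocabulary(List<String> trainingDocs) {
            TreeSet<String> sorted = new TreeSet<>();
            for (String doc : trainingDocs) {
                for (String w : doc.toLowerCase().split("\\W+")) {
                    if (!w.isEmpty()) {
                        sorted.add(w);
                    }
                }
            }
            words = new ArrayList<>(sorted);
        }

        // Maps a document onto the stored vocabulary as word-presence flags.
        int[] apply(String doc) {
            int[] vector = new int[words.size()];
            for (String w : doc.toLowerCase().split("\\W+")) {
                int idx = words.indexOf(w);
                if (idx >= 0) {
                    vector[idx] = 1;
                }
            }
            return vector;
        }
    }

    // Serializes an object to bytes (a file stream would work the same way).
    static byte[] save(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restores an object from bytes produced by save().
    static Object load(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Training run": fit on the training documents, then persist.
        FittedVocabulary fitted = new FittedVocabulary(Arrays.asList("great movie", "terrible movie"));
        byte[] stored = save(fitted);

        // "Prediction run": restore and apply to unseen text.
        FittedVocabulary restored = (FittedVocabulary) load(stored);
        System.out.println(Arrays.toString(restored.apply("great ending")));
    }
}
```

Weka also ships `weka.classifiers.meta.FilteredClassifier`, which wraps a filter and a classifier in one object, so the pair is trained and saved together and the two cannot drift apart.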
