Learning Weka on the Command Line

This article covers working with Weka on the command line; hopefully it is a useful reference for anyone facing the same problem.

Problem Description


I am fairly new to Weka and even more new to Weka on the command line. I find documentation is poor and I am struggling to figure out a few things to do. For example, want to take two .arff files, one for training, one for testing and get an output of predictions for the missing labels in the test data.

How can I do this?

I have this code as a starting block

java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues -- -c last
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a

Running that code gives me "Illegal option -c last" and I am not sure why. I am also not going to use MLP, as NNs tend to be too slow when I have a few thousand features from the text data. I know how to change it to another classifier, though (like NB or libSVM), so that is good.

But I am not sure how to add multiple filters in one call, as I also need to add the StringToWordVector filter (and possibly the Reorder filter to make the class the last attribute instead of the first).

And then how do I get it to actually output the predicted label for each instance? And then store those in an ARFF file along with the initial data.

Solution

Weka is not really the shining example of documentation, but you can still find valuable information about it on their sites. You should start with the Primer. I understand that you want to classify text files, so you should also have a look at Text categorization with WEKA. There is also a new Weka documentation site.

[Edit: Wikispaces has shut down and Weka hasn't brought up the sites somewhere else, yet, so I've modified the links to point at the Google cache. If someone reads this and a new Weka Wiki is up, feel free to edit the links and remove this note.]

The command line you posted in your question contains an error. I know you copied it from my answer to another question, but I also only just noticed it. You have to omit the -- -c last, because the ReplaceMissingValues filter doesn't accept it.

In the Primer it says:

weka.filters.supervised

Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e. taking advantage of the class information. A class must be assigned via -c, for WEKA default behaviour use -c last.

but ReplaceMissingValues is an unsupervised filter, as is StringToWordVector.

Multiple filters

Adding multiple filters is also no problem; that is what the MultiFilter is for. The command line can get a bit messy, though. (I chose RandomForest here because it is a lot faster than a NN.)

java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
  -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
  -T ~/weka-3-7-9/data/ReutersCorn-test.arff \
 -F "weka.filters.MultiFilter \
     -F weka.filters.unsupervised.attribute.StringToWordVector \
     -F weka.filters.unsupervised.attribute.Standardize" \
 -W weka.classifiers.trees.RandomForest -- -I 100

Making predictions

Here is what the Primer says about getting the prediction:

However, if more detailed information about the classifier's predictions are necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none).

It is a good convention to put those general options like -p 0 directly after the class you're calling, so the command line would be

java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
  -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
  -T ~/weka-3-7-9/data/ReutersCorn-test.arff \
  -p 0 \
 -F "weka.filters.MultiFilter \
     -F weka.filters.unsupervised.attribute.StringToWordVector \
     -F weka.filters.unsupervised.attribute.Standardize" \
 -W weka.classifiers.trees.RandomForest -- -I 100
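With -p 0, Weka prints one line per test instance. The exact column layout varies across Weka versions, but assuming the common "inst# actual predicted error prediction" layout, the output can be post-processed with a short stdlib-only script (the sample text and parser below are illustrative, not part of the Weka API):

```python
SAMPLE = """\
=== Predictions on test data ===

 inst#     actual  predicted error prediction
     1      1:yes      1:yes       0.96
     2      2:no       1:yes   +   0.61
"""

def parse_predictions(text):
    """Extract (instance, actual, predicted, errored) tuples from -p output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip headers, separators, and blank lines
        inst = int(parts[0])
        actual = parts[1].split(":", 1)[1]      # strip the "1:" index prefix
        predicted = parts[2].split(":", 1)[1]
        errored = "+" in parts[3:]              # "+" marks a misclassification
        rows.append((inst, actual, predicted, errored))
    return rows

print(parse_predictions(SAMPLE))
# → [(1, 'yes', 'yes', False), (2, 'no', 'yes', True)]
```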

Structure of WEKA classifiers/filters

But as you can see, WEKA can get very complicated when you call it from the command line. This is due to the tree structure of WEKA classifiers and filters: though you can run only one classifier/filter per command line, you can structure it as complex as you like. For the above command, the structure looks like this:

The FilteredClassifier will initialize a filter on the training data set, filter both training and test data, then train a model on the training data and classify the given test data.

FilteredClassifier
 |
 + Filter
 |
 + Classifier
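The train/filter/classify steps above can be sketched in plain Python with toy stand-ins (these classes are made up for illustration and are not the Weka API):

```python
class StandardizeFilter:
    """Toy filter: learns mean/std on the training data, transforms any data."""
    def fit(self, rows):
        n = len(rows)
        self.mean = sum(rows) / n
        var = sum((x - self.mean) ** 2 for x in rows) / n
        self.std = var ** 0.5 or 1.0  # avoid dividing by zero
        return self

    def transform(self, rows):
        return [(x - self.mean) / self.std for x in rows]

class ThresholdClassifier:
    """Toy classifier: predicts 1 for values above zero."""
    def fit(self, rows, labels):
        return self

    def predict(self, rows):
        return [1 if x > 0 else 0 for x in rows]

def filtered_classify(train, labels, test):
    # 1. initialize the filter on the TRAINING data only
    f = StandardizeFilter().fit(train)
    # 2. filter both training and test data with the same fitted filter
    train_f, test_f = f.transform(train), f.transform(test)
    # 3. train on the filtered training data, classify the filtered test data
    clf = ThresholdClassifier().fit(train_f, labels)
    return clf.predict(test_f)

print(filtered_classify([1.0, 2.0, 3.0], [0, 0, 1], [0.5, 4.0]))  # → [0, 1]
```

The key point mirrored here is that the filter is fitted once on the training data and then reused unchanged on the test data, which is exactly why FilteredClassifier keeps filter and classifier together.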

If we want multiple filters, we use the MultiFilter, which is only one filter, but it calls multiple others in the order they were given.

FilteredClassifier
 |
 + MultiFilter
 |  |
 |  + StringToWordVector
 |  |
 |  + Standardize
 |
 + RandomForest

The hard part of running something like this from the command line is assigning the desired options to the right classes, because often the option names are the same. For example, the -F option is used by the FilteredClassifier and the MultiFilter as well, so I had to use quotes to make it clear which -F belongs to which filter.

In the last line, you see that the option -I 100, which belongs to the RandomForest, can't be appended directly, because then it would be assigned to the FilteredClassifier and you would get Illegal options: -I 100. Hence, you have to add -- before it.
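You can check the quoting rule with Python's shlex module, which splits strings the way a POSIX shell does: the quoted -F value reaches FilteredClassifier as a single token, which is then split again to configure the MultiFilter.

```python
import shlex

# A trimmed-down version of the command-line options above.
cmdline = ('-F "weka.filters.MultiFilter '
           '-F weka.filters.unsupervised.attribute.StringToWordVector" '
           '-W weka.classifiers.trees.RandomForest -- -I 100')

outer = shlex.split(cmdline)
print(outer[1])  # the whole quoted MultiFilter spec is ONE argument
# → weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector

inner = shlex.split(outer[1])  # the nested filter splits it again
print(inner[0])
# → weka.filters.MultiFilter
```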

Adding predictions to the data files

Adding the predicted class label is also possible, but even more complicated. AFAIK this can't be done in one step; you have to train and save a model first, then use it to predict and assign new class labels.

Training and saving the model:

java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
  -t ~/weka-3-7-9/data/ReutersCorn-train.arff \
  -d rf.model \
  -F "weka.filters.MultiFilter \
      -F weka.filters.unsupervised.attribute.StringToWordVector \
      -F weka.filters.unsupervised.attribute.Standardize" \
  -W weka.classifiers.trees.RandomForest -- -I 100

This will serialize the model of the trained FilteredClassifier to the file rf.model. The important thing here is that the initialized filter will also be serialized, otherwise the test set wouldn't be compatible after filtering.

Loading the model, making predictions and saving it:

java -classpath weka.jar weka.filters.supervised.attribute.AddClassification \
  -serialized rf.model \
  -classification \
  -remove-old-class \
  -i ~/weka-3-7-9/data/ReutersCorn-test.arff \
  -o pred.arff \
  -c last
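The resulting pred.arff can then be inspected with any ARFF-aware tool. As a rough stdlib-only illustration, the following sketch pulls the last (class) column out of a simple dense ARFF file; the sample data is made up, and real ARFF parsing has more cases (sparse format, escaped quotes, commas inside strings) than this handles:

```python
SAMPLE_ARFF = """\
@relation pred
@attribute text string
@attribute class {grain,not-grain}
@data
'wheat prices rise',grain
'board meeting today',not-grain
"""

def last_column(arff_text):
    """Return the last attribute's value for every @data row."""
    labels, in_data = [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue              # skip blank lines and ARFF comments
        if line.lower() == "@data":
            in_data = True        # everything after @data is instance data
            continue
        if in_data:
            labels.append(line.rsplit(",", 1)[1])  # class is the last attribute
    return labels

print(last_column(SAMPLE_ARFF))  # → ['grain', 'not-grain']
```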
