Weka J48 Classifier: Cannot handle numeric class?
Question
I'm trying to build a J48 (C4.5) classifier model on my training data using Weka.
First I ran this, which seemed to go OK:
java -Xmx10G -cp /weka/weka.jar weka.core.converters.TextDirectoryLoader -dir /home/test/cats > /home/test/cats.arff
This also seemed to go OK:
java -Xmx10G -cp /weka/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -i /home/test/cats.arff -o /home/test/cats-vector.arff
This does not work:
java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats-vector.arff -d /home/test/cats.model
It gives the following error:
weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45PruneableClassifierTree: Cannot handle numeric class!
at weka.core.Capabilities.test(Capabilities.java:954)
at weka.core.Capabilities.test(Capabilities.java:1110)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier(C45PruneableClassifierTree.java:116)
at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
So I tried training on the unfiltered file instead:
java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats.arff -d /home/test/cats.model
That also gives an error:
weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45PruneableClassifierTree: Cannot handle string attributes!
at weka.core.Capabilities.test(Capabilities.java:980)
at weka.core.Capabilities.test(Capabilities.java:869)
at weka.core.Capabilities.test(Capabilities.java:1085)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier(C45PruneableClassifierTree.java:116)
at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
Obviously I've prepared the data wrong somehow (BTW, the input is text files in subdirectories named after the categories I want). But I thought I was following the instructions on the Weka Wiki:
Weka Wiki Categorizing Text Files
Weka Wiki Primer
So what am I doing wrong? I would like to use J48 because it has given high accuracy on my data in tests. What do I need to do to my data to get the J48 classifier to accept it? Or do I need to use a different classifier?
Please help!
Recommended answer
The word vectors could be converted to binary like this:
java -Xmx4G -cp /weka/weka.jar weka.filters.unsupervised.attribute.NumericToBinary -i /home/test/cats-vector.arff -o /home/test/cats-binary.arff
However, this adds bias to the kind of data you are training against: binary strings that are very close to one another are treated as more similar than strings that are far apart. If you want to remove this bias and regard each string as a totally unique entity, declare the class as a nominal attribute instead, e.g. @attribute class {ABC, DEF, GHI, etc}
Then it works!
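Since the "Cannot handle numeric class" error is about the class attribute specifically, it can help to check how the ARFF header declares each attribute before training. Below is a minimal sketch in Python, not part of Weka; the `read_attributes` helper and the sample header are made up for illustration, and the parser only handles simple single-line `@attribute` declarations. As far as I know, Weka's command line takes the last attribute as the class unless a class index is given, so the last entry is the one worth looking at first.

```python
import re

# Matches "@attribute <name> <type>"; the name may be quoted.
ATTR_RE = re.compile(
    r"""^@attribute\s+('[^']*'|"[^"]*"|\S+)\s+(.+)$""", re.IGNORECASE
)

def read_attributes(arff_text):
    """Return (name, kind) pairs from an ARFF header.

    kind is one of "nominal", "numeric", "string", "other".
    """
    attrs = []
    for raw in arff_text.splitlines():
        line = raw.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        m = ATTR_RE.match(line)
        if not m:
            continue
        name = m.group(1).strip("'\"")
        decl = m.group(2).strip()
        if decl.startswith("{"):
            kind = "nominal"   # e.g. {ABC,DEF,GHI}
        elif decl.lower() in ("numeric", "real", "integer"):
            kind = "numeric"
        elif decl.lower() == "string":
            kind = "string"
        else:
            kind = "other"     # e.g. date or relational attributes
        attrs.append((name, kind))
    return attrs

# Hypothetical header: nominal class first, numeric word counts after it.
sample = """@relation cats
@attribute class {ABC,DEF,GHI}
@attribute kitten numeric
@attribute whiskers numeric
@data
"""

attrs = read_attributes(sample)
first, last = attrs[0], attrs[-1]
print("first attribute:", first)
print("last attribute:", last)
```

If the last attribute turns out to be numeric while the nominal category attribute sits elsewhere in the header, the classifier is probably being pointed at the wrong column rather than at badly encoded data.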
If you really want to communicate that these features are important and not at all related to each other, make a whole column for each string, with the value 1 when a row has that category and 0 when it does not. This creates very sparse data, but the learning algorithm then has a bias toward scanning that data for information gain.
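The one-column-per-string idea can be sketched in a few lines of Python. This is only an illustration of the encoding, not something Weka runs; the `one_hot` helper and the category labels are made up for the example.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator columns.

    Returns the sorted category names and one row of flags per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical category labels, one per training row.
labels = ["ABC", "GHI", "ABC", "DEF"]
categories, rows = one_hot(labels)

print(categories)
for label, row in zip(labels, rows):
    print(label, row)
```

Each row contains exactly one 1, which is why the result gets very sparse as the number of distinct strings grows.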