Simple text classification using Naive Bayes (Weka) in Java


Question

I am trying to do text classification with the Naive Bayes classifier from the Weka library in my Java code, but I think the classification results are incorrect and I don't know what the problem is. I use ARFF files as input.

This is my training data:

@relation hamspam

@attribute text string
@attribute class {spam,ham}

@data
'good',ham
'good',ham
'very good',ham
'bad',spam
'very bad',spam
'very bad, very bad',spam
'good good bad',ham

This is my testing data:

@relation test

@attribute text string
@attribute class {spam,ham}

@data
'good bad very bad',?
'good bad very bad',?
'good',?
'good very good',?
'bad',?
'very good',?
'very very good',?

This is my code:

public static void NaiveBayes(String training_file, String testing_file) throws FileNotFoundException, IOException, Exception{
         //filter
        StringToWordVector filter = new StringToWordVector();

        Classifier naive = new NaiveBayes();

        //training data
        Instances train = new Instances(new BufferedReader(new FileReader(training_file)));
        int lastIndex = train.numAttributes() - 1;
        train.setClassIndex(lastIndex);
        filter.setInputFormat(train);
        train = Filter.useFilter(train, filter);

        //testing data
        Instances test = new Instances(new BufferedReader(new FileReader(testing_file)));
        test.setClassIndex(lastIndex);
        filter.setInputFormat(test);
        Instances test2 = Filter.useFilter(test, filter);

        naive.buildClassifier(train);

        for(int i=0; i<test2.numInstances(); i++) {
            System.out.println(test.instance(i));
            double index = naive.classifyInstance(test2.instance(i));
            String className = train.attribute(0).value((int)index);
            System.out.println(className);
        }
    }

The result shows that the data that should have been classified as spam is classified as ham, and the data that should have been classified as ham is classified as spam. What is the problem? Please help.

Recommended answer

Your code seems fine, though I have two comments to make.


  • First, you set the filter's format with the command filter.setInputFormat(train); so that the filter can be used to make the test and training data compatible. You should not change the format again with filter.setInputFormat(test); as this might create compatibility issues.
  • Also, instead of getting the first attribute with train.attribute(0).value((int)index); (which, it seems to me, does not correspond to the class attribute), try train.classAttribute().value((int)index);
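To illustrate the first point, here is a minimal plain-Java sketch of why a word dictionary rebuilt from the test data can be incompatible with a model trained on the training data. It is only a toy model of the idea and does not reproduce how Weka's StringToWordVector works internally: the class name VocabularyMismatchSketch and the helpers buildDictionary and vectorize are hypothetical, and the assumption that words are numbered in order of first appearance is made purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VocabularyMismatchSketch {

    // Toy dictionary: maps each word to a column index in order of first appearance.
    // (Assumption for illustration only; not Weka's actual algorithm.)
    public static Map<String, Integer> buildDictionary(List<String> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    dict.putIfAbsent(token, dict.size());
                }
            }
        }
        return dict;
    }

    // Encodes a document as a word-presence vector against a fixed dictionary.
    // Words that are not in the dictionary are ignored.
    public static int[] vectorize(String doc, Map<String, Integer> dict) {
        int[] row = new int[dict.size()];
        for (String token : doc.toLowerCase().split("[^a-z]+")) {
            Integer col = dict.get(token);
            if (col != null) {
                row[col] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> train = Arrays.asList(
                "good", "good", "very good", "bad",
                "very bad", "very bad, very bad", "good good bad");
        List<String> test = Arrays.asList(
                "good bad very bad", "good", "bad", "very good");

        Map<String, Integer> trainDict = buildDictionary(train);
        Map<String, Integer> testDict = buildDictionary(test);

        System.out.println(trainDict); // {good=0, very=1, bad=2}
        System.out.println(testDict);  // {good=0, bad=1, very=2}

        // Correct: the test document is encoded with the training dictionary.
        System.out.println(Arrays.toString(vectorize("bad", trainDict))); // [0, 0, 1]

        // Incorrect: the dictionary was rebuilt from the test set, so column 1
        // now means "bad" here, while the trained model reads column 1 as "very".
        System.out.println(Arrays.toString(vectorize("bad", testDict)));  // [0, 1, 0]
    }
}
```

In Weka terms, this corresponds to calling filter.setInputFormat(train) once and then passing both the training and the test set through Filter.useFilter with that same filter object, as the first comment recommends.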

P.S. See "Load naïve Bayes model in Java code using weka jar" for a complete workflow and explanation of a classification example (the material was once in SO Documentation). That example uses the LibLinear classifier, but the logic is the same.
