Classifying data with naive Bayes using LingPipe

Question

I want to classify data into different classes based on its content. I did this with a naive Bayes classifier, and the output is the best category for each item. Now I also want news that is not covered by the training set to be classified into an "others" class. I can't manually add every item outside the training data to such a class, because it would span a vast number of other categories. Is there any way to classify this other data?

import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

// Wrapper class name is chosen for illustration only.
public class TrainNewsClassifier {

    private static final File TRAINING_DIR = new File("4news-train");
    private static final File TESTING_DIR = new File("4news-test");
    private static final String[] CATEGORIES = { "c1", "c2", "c3", "others" };

    // Length of the character n-grams used by each category's language model.
    private static final int NGRAM_SIZE = 6;

    public static void main(String[] args) throws ClassNotFoundException, IOException {
        DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
        // Each category is trained from the files in TRAINING_DIR/<category>.
        for (int i = 0; i < CATEGORIES.length; ++i) {
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) {
                String msg = "Could not find training directory=" + classDir + "\nTraining directory not found";
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }

            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                // Pair the document text with its category and feed it to the classifier.
                Classification classification = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
                classifier.handle(classified);
            }
        }
    }
}


Recommended answer

Just serialize the object: write the trained classifier (the intermediate object) to a file, and that file is your model.

Then, for testing, you only need to load the model and pass the data to it. There is no need to train each time, which will make things much easier for you.
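
The train-once, save, and reload pattern can be sketched with plain Java serialization. This is only a minimal illustration, not LingPipe code: `ToyModel` is a hypothetical stand-in for a trained classifier (it just counts words per category), and `SerializeModelSketch`, `save`, and `load` are names made up for this example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a trained classifier: it only counts words per category.
class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Adds the words of one training text to the counts of the given category.
    void train(String category, String text) {
        Map<String, Integer> wordCounts = counts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\s+")) {
            wordCounts.merge(word, 1, Integer::sum);
        }
    }

    // Returns the category whose training words overlap most with the input text.
    String bestCategory(String text) {
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            int score = 0;
            for (String word : text.toLowerCase().split("\\s+")) {
                score += entry.getValue().getOrDefault(word, 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}

class SerializeModelSketch {
    // Writes the trained model to a file; this file is "the model".
    static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Reads a previously saved model back, so no retraining is needed.
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Training phase: done once.
        ToyModel model = new ToyModel();
        model.train("c1", "stocks market shares");
        model.train("c2", "football match goal");
        File modelFile = File.createTempFile("toy-model", ".ser");
        modelFile.deleteOnExit();
        save(model, modelFile);

        // Testing phase: load the saved model and classify new text.
        ToyModel restored = (ToyModel) load(modelFile);
        System.out.println(restored.bestCategory("the market fell"));
    }
}
```

With LingPipe itself, the same idea is usually expressed through its own helpers: as far as I recall, `AbstractExternalizable.compileTo(classifier, file)` writes a compiled model and `AbstractExternalizable.readObject(file)` reads it back, but check the Javadoc for your version before relying on those signatures.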
