Java text classification problem


Problem description

I have a set of Book objects. The class Book is defined as follows:

import java.util.ArrayList;

class Book {
    String title;               // e.g. "Javascript for dummies"
    ArrayList<String> taglist;  // tags are treated as plain strings here
}

title is the title of the book, for example: Javascript for dummies.

taglist is the list of tags for that book, for example: Javascript, jquery, "web dev", ...

As I said, the set contains books about different subjects: IT, BIOLOGY, HISTORY, ... Each book has a title and a set of tags describing it.

I have to automatically classify those books into separate sets by topic, for example:

IT BOOKS :

  • Java for dummies
  • Javascript for dummies
  • Learn flash in 30 days
  • C++ programming

HISTORY BOOKS :

  • World wars
  • America in 1960
  • Martin Luther King's life

BIOLOGY BOOKS :

  • ....

Do you know of a classification algorithm or method that applies to this kind of problem?

One solution is to use an external API to determine the category of a text, but the problem is that the books are in different languages: French, Spanish, English, ...

Solution

This looks like a reasonably straightforward keyword-based classification task. Since you're using Java, good packages to consider for this would be Classifier4J, Weka, or Lucene Mahout.
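To make "keyword-based" concrete before looking at the libraries, here is a minimal plain-Java sketch that does not use any of the packages named above. The class name KeywordTopicClassifier, the topic names, and the keyword lists are all hypothetical and chosen for illustration only. It assumes each topic can be described by a hand-written keyword list, which may mix languages, and it simply counts how many of a book's tags appear in each list.

```java
import java.util.*;

// Hypothetical helper, not part of Classifier4J, Weka, or Mahout.
class KeywordTopicClassifier {
    // Topic name -> normalized keywords; insertion order is kept so ties
    // are resolved in favor of the topic that was registered first.
    private final Map<String, Set<String>> keywordsByTopic = new LinkedHashMap<>();

    void addTopic(String topic, String... keywords) {
        Set<String> set = new HashSet<>();
        for (String keyword : keywords) {
            set.add(normalize(keyword));
        }
        keywordsByTopic.put(topic, set);
    }

    // Lower-case and trim so "Javascript" and "javascript " match.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the topic whose keyword list contains the most tags,
    // or "UNKNOWN" when no tag matches any topic.
    String classify(List<String> tags) {
        String best = "UNKNOWN";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : keywordsByTopic.entrySet()) {
            int score = 0;
            for (String tag : tags) {
                if (entry.getValue().contains(normalize(tag))) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        KeywordTopicClassifier classifier = new KeywordTopicClassifier();
        // Illustrative keyword lists; a real list would need to be curated.
        classifier.addTopic("IT", "javascript", "jquery", "web dev", "java", "programming");
        classifier.addTopic("HISTORY", "war", "guerre", "guerra", "history");

        System.out.println(classifier.classify(Arrays.asList("Javascript", "jquery", "web dev"))); // prints IT
        System.out.println(classifier.classify(Arrays.asList("Guerra", "siglo XX")));              // prints HISTORY
    }
}
```

Because it matches tags rather than free text, this approach only handles several languages if the keyword lists include the tags used in each language. The learned models discussed below avoid hand-written lists but need labeled example books instead.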

Classifier4J

Classifier4J supports classification using naive Bayes and a vector space model.
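As a rough illustration of what a naive Bayes classifier does with book tags, here is a small self-contained sketch in plain Java. It is not Classifier4J's API: the class TagNaiveBayes and its methods are hypothetical names introduced for this example. It assumes a multinomial model over tags with add-one (Laplace) smoothing and that the classifier has been trained with at least one tagged book before classify is called.

```java
import java.util.*;

// Hypothetical multinomial naive Bayes over tags; not Classifier4J's API.
class TagNaiveBayes {
    private final Map<String, Integer> booksPerTopic = new HashMap<>();
    private final Map<String, Map<String, Integer>> tagCountsPerTopic = new HashMap<>();
    private final Map<String, Integer> totalTagsPerTopic = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalBooks = 0;

    private static String normalize(String tag) {
        return tag.trim().toLowerCase(Locale.ROOT);
    }

    // Record one labeled book: its topic and its tags.
    void train(String topic, List<String> tags) {
        totalBooks++;
        booksPerTopic.merge(topic, 1, Integer::sum);
        Map<String, Integer> counts =
                tagCountsPerTopic.computeIfAbsent(topic, t -> new HashMap<>());
        for (String raw : tags) {
            String tag = normalize(raw);
            counts.merge(tag, 1, Integer::sum);
            totalTagsPerTopic.merge(topic, 1, Integer::sum);
            vocabulary.add(tag);
        }
    }

    // Returns the topic with the highest log-probability,
    // or null if nothing has been trained yet.
    String classify(List<String> tags) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String topic : booksPerTopic.keySet()) {
            // log P(topic)
            double logProb = Math.log(booksPerTopic.get(topic) / (double) totalBooks);
            Map<String, Integer> counts = tagCountsPerTopic.get(topic);
            int totalTags = totalTagsPerTopic.getOrDefault(topic, 0);
            for (String raw : tags) {
                int count = counts.getOrDefault(normalize(raw), 0);
                // log P(tag | topic) with add-one smoothing
                logProb += Math.log((count + 1.0) / (totalTags + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TagNaiveBayes nb = new TagNaiveBayes();
        nb.train("IT", Arrays.asList("java", "programming"));
        nb.train("IT", Arrays.asList("javascript", "jquery", "web dev"));
        nb.train("HISTORY", Arrays.asList("war", "20th century"));
        nb.train("HISTORY", Arrays.asList("usa", "1960", "history"));

        System.out.println(nb.classify(Arrays.asList("Javascript", "programming"))); // prints IT
        System.out.println(nb.classify(Arrays.asList("war")));                       // prints HISTORY
    }
}
```

Working in log space avoids numeric underflow when a book has many tags, and the smoothing keeps a single unseen tag from driving a topic's probability to zero.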

As seen in this source code snippet on training and scoring using its naive Bayes classifier, the package is reasonably easy to use. It's also distributed under the liberal Apache Software License.

Weka

Weka is a very popular data mining tool. An advantage of using it is that you can readily experiment with many different machine learning models for categorizing the books into topics, including naive Bayes, decision trees, support vector machines, k-nearest neighbor, logistic regression, and even a rule-set-based learner.
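To show what one of these models looks like when applied to tag lists, here is a minimal plain-Java sketch of k-nearest neighbor. It deliberately does not use Weka's API. The class TagKnn, its methods, and the choice of Jaccard similarity between tag sets are assumptions made for this illustration, not something the answer or Weka prescribes.

```java
import java.util.*;

// Hypothetical k-nearest-neighbor classifier over tag sets; not Weka's API.
class TagKnn {
    private static final class Example {
        final String topic;
        final Set<String> tags;
        Example(String topic, Set<String> tags) {
            this.topic = topic;
            this.tags = tags;
        }
    }

    private final List<Example> examples = new ArrayList<>();

    private static Set<String> normalize(String... tags) {
        Set<String> set = new HashSet<>();
        for (String tag : tags) {
            set.add(tag.trim().toLowerCase(Locale.ROOT));
        }
        return set;
    }

    // Store one labeled book.
    void add(String topic, String... tags) {
        examples.add(new Example(topic, normalize(tags)));
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return intersection.size() / (double) unionSize;
    }

    // Majority vote among the k most similar stored books;
    // returns null when no examples have been added.
    String classify(int k, String... tags) {
        Set<String> query = normalize(tags);
        List<Example> sorted = new ArrayList<>(examples);
        // Most similar first; the sort is stable, so ties keep insertion order.
        sorted.sort(Comparator.comparingDouble(
                (Example e) -> jaccard(query, e.tags)).reversed());
        Map<String, Integer> votes = new HashMap<>();
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(e.topic, 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        TagKnn knn = new TagKnn();
        knn.add("IT", "java", "programming");
        knn.add("IT", "javascript", "jquery", "web dev");
        knn.add("IT", "c++", "programming");
        knn.add("HISTORY", "war", "history");
        knn.add("HISTORY", "usa", "1960", "history");

        System.out.println(knn.classify(3, "programming", "javascript")); // prints IT
        System.out.println(knn.classify(3, "history", "war"));            // prints HISTORY
    }
}
```

With a toolkit such as Weka, swapping this model for a decision tree or a support vector machine is mostly a matter of choosing a different classifier class, which is the experimentation advantage described above.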

You'll find a tutorial on using Weka for text categorization here.

Weka is, however, distributed under the GPL, so you won't be able to use it in closed-source software that you want to distribute. You could still use it to back a web service.

Lucene Mahout

Mahout is designed for doing machine learning on very large datasets. It's built on top of Apache Hadoop and supports supervised classification using naive Bayes.

You'll find a tutorial covering how to use Mahout for text classification here.

Like Classifier4J, Mahout is distributed under the liberal Apache Software License.
