Text classification methods? SVM and decision tree

Question

I have a training set, and I want to use a classification method to classify other documents according to it. My documents are news articles, and the categories are sports, politics, economics and so on.

I understand Naive Bayes and KNN completely, but SVM and decision trees are still vague to me. I don't know whether I can implement these methods myself, or whether there are existing applications that provide them.

What is the best method for classifying documents in this way?

Thanks!

Recommended answer


  • Naive Bayes

    Although this is the simplest algorithm and it treats every feature as independent, it works very well in real text classification. I would definitely try this algorithm first.
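To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The function names (`train_nb`, `predict_nb`), the whitespace tokenization, the add-one smoothing and the toy news headlines are all assumptions chosen for illustration, not something taken from the original answer.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Collect the counts needed for prediction."""
    class_docs = Counter()               # number of training documents per class
    word_counts = defaultdict(Counter)   # token counts per class
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()    # naive whitespace tokenization (assumption)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior under the independence assumption."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # skip tokens never seen during training
            # add-one (Laplace) smoothing so unseen class/token pairs are not zero
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set
train = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the election bill", "politics"),
    ("the minister won the vote", "politics"),
    ("markets fell as inflation rose", "economy"),
    ("the central bank raised interest rates", "economy"),
]
model = train_nb(train)
print(predict_nb(model, "the bank cut interest rates"))
```

Each token contributes its own log-probability term independently of the others, which is exactly the "everything is independent" simplification described above.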


  • KNN

    KNN (k-nearest neighbors) is a classification method, not a clustering one, so it is a valid candidate here. The algorithm used for clustering is k-means; take care not to confuse the two.
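For reference, k-nearest neighbors is normally used as a supervised classifier: a new document takes the majority label of its k most similar labeled documents. The sketch below assumes bag-of-words vectors with cosine similarity; the helper names and the toy data are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts from whitespace tokens (an assumption for brevity)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda item: cosine(query, vectorize(item[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set
train = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the election bill", "politics"),
    ("the minister won the vote", "politics"),
    ("markets fell as inflation rose", "economy"),
    ("the central bank raised interest rates", "economy"),
]
print(knn_predict(train, "striker scored a goal in the match"))
```

Because the labels of the neighbors are required, this is classification; a clustering method such as k-means would group the documents without using any labels.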


  • SVM

    SVM comes in two variants: SVC for classification and SVR for regression. It sometimes works well, but in my experience it performed poorly on text classification, because it depends heavily on a good tokenizer (filter), and the dataset's dictionary always contains dirty tokens. The accuracy I got was really bad.
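The tokenizer (filter) step mentioned above could look roughly like the following sketch. The stop-word list, the length cutoff and the `min_doc_freq` threshold are illustrative assumptions, not recommended values; the idea is only to show how dirty tokens can be kept out of the dictionary before any classifier sees them.

```python
import re
from collections import Counter

# Illustrative stop-word list, far from exhaustive (assumption)
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, keep alphabetic runs only, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

def build_vocabulary(docs, min_doc_freq=2):
    """Keep only tokens that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))   # count each token once per document
    return {tok for tok, n in doc_freq.items() if n >= min_doc_freq}

print(tokenize("The Striker scored 2 goals!!!"))
print(build_vocabulary(["The striker scored", "striker injured &nbsp;", "goal scored xx1"]))
```

Dropping tokens that appear in only one document removes much of the markup debris and typos, at the cost of also discarding some rare but legitimate words.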


  • Random Forest (decision tree)

    I have never tried this method for text classification. I think a decision tree needs a few key nodes, and it is hard to find "a few key tokens" for text classification. Random forests also tend to work badly on high-dimensional sparse data.
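To illustrate what a "key token" means for a decision tree, the sketch below scores each token by information gain, one common criterion for choosing the split at a node. The function names and the four toy documents are hypothetical, and a real tree would repeat this selection recursively on each branch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, token):
    """docs: list of (set_of_tokens, label). Gain from splitting on presence of token."""
    labels = [label for _, label in docs]
    with_tok = [label for toks, label in docs if token in toks]
    without = [label for toks, label in docs if token not in toks]
    # weighted entropy of the two branches; empty branches contribute nothing
    remainder = sum(len(part) / len(docs) * entropy(part)
                    for part in (with_tok, without) if part)
    return entropy(labels) - remainder

def best_split_token(docs):
    """Token whose presence/absence split gives the highest information gain."""
    vocab = set().union(*(toks for toks, _ in docs))
    return max(sorted(vocab), key=lambda t: information_gain(docs, t))

# Hypothetical toy documents, already tokenized
docs = [
    ({"team", "won", "match"}, "sports"),
    ({"team", "scored", "goal"}, "sports"),
    ({"minister", "won", "vote"}, "politics"),
    ({"parliament", "passed", "bill"}, "politics"),
]
print(best_split_token(docs))
```

On real news text, few single tokens separate the classes this cleanly, which is the difficulty the answer describes.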

FYI

These all come from my own experience. For your case, there is no better way to decide which method to use than to try each algorithm and see which one fits your data.
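One way to "try every algorithm" is to train each candidate on the same training set and compare accuracy on held-out documents. The sketch below assumes two deliberately simple stand-in classifiers (a majority-class baseline and a token-overlap rule) so that it stays self-contained; in practice the candidates would be Naive Bayes, KNN, SVM and so on. All names and data are hypothetical.

```python
from collections import Counter

def accuracy(predict, test_set):
    """Fraction of held-out documents whose predicted label matches the true one."""
    correct = sum(1 for text, label in test_set if predict(text) == label)
    return correct / len(test_set)

def make_majority(train):
    """Baseline: always predict the most frequent training label."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: top

def make_overlap(train):
    """Predict the label whose training tokens overlap most with the document."""
    vocab = {}
    for text, label in train:
        vocab.setdefault(label, set()).update(text.lower().split())
    def predict(text):
        tokens = set(text.lower().split())
        return max(sorted(vocab), key=lambda label: len(tokens & vocab[label]))
    return predict

# Hypothetical toy data: a training set and a separate held-out test set
train = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the fans cheered the team", "sports"),
    ("parliament passed the election bill", "politics"),
    ("the minister won the vote", "politics"),
]
test = [
    ("the team scored a goal", "sports"),
    ("parliament held a vote", "politics"),
]

candidates = {"majority": make_majority, "overlap": make_overlap}
results = {name: accuracy(make(train), test) for name, make in candidates.items()}
best = max(results, key=results.get)
print(results, best)
```

Keeping the test documents out of training is the important part; with a small corpus, cross-validation would give a more stable comparison than a single split.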

Apache Mahout is a great tool for machine learning algorithms. It covers three areas: recommendation, clustering, and classification. You could try this library, but you will have to learn some Hadoop basics first.

For machine learning there is also Weka, a software toolkit for experiments that integrates many algorithms.
