Text classification/categorization algorithm

Problem description

My objective is to [semi-]automatically assign texts to different categories. There is a set of user-defined categories and a set of texts for each category. The ideal algorithm should be able to learn from a human-defined classification and then classify new texts automatically. Can anybody suggest such an algorithm, and perhaps a .NET library that implements it?

Recommended answer

Doing this is not trivial. Obviously, you can build a dictionary that maps certain keywords to categories; simply finding a keyword in a text would then suggest a certain category.
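As a rough illustration of the dictionary idea, here is a minimal Python sketch. The keyword table and the function name `suggest_categories` are made up for this example; a real table would come from whoever defines the categories.

```python
import re
from collections import Counter

# Hypothetical keyword-to-category dictionary; the entries are illustrative only.
KEYWORD_CATEGORIES = {
    "guitar": "music",
    "concert": "music",
    "compiler": "programming",
    "debugger": "programming",
    "seedling": "gardening",
}

def suggest_categories(text):
    """Count keyword hits per category and return categories ordered by hit count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter(KEYWORD_CATEGORIES[t] for t in tokens if t in KEYWORD_CATEGORIES)
    return [category for category, _ in hits.most_common()]

print(suggest_categories("The concert opened with a guitar solo."))
```

An exact-match lookup like this only fires when the text contains the keyword in precisely the form stored in the table.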

Yet in natural-language text, keywords usually do not appear in their stem form. You would need morphology tools to find the stem form and look that up in the dictionary.
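To make the stemming step concrete, the sketch below uses a deliberately crude suffix stripper. It is an assumption for illustration only, not a real morphological analyser; established stemmers (the Porter stemmer, for instance) handle far more cases. The names `crude_stem`, `STEM_CATEGORIES`, and `categories_for` are hypothetical.

```python
import re

# Suffixes are tried in this order; longer ones come before their shorter tails.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# The dictionary is keyed by stems, so every inflected form maps to the same entry.
STEM_CATEGORIES = {
    crude_stem("compiler"): "programming",
    crude_stem("guitar"): "music",
}

def categories_for(text):
    """Return the set of categories whose keyword stems occur in the text."""
    stems = (crude_stem(t) for t in re.findall(r"[a-z]+", text.lower()))
    return {STEM_CATEGORIES[s] for s in stems if s in STEM_CATEGORIES}

print(categories_for("She was compiling while guitars played"))
```

The point is only that both the dictionary keys and the text tokens pass through the same normalisation before they are compared.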

But then somebody could write something like "This article is not about ...". This introduces the need for syntactic and semantic analysis.
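Real syntactic and semantic analysis is well beyond a short snippet, but a common stopgap is to ignore keywords that appear shortly after a negation word. The sketch below assumes a fixed window of three tokens and a small hand-picked negation list; both are arbitrary choices for illustration, and contractions such as "isn't" are not handled.

```python
import re

NEGATIONS = {"not", "no", "never", "without"}  # assumed, far from complete
NEGATION_WINDOW = 3  # assumed scope: how many tokens after a negation are affected

def non_negated_keywords(text, keywords):
    """Return keywords found in the text that do not closely follow a negation word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    last_negation = None  # index of the most recent negation word, if any
    for i, token in enumerate(tokens):
        if token in NEGATIONS:
            last_negation = i
            continue
        negated = last_negation is not None and i - last_negation <= NEGATION_WINDOW
        if token in keywords and not negated:
            found.append(token)
    return found

print(non_negated_keywords("This article is not about music", {"music"}))
print(non_negated_keywords("This article is about music", {"music"}))
```

A window like this misfires on longer sentences and on double negation, which is exactly why the answer points to proper syntax analysis.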

And then you would find that certain keywords can occur in several categories: "band" could be used in music, technology, or even handicraft work. You would therefore need an ontology, plus statistical or other methods, to weigh the probability of each category when the choice is not clear-cut.
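One statistical method that fits this description is a multinomial naive Bayes classifier, which learns word frequencies per category from human-labelled texts and then weighs the categories for a new text. The answer does not name a specific method, so take this as one possible sketch; the class name, the training sentences, and the use of add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def scores(self, text):
        """Return log P(category) + sum of log P(word | category) per category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = [t for t in tokenize(text) if t in self.vocabulary]  # skip unseen words
        result = {}
        for category, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[category].values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[category][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            result[category] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)

# Made-up training texts in which "band" occurs in both categories.
classifier = NaiveBayesClassifier()
classifier.train("the band played a loud concert", "music")
classifier.train("the band released a new album", "music")
classifier.train("sew the fabric band onto the sleeve", "handicraft")
classifier.train("glue a paper band around the box", "handicraft")

print(classifier.classify("the band played a new album"))
print(classifier.classify("glue the paper band"))
```

Because "band" is equally common in both training sets, the surrounding words decide the outcome, which is the weighing the paragraph above asks for. The log scores from `scores` can also be shown to a human reviewer when the top two categories are close, which suits the semi-automatic workflow in the question.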

Some keywords might not even fit easily into an ontology: is a mathematician closer to a programmer or to a gardener? But you said in your question that the categories are built by humans, so those people could also help build the ontology.

Have a look at computational linguistics here and on Wikipedia for further study.

Now, the narrower the field your texts come from, the more structured they are, and the smaller the vocabulary, the easier the problem becomes.

Again, some keywords for further study: morphology, syntax analysis, semantics, ontology, computational linguistics, indexing, keywording.