Text Classification into Categories

Problem description

I am working on a text classification problem: I am trying to classify a collection of words into categories. Yes, there are plenty of libraries available for classification, so please don't answer if you are only going to suggest using one of them.

Let me explain what I want to implement, using an example.

List of words:

  1. java
  2. programming
  3. language
  4. c-sharp

List of categories:

  1. java
  2. c-sharp

Here we train the set as follows:

  1. java maps to category 1 (java)
  2. programming maps to category 1 (java)
  3. programming maps to category 2 (c-sharp)
  4. language maps to category 1 (java)
  5. language maps to category 2 (c-sharp)
  6. c-sharp maps to category 2 (c-sharp)

Now suppose we have the phrase "The best java programming book". From this phrase, the following words match our list of words:

  1. java
  2. programming

"programming" has two mapped categories, "java" and "c-sharp", so it is a common word.

"java" is mapped to category "java" only.

So the matching category for the phrase is "java".
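
The scheme described above can be sketched in a few lines of Python without any classification library. This is only a minimal sketch under some assumptions that the description does not spell out: the names `WORD_TO_CATEGORIES` and `classify` are hypothetical, phrases are split on whitespace, and a word shared by several categories is given a weight of 1 divided by the number of categories it maps to, so that a "common word" counts for less than a word unique to one category.

```python
from collections import Counter

# Training data from the example: each word maps to one or more categories.
WORD_TO_CATEGORIES = {
    "java": {"java"},
    "programming": {"java", "c-sharp"},
    "language": {"java", "c-sharp"},
    "c-sharp": {"c-sharp"},
}


def classify(phrase, word_to_categories=WORD_TO_CATEGORIES):
    """Return the best-scoring categories for a phrase (a list, to expose ties)."""
    scores = Counter()
    for token in phrase.lower().split():
        categories = word_to_categories.get(token)
        if not categories:
            continue  # word is not in the trained word list
        # Assumed weighting: a word shared by n categories adds 1/n to each,
        # so common words matter less than category-specific words.
        weight = 1.0 / len(categories)
        for category in categories:
            scores[category] += weight
    if not scores:
        return []  # no known word in the phrase
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if s == best)


print(classify("The best java programming book"))
```

For the example phrase, "java" contributes 1.0 to the java category and "programming" contributes 0.5 to each category, so java wins. A phrase made only of common words, such as "best programming language", ends in a tie, which is one weakness of the approach worth keeping in mind.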

This is what came to my mind. Is this solution fine? Can it be implemented? What are your suggestions, and is there anything I am missing, any flaws, etc.?

Solution

Of course this can be implemented. If you train a Naive Bayes classifier or linear SVM on the right dataset (titles of Java and C# programming books, I guess), it should learn to associate the term "Java" with Java, "C#" and ".NET" with C#, and "programming" with both. I.e., a Naive Bayes classifier would likely learn a roughly even probability of Java or C# for common terms like "programming" if the dataset is divided evenly.
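
Since the question asks for an approach that does not rely on a library, here is a rough sketch of a multinomial Naive Bayes classifier written by hand. It is an illustration under stated assumptions, not a definitive implementation: the four training titles in `TITLES` are made up for the example, the function names `train` and `posteriors` are hypothetical, tokens are split on whitespace, and add-one (Laplace) smoothing is assumed so that unseen word and category pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict

# Made-up, evenly divided training titles (an assumption for illustration only).
TITLES = [
    ("java programming language basics", "java"),
    ("effective java programming", "java"),
    ("c-sharp programming language basics", "c-sharp"),
    ("effective c-sharp programming", "c-sharp"),
]


def train(labeled_titles):
    """Count words per category, titles per category, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for title, category in labeled_titles:
        doc_counts[category] += 1
        for token in title.lower().split():
            word_counts[category][token] += 1
            vocabulary.add(token)
    return word_counts, doc_counts, vocabulary


def posteriors(phrase, model):
    """Return P(category | phrase) using add-one smoothed word likelihoods."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category in doc_counts:
        log_p = math.log(doc_counts[category] / total_docs)  # class prior
        total_words = sum(word_counts[category].values())
        for token in phrase.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][token]
            log_p += math.log((count + 1) / (total_words + len(vocabulary)))
        log_scores[category] = log_p
    # Convert log scores to normalized probabilities.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {c: v / norm for c, v in unnormalized.items()}


model = train(TITLES)
print(posteriors("programming", model))
print(posteriors("The best java programming book", model))
```

With this evenly divided toy dataset, "programming" alone gets an even probability for both categories, while the phrase containing "java" leans toward the java category (about 0.75 versus 0.25 here). The exact numbers depend entirely on the training data and the smoothing choice.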
