How to create training data for text classification on 4 categories

Question

My machine learning goal is to find potential risks (items that will cost more money) and opportunities (items that will save money) in a project requirements document.

My idea is to classify each sentence in the data into one of these categories: Risk, Opportunity, or Irrelevant (neither a risk nor an opportunity; the default category).

I will use a multinomial naive Bayes classifier with tf-idf features for this.

Now I need data for my training set and test set. My plan is to label every sentence in the requirements documents with one of the 3 categories. Is this a good approach?

Or should I only label sentences that are obviously a risk, an opportunity, or irrelevant?

Also, is the Irrelevant category a good idea?

Recommended answer

I believe the three-class approach is a good one. This is similar to sentiment analysis, where you typically have positive, negative, and neutral documents (or sentences). The neutral class makes up the vast majority of the instances, so your classification problem will be imbalanced. That is not necessarily an issue, but for a difficult problem like this one, a naive Bayes classifier might simply put everything in the neutral/irrelevant bucket, because the prior for that class will be quite high.
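
The pull of a high prior can be seen in a toy calculation. The sketch below is only an illustration: the class priors and per-word probabilities are made-up numbers, and `nb_predict` is a hypothetical helper written for this example, not part of any library.

```python
import math

def nb_predict(priors, likelihoods, tokens):
    """Pick the class with the highest log prior + summed log word likelihoods."""
    scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:
            score += math.log(likelihoods[cls][tok])
        scores[cls] = score
    return max(scores, key=scores.get), scores

# Hypothetical per-class word probabilities (assumed values, not measured).
likelihoods = {
    "risk":        {"penalty": 0.03,  "delivery": 0.02},
    "opportunity": {"penalty": 0.005, "delivery": 0.02},
    "irrelevant":  {"penalty": 0.01,  "delivery": 0.02},
}
tokens = ["penalty", "delivery"]

# Same evidence, two different sets of class priors.
balanced = {"risk": 1 / 3, "opportunity": 1 / 3, "irrelevant": 1 / 3}
skewed = {"risk": 0.04, "opportunity": 0.03, "irrelevant": 0.93}

label_balanced, _ = nb_predict(balanced, likelihoods, tokens)
label_skewed, _ = nb_predict(skewed, likelihoods, tokens)

print("balanced priors ->", label_balanced)
print("skewed priors   ->", label_skewed)
```

With these assumed numbers, the word "penalty" is three times more likely under Risk than under Irrelevant, yet the skewed prior still outweighs it. Stronger, more discriminative features are what let the minority classes win.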

  • Your sampling (labeling) should be representative of reality. Don't try to create a dataset of 1000 Risk, 1000 Opportunity, and 1000 Irrelevant sentences. Instead, take a sample of, say, 10000 requirements and assign the proper label to each, even if that means having many more 'Irrelevant' than 'Risk' instances.
  • Text classification models require many instances, since the search space is vast. I wonder if you have considered that to get reliable results (say, over 90% accuracy), you may need to manually label thousands of instances.
  • Even if you have thousands of training instances, your problem looks particularly difficult, unless there are some obvious keywords that trigger 'risk' or 'opportunity' that I am not aware of. Ask yourself: would this be easy for a human to judge? If you asked 3 judges to classify your requirements, would they all come up with the same answer? If not, you might need tens of thousands of training documents, and the classification accuracy may still be disappointing.
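
One way to put a number on the "3 judges" question is an inter-annotator agreement statistic. The sketch below computes Fleiss' kappa, a standard measure for a fixed number of raters per item; this is a suggested check, not something the answer prescribes. The `fleiss_kappa` function and the sample labels are hypothetical, written for illustration only.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of judges."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: fraction of agreeing judge pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1.0:  # every judge used a single category throughout
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

CATEGORIES = ["risk", "opportunity", "irrelevant"]

# Made-up labels from 3 judges for 5 requirement sentences.
sample_ratings = [
    ["irrelevant", "irrelevant", "irrelevant"],
    ["risk", "risk", "irrelevant"],
    ["opportunity", "irrelevant", "risk"],
    ["irrelevant", "irrelevant", "irrelevant"],
    ["risk", "risk", "risk"],
]

kappa = fleiss_kappa(sample_ratings, CATEGORIES)
print(f"Fleiss' kappa: {kappa:.3f}")
```

A value near 1 means the judges agree far more often than chance would predict; a value near 0 means their agreement is about what chance alone would give. Low agreement on a pilot sample is a warning that the labels themselves are ambiguous, which no amount of extra training data will fully fix.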
