Intent classification with a large number of intent classes

Question

I am working on a data set of approximately 3,000 questions and I want to perform intent classification. The data set is not labelled yet, but from the business perspective there is a requirement to identify approximately 80 different intent classes. Let's assume my training data has approximately equal numbers of examples per class and is not heavily skewed towards some of the classes. I intend to convert the text to word2vec or GloVe embeddings and then feed them into my classifier.

I am familiar with cases in which I have a smaller number of intent classes, such as 8 or 10, and with the choice of machine learning classifiers such as SVM, naive Bayes, or deep learning (CNN or LSTM).

My question is: if you have had experience with such a large number of intent classes before, which machine learning algorithm do you think will perform reasonably well? Do you think that if I use a deep learning framework, the large number of labels will still cause poor performance given the training data described above?

We need to start labelling the data, and it would be rather laborious to come up with 80 classes of labels only to realise that the model performs poorly. So I want to make sure I am making the right decision about the maximum number of intent classes I should consider, and which machine learning algorithm would you suggest?

Thanks in advance...

Answer

First, word2vec and GloVe are almost dead. You should probably consider using more recent embeddings like BERT or ELMo (both of which are sensitive to context; in other words, you get different embeddings for the same word depending on its context). Currently, BERT is my own preference, since it's completely open-source and available (GPT-2 was released a couple of days ago and is apparently a little better, but it's not fully available to the public).
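
To make "sensitive to context" concrete, here is a minimal sketch, assuming the Hugging Face `transformers` package and the `bert-base-uncased` checkpoint (my choices for illustration, not named in the answer), that prints how similar BERT's vectors for the word "bank" are in two different sentences:

```python
# Minimal contextual-embedding demo (assumed setup: `transformers` + PyTorch).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` (first occurrence) in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embed_word("i deposited cash at the bank .", "bank")
b = embed_word("we sat on the bank of the river .", "bank")
print(torch.cosine_similarity(a, b, dim=0))  # noticeably below 1.0
```

With static embeddings such as word2vec or GloVe, both occurrences of "bank" would map to exactly the same vector.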

Second, when you use BERT's pre-trained embeddings, your model has the advantage of having already seen a massive (Google-scale) amount of text, so it can be trained on small amounts of data, which will improve its performance drastically.
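
One common way to exploit those pre-trained weights is to fine-tune the whole model on the labelled questions. A minimal sketch, assuming the Hugging Face `transformers` library and PyTorch (neither is named in the answer); `train_texts`, `train_labels`, and the hyperparameters are placeholders for your own ~3,000 questions and ~80 intents:

```python
# Minimal fine-tuning sketch (assumed setup: `transformers` + PyTorch).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

NUM_INTENTS = 80  # the ~80 intent classes from the question

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_INTENTS)

# Placeholders: substitute your labelled questions and intent ids here.
train_texts = ["how do i reset my password", "where is my invoice"]
train_labels = [12, 7]

enc = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs usually suffice when fine-tuning
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, attention_mask=attention_mask,
                     labels=labels).loss
        loss.backward()
        optimizer.step()
```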

Finally, if you can group your intents into some coarse-grained classes, you can train one classifier to specify which coarse-grained class your instance belongs to, and then, for each coarse-grained class, train another classifier to specify the fine-grained one. This hierarchical structure will probably improve the results. As for the type of classifier, I believe a simple fully connected layer on top of BERT would suffice.
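
A hedged sketch of that two-level hierarchy in PyTorch; the coarse-to-fine grouping and the 768-dimensional BERT [CLS] input are illustrative assumptions, not a prescribed design:

```python
# Toy two-level hierarchy: a coarse head routes each question to a group,
# and a per-group head picks the fine-grained intent within that group.
# The grouping below is illustrative; with 80 intents you might use
# roughly 8 coarse groups of ~10 intents each.
import torch
import torch.nn as nn

COARSE_TO_FINE = {0: [0, 1, 2], 1: [3, 4], 2: [5, 6, 7]}  # coarse id -> global intent ids

class Head(nn.Module):
    """The 'simple fully connected layer on top of BERT' from the answer."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) BERT [CLS] vectors
        return self.fc(pooled)

coarse_head = Head(768, len(COARSE_TO_FINE))
fine_heads = nn.ModuleDict(
    {str(c): Head(768, len(fines)) for c, fines in COARSE_TO_FINE.items()})

def predict(pooled: torch.Tensor) -> int:
    """Route a single encoded question through the hierarchy."""
    coarse = coarse_head(pooled).argmax(dim=-1).item()
    local = fine_heads[str(coarse)](pooled).argmax(dim=-1).item()
    return COARSE_TO_FINE[coarse][local]  # map back to the global intent id

pooled = torch.randn(1, 768)  # stand-in for a real BERT [CLS] embedding
print(predict(pooled))
```

One design caveat: an error in the coarse classifier cannot be recovered at the fine level, so the coarse groups should be chosen so that they are easy to separate.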
