How to do text classification with label probabilities?


Question


I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities against all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems which predict whether a tweet is wind or hot or other. But in this problem, each training example has probabilities against all the labels. I have previously used the Mahout naive Bayes classifier, which takes a single specific label for a given text to build its model. How can I feed these per-label input probabilities into a classifier?

Answer


In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.


There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating the parameters of your model, instead of each word contributing a count of one to the class the document belongs to, it contributes a fractional count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture-of-multinomials model with EM, where the probabilities you have play the role of the membership/indicator variables for your instances.
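To make the fractional-count idea concrete, here is a minimal sketch of soft-label multinomial Naive Bayes in plain Python. The function names and data layout are illustrative (not from Mahout or any other library); each word's count for a class is weighted by the document's probability of belonging to that class, with Laplace smoothing on top.

```python
from collections import defaultdict
import math

def train_soft_naive_bayes(docs, label_probs, labels, alpha=1.0):
    """Estimate multinomial Naive Bayes parameters from soft labels.

    docs: list of token lists.
    label_probs: one dict per document, mapping label -> probability.
    Each word contributes a fractional count p(label | doc) to each
    class, instead of a count of 1 to a single hard-assigned class.
    """
    word_counts = {c: defaultdict(float) for c in labels}
    class_totals = {c: 0.0 for c in labels}
    class_mass = {c: 0.0 for c in labels}
    vocab = set()
    for tokens, probs in zip(docs, label_probs):
        for c in labels:
            p = probs.get(c, 0.0)
            class_mass[c] += p           # soft document count for the prior
            for w in tokens:
                word_counts[c][w] += p   # fractional word count
                class_totals[c] += p
                vocab.add(w)
    n_docs, V = len(docs), len(vocab)
    # Smoothed log-priors and log-conditionals.
    log_prior = {c: math.log((class_mass[c] + alpha) / (n_docs + alpha * len(labels)))
                 for c in labels}
    log_cond = {c: {w: math.log((word_counts[c][w] + alpha) /
                                (class_totals[c] + alpha * V))
                    for w in vocab}
                for c in labels}
    return log_prior, log_cond, vocab

def predict(tokens, log_prior, log_cond, labels):
    """Return the most probable label, ignoring out-of-vocabulary words."""
    scores = {}
    for c in labels:
        s = log_prior[c]
        for w in tokens:
            if w in log_cond[c]:
                s += log_cond[c][w]
        scores[c] = s
    return max(scores, key=scores.get)
```

With hard (one-hot) label probabilities this reduces exactly to standard Naive Bayes training, so it is a strict generalization of what Mahout's classifier does.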


Alternatively, if your classifier were a neural net with a softmax output, then instead of the target output being a vector with a single 1 and lots of zeros, the target output becomes the probability vector you're supplied with.
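The softmax variant is just cross-entropy against a soft target distribution. A small framework-free sketch (function names are illustrative) shows that the gradient with respect to the logits keeps the same `softmax − target` form as in the one-hot case:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against a soft target vector.

    With a one-hot target this is ordinary log-loss; here target may be
    any probability vector, e.g. [0.21 (hot), 0.79 (wind)].
    """
    p = softmax(logits)
    return -sum(t * math.log(q) for t, q in zip(target, p) if t > 0)

def grad_logits(logits, target):
    # d(loss)/d(logits) = softmax(logits) - target,
    # identical in form to the hard-label case.
    p = softmax(logits)
    return [q - t for q, t in zip(p, target)]
```

Because only the target vector changes, any network you would train on hard labels trains on these probability vectors without modifying the architecture or the update rule.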


I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.

