Tweet classification into multiple categories on (Unsupervised data/tweets)

Question

I want to classify tweets into predefined categories (such as sports, health, and 10 more). If I had labeled data, I could do the classification by training Naive Bayes or an SVM, as described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf

But I cannot figure out a way to do this with unlabeled data. One possibility would be to use Expectation-Maximization to generate clusters and then label those clusters. But as I said earlier, I have a predefined set of classes, so clustering won't work as well.

Can anyone guide me on which techniques I should follow? I appreciate any help.

Recommended answer

Alright, from what I understand, there are multiple ways to approach this case. There will be trade-offs, and the accuracy rate may vary because of the well-known fact and observation

(unless you are extracting data from the Twitter stream API based on tags and other keywords). Please define the source of your data and how you are extracting it. I am assuming you are just getting general tweets, which can be about anything.

What you can do is generate a dictionary for each class you have (e.g. Music => pop, jazz, rap, instruments ...) containing words relevant to that class. You can use NLTK for Python or Stanford NLP for other languages.

You can start by extracting:

  • Synonyms
  • Hyponyms
  • Hypernyms
  • Meronyms
  • Holonyms
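As a rough sketch of this extraction step, the snippet below expands a few seed words per class through WordNet. It assumes NLTK is installed and the WordNet corpus has been downloaded (`nltk.download('wordnet')`). The function names `wordnet_related` and `build_class_dictionary` are made up for illustration, and restricting meronyms and holonyms to the "part" relations is a simplification, not something the answer prescribes.

```python
from typing import Callable, Iterable, Set


def wordnet_related(word: str) -> Set[str]:
    """Collect lemma names of a word's synsets and their direct WordNet neighbours."""
    # Imported lazily so the rest of the sketch works without NLTK installed.
    from nltk.corpus import wordnet as wn

    related: Set[str] = set()
    for synset in wn.synsets(word):
        neighbours = (
            [synset]                   # synonyms (lemmas of the synset itself)
            + synset.hyponyms()        # more specific terms
            + synset.hypernyms()       # more general terms
            + synset.part_meronyms()   # parts of the concept (assumption: part relation only)
            + synset.part_holonyms()   # wholes the concept belongs to
        )
        for neighbour in neighbours:
            for lemma in neighbour.lemmas():
                related.add(lemma.name().replace("_", " ").lower())
    return related


def build_class_dictionary(
    seeds: Iterable[str], lookup: Callable[[str], Set[str]]
) -> Set[str]:
    """Union of the seed words and everything the lookup returns for them."""
    vocabulary: Set[str] = set()
    for seed in seeds:
        seed = seed.lower()
        vocabulary.add(seed)
        vocabulary |= lookup(seed)
    return vocabulary
```

With WordNet available, something like `build_class_dictionary(["pop", "jazz", "rap"], wordnet_related)` would give a first draft of a "music" dictionary. Expect it to need manual pruning, since WordNet also returns unrelated senses of ambiguous words.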

Go see these NLP lexical semantics slides. They will surely clear up some of the concepts.

Once you have a dictionary for each class, cross-compare them with the tweets you have. Whichever class a tweet is most similar to (you can rank the classes by the number of occurrences of words from their dictionaries), you can label the tweet with that class. This way your tweets end up labeled like any other labeled data. Now the question is accuracy! That depends on your data and on how versatile your classes are. This may be overkill, but it may come close to what you want.
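The dictionary comparison could look roughly like the following. This is only a minimal sketch: the tokenizer, the `min_hits` threshold, and the names `score_tweet` and `label_tweet` are assumptions for illustration, and ties are simply resolved in favour of whichever class comes first in the mapping.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Set

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def score_tweet(tweet: str, dictionaries: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, per class, how many tweet tokens appear in that class's dictionary."""
    counts = Counter(tokenize(tweet))
    return {
        label: sum(counts[word] for word in words)
        for label, words in dictionaries.items()
    }


def label_tweet(
    tweet: str, dictionaries: Dict[str, Set[str]], min_hits: int = 1
) -> Optional[str]:
    """Return the best-scoring class, or None if no class reaches min_hits."""
    scores = score_tweet(tweet, dictionaries)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Returning `None` for tweets with no dictionary hits keeps general chatter out of the labeled set instead of forcing it into an arbitrary class.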

Furthermore, you can label a subset of tweets this way and use cosine similarity to cross-identify the remaining tweets. This will help with the optimization part. But then again, it's up to you, as you know what trade-offs you can bear.
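One way to read the cosine similarity suggestion is as a nearest-neighbour lookup against the tweets that are already labeled. The sketch below uses plain bag-of-words counts. The `threshold` value and the names `bag_of_words` and `nearest_label` are illustrative assumptions, and a TF-IDF weighting would be a natural refinement.

```python
import math
import re
from collections import Counter
from typing import Iterable, Optional, Tuple


def bag_of_words(text: str) -> Counter:
    """Term-frequency vector of a text, as a Counter of lowercase tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_label(
    tweet: str, labeled: Iterable[Tuple[str, str]], threshold: float = 0.2
) -> Optional[str]:
    """Label of the most similar labeled tweet, or None if nothing reaches the threshold."""
    vector = bag_of_words(tweet)
    best_label, best_similarity = None, 0.0
    for text, label in labeled:
        similarity = cosine_similarity(vector, bag_of_words(text))
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label if best_similarity >= threshold else None
```

Here `labeled` would be the (tweet, class) pairs produced by the dictionary step, so any mistakes made there carry over. Raising `threshold` trades coverage for precision.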

The real struggle will be the machine learning part and how you manage it.
