Twitter sentiment analysis techniques

Question

I'm working on a Twitter sentiment analysis project, and there are a few things I'm unsure about.

Since tweets are extremely short (at most 140 characters), which text analysis techniques apply best? For example, does stemming work as well as it does on, say, long articles?

What about n-grams? Does the shortness of tweets make them a better or a worse fit?

Would k-nearest neighbors be more accurate than part-of-speech tagging?

Will my custom Twitter dataset become irrelevant or corrupt as time goes by? Twitter and the information on it change so fast that this is also a major concern for me.

Thank you very much for your time.

PS: Do you know of any good Twitter sentiment dataset? It would be great if it is updated regularly.

Recommended answer

I did some classwork analyzing celebrities' tweets and comparing their similarities.

The biggest issue, as you figured, is the length of a tweet. At 140 characters, a lot of words are shortened or written in unusual "txt-speak". So even a well-known stemmer such as Porter is going to give some odd results. It worked best to keep almost everything and normalize only after computing word counts, vectors, etc.
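
To illustrate the "keep almost everything, normalize later" idea, here is a minimal sketch using only the Python standard library. The token pattern and the function names (`tokenize`, `count_tokens`, `normalize_counts`) are my own assumptions for illustration, not what I used in the classwork; a real project would likely need a richer tokenizer.

```python
import re
from collections import Counter

# Hypothetical token pattern: keeps URLs, @mentions, #hashtags,
# a few common emoticons, and plain words (including "txt-speak" like gr8).
TOKEN_RE = re.compile(
    r"https?://\S+"        # URLs
    r"|[@#]\w+"            # mentions and hashtags
    r"|[:;=][-']?[()DPp]"  # a handful of simple emoticons
    r"|\w+(?:'\w+)?"       # words, optionally with an apostrophe
)

def tokenize(tweet):
    """Split a tweet into tokens without stemming or lowercasing."""
    return TOKEN_RE.findall(tweet)

def count_tokens(tweets):
    """Count raw tokens first, so no information is lost up front."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts

def normalize_counts(counts):
    """Merge case variants only after counting; URLs are left untouched."""
    merged = Counter()
    for token, n in counts.items():
        key = token if token.startswith("http") else token.lower()
        merged[key] += n
    return merged
```

For example, `normalize_counts(count_tokens(["LOL gr8", "lol GR8 lol"]))` merges the case variants of "lol" and "gr8" while the raw counts are still available if you need them.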

For extrapolating from the words, n-grams and following links are big factors in inference quality. I could only tolerate the space and time requirements of up to 4-grams, but even creating simple 2-grams gave a large improvement.
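
Building n-gram features from an already tokenized tweet can be sketched as follows. The helper names `ngrams` and `ngram_features` are hypothetical, and the choice to combine 1-grams up to `max_n`-grams into one feature bag is an assumption about how you might use them.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, max_n=2):
    """Count every 1-gram up to max_n-gram in a single feature bag."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features
```

With 2-grams, a tweet like "not good at all" yields the pair `("not", "good")`, which carries sentiment that the single words lose. The feature space grows quickly with `max_n`, which is where the space and time cost comes from.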

You may have noticed that I said "almost everything" earlier. Because I was following only popular celebrities, I ran into the problem that a lot of their tweets were links or shout-outs to their events, sponsors, etc. So a big part of the work was removing the large number of duplicate, spam-like tweets.
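
One simple way to drop such near-duplicates is to compare tweets after masking the parts that tend to vary between copies. This is only a sketch under my own assumptions: the `<url>` and `<user>` placeholders and the function names are hypothetical, and it catches only copies that differ in links, mentions, case, or whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def dedup_key(tweet):
    """Mask URLs and mentions, lowercase, and collapse whitespace."""
    text = URL_RE.sub("<url>", tweet.lower())
    text = MENTION_RE.sub("<user>", text)
    return " ".join(text.split())

def drop_near_duplicates(tweets):
    """Keep the first tweet seen for each normalized key, in order."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = dedup_key(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept
```

Promotional tweets that repeat the same text with a different shortened link collapse to one entry, while the original text of the kept tweet is preserved for later feature extraction.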

As for methods to extract accurate sentiment, or whatever measure you're looking for, I would first try naive Bayes based methods. They are simple and relatively accurate for a baseline. K-means will do fairly well, but remember that it does not take variances and covariances into account; nonetheless, it is another baseline to try.
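
A naive Bayes baseline is small enough to write by hand. The sketch below is a multinomial variant with additive (Laplace) smoothing over token lists; the class name `NaiveBayes`, the `alpha` parameter, and the choice to ignore tokens never seen in training are assumptions made for illustration, and a library implementation would be the more practical choice.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token lists (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # additive smoothing strength
        self.class_counts = Counter()             # number of documents per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # assumption: skip tokens unseen in training
                count = self.token_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is `NaiveBayes().fit(token_lists, labels).predict(tokens)`. The token lists can hold single words or n-gram tuples, since the model only needs hashable features.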

Hope that provides some insight.
