Sentiment analysis for Twitter in Python


Question

I'm looking for an open-source implementation, preferably in Python, of textual sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such an open-source implementation I could use?

I'm writing an application that searches Twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets. I'm using Google's App Engine, so it's in Python. I'd like to be able to classify the returned search results from Twitter, and I'd like to do that in Python. I haven't been able to find such a sentiment analyzer so far, specifically not in Python. Are you familiar with such an open-source implementation I could use? Preferably it's already in Python, but if not, hopefully I can translate it to Python.

Note, the texts I'm analyzing are VERY short, they are tweets. So ideally, this classifier is optimized for such short texts.

BTW, twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately, the classification provided by them isn't that great, so I figured I might give this a try myself.
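One common workaround for the weak built-in classification is to use those same ":)" / ":(" search operators only to *collect* noisily labeled training tweets, stripping the emoticons from the text so a classifier has to learn from the remaining words. A minimal sketch (the `label_tweet` helper and the emoticon lists are illustrative assumptions, not part of the original post):

```python
import re

# Emoticons used as noisy labels; they are stripped from the text so the
# classifier cannot simply memorize them.
HAPPY = (":)", ":-)", ":D")
SAD = (":(", ":-(")

def label_tweet(text):
    """Return (cleaned_text, label), or None if the tweet is unlabeled or ambiguous."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy == has_sad:          # neither emoticon, or contradictory ones
        return None
    for e in HAPPY + SAD:
        text = text.replace(e, " ")
    cleaned = re.sub(r"\s+", " ", text).strip()
    return cleaned, ("happy" if has_happy else "sad")

print(label_tweet("just watched a great video on youtube :)"))
# → ('just watched a great video on youtube', 'happy')
```

Labels produced this way are noisy (sarcasm, mixed emotions), but over enough tweets they give a usable seed set without hand-annotation.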

Thanks!

BTW, an early demo is here and the code I have so far is here and I'd love to opensource it with any interested developer.

Answer

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non-commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that easily cooperate with it.

You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. This has been used for classification tasks in various areas of natural language processing. You also have a pick of a number of different models. I'd suggest starting with Maximum Entropy classification so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.
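To make the "code one up" suggestion concrete, here is a toy multinomial Naive Bayes classifier over bag-of-words counts with add-one (Laplace) smoothing. This is a from-scratch illustration of the technique, not the TADM or NLTK implementation, and the training data shown is invented:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def train(self, examples):
        # examples: list of (list_of_tokens, label)
        self.label_counts = Counter(label for _, label in examples)
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.vocab = set()
        for tokens, label in examples:
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.label_counts.values())

    def classify(self, tokens):
        best_label, best_score = None, float("-inf")
        v = len(self.vocab)
        for label, count in self.label_counts.items():
            # log P(label) + sum over tokens of log P(word | label)
            score = math.log(count / self.total)
            denom = sum(self.word_counts[label].values()) + v
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_data = [
    ("love this video".split(), "happy"),
    ("great song so fun".split(), "happy"),
    ("hate this awful video".split(), "sad"),
    ("so sad and boring".split(), "sad"),
]
nb = NaiveBayes()
nb.train(train_data)
print(nb.classify("love this song".split()))   # → happy
```

Working through why "love" and "song" pull the score toward "happy" here is a good way to build the intuition before moving on to maximum entropy models.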

The University of Texas at Austin computational linguistics group has held classes where most of the projects coming out of them have used this great tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.

Another great tool which works in the same vein is Mallet. The difference with Mallet is that there's a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical tools, but it's mostly meant for pedagogical purposes, and isn't really something I'd put into production.

Good luck with your task. The really difficult part will probably be the amount of knowledge engineering required up front for you to classify the "seed set" off of which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs. sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run ten-fold cross-validation or leave-one-out tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.
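The ten-fold evaluation mentioned above can be sketched generically. This version works with any classifier factory; the `Majority` baseline used in the demo is a hypothetical stand-in for a real model:

```python
import random
from collections import Counter

def cross_validate(examples, make_classifier, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    examples        -- list of (features, label) pairs
    make_classifier -- callable: training examples -> object with .classify(features)
    """
    data = examples[:]
    random.Random(seed).shuffle(data)          # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]     # k roughly equal folds
    correct = total = 0
    for i in range(k):
        test_fold = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        clf = make_classifier(train_set)
        for features, label in test_fold:
            correct += clf.classify(features) == label
            total += 1
    return correct / total

class Majority:
    """Baseline: always predict the most frequent training label."""
    def __init__(self, train_set):
        self.label = Counter(label for _, label in train_set).most_common(1)[0][0]
    def classify(self, features):
        return self.label

data = [("x", "happy")] * 9 + [("y", "sad")]
print(cross_validate(data, Majority, k=10))    # → 0.9
```

A baseline like `Majority` is also a useful sanity check: a real classifier should comfortably beat the majority-class accuracy on the held-out folds.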
