Sentiment Analysis on LARGE collection of online conversation text


Problem description

The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).

The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (a like/dislike sort of deal) from each user for any of the products they discussed at some point.

So, what I want to know is:

1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?

2) How do I determine a specific user's sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?

As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.

P.S. I have permission to analyze this data (in case it matters).

Recommended answer

Training any classifier requires a training set of labeled data and a feature extractor to obtain a feature set for each text. Once you have a trained classifier, you can apply it to previously unseen (unlabeled) text and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
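To make the pieces concrete, here is a minimal sketch of that workflow using NLTK's built-in Naive Bayes classifier. The six training sentences are made up for illustration, and `extract_features` and `classify_post` are hypothetical helper names, not part of NLTK; a real training set would need to be far larger.

```python
from nltk import NaiveBayesClassifier


def extract_features(text):
    """Bag-of-words feature extractor: mark each lowercased token as present."""
    return {f"contains({word})": True for word in text.lower().split()}


# Tiny hand-labeled training set (invented examples, for illustration only).
labeled_posts = [
    ("I love this phone", "pos"),
    ("great battery and great screen", "pos"),
    ("this is the best purchase", "pos"),
    ("I hate this phone", "neg"),
    ("terrible battery and awful screen", "neg"),
    ("this is the worst purchase", "neg"),
]

# NLTK classifiers train on a list of (feature_dict, label) pairs.
train_set = [(extract_features(text), label) for text, label in labeled_posts]
classifier = NaiveBayesClassifier.train(train_set)


def classify_post(text):
    """Apply the trained classifier to a previously unseen post."""
    return classifier.classify(extract_features(text))
```

Calling `classify_post` on each post pulled from the database would then give one label per post, which could be aggregated per user and thread.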

If you are interested in building a classifier for positive/negative sentiment using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (e.g. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant, but many studies have had good results simply using unigrams or bigrams (individual words or pairs of words, respectively).
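As a rough sketch of the distant supervision idea, the snippet below derives noisy labels from emoticons and builds unigram and bigram features. The emoticon lists and all function names are assumptions chosen for illustration; they are not taken from any particular study or library. It uses only the standard library and produces (features, label) pairs in the shape NLTK classifiers expect.

```python
# Emoticons used as noisy labels (an illustrative, incomplete list).
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def noisy_label(text):
    """Return 'pos' or 'neg' from emoticons, or None if absent or mixed."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or conflicting emoticons
    return "pos" if has_pos else "neg"


def strip_emoticons(text):
    """Remove the emoticons so the classifier cannot just memorize the label."""
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, " ")
    return text


def ngram_features(text):
    """Unigram and bigram presence features; bigrams keep pairs like 'not happy'."""
    tokens = strip_emoticons(text).lower().split()
    features = {f"uni({t})": True for t in tokens}
    features.update({f"bi({a} {b})": True for a, b in zip(tokens, tokens[1:])})
    return features


def build_training_set(posts):
    """Keep only posts with an unambiguous emoticon label."""
    training = []
    for post in posts:
        label = noisy_label(post)
        if label is not None:
            training.append((ngram_features(post), label))
    return training
```

Posts without an emoticon are skipped here; they are the ones you would later classify with the trained model.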

All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.

I think this study by Go et al. is one of the easiest to understand. You can also read other studies on distant supervision, distant supervision for sentiment analysis, and sentiment analysis in general.

There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.), but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class, but it's really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.

For determining the topic, many have used simple keywords, but there are some more complex methods available.
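One crude way to try the simple-keyword route is to count the most frequent non-stopword tokens across all posts in a thread. This is only a sketch under assumptions: the stopword set is a tiny hand-picked sample, and `thread_keywords` is a hypothetical helper. In practice you would probably match the result against a list of known product names.

```python
import re
from collections import Counter

# A deliberately small stopword sample; a real list would be much longer.
STOPWORDS = {
    "the", "and", "for", "that", "this", "with", "was", "are", "but",
    "not", "you", "have", "any", "last", "its", "has", "had", "just",
}


def thread_keywords(posts, top_n=3):
    """Return the top_n most frequent non-stopword tokens in a thread."""
    counts = Counter()
    for post in posts:
        words = re.findall(r"[a-z0-9']+", post.lower())
        # Skip stopwords and very short tokens such as "is" or "my".
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Example thread about a made-up product name.
example_thread = [
    "The Widget3000 is great",
    "I returned my Widget3000 last week",
    "Is the widget3000 battery any good?",
]
top_keyword = thread_keywords(example_thread, top_n=1)
```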

