Is it possible to guess a user's mood based on the structure of text?

Question

I assume a natural language processor would need to be used to parse the text itself, but what suggestions do you have for an algorithm to detect a user's mood based on text that they have written? I doubt it would be very accurate, but I'm still interested nonetheless.

I am by no means an expert on linguistics or natural language processing, so I apologize if this question is too general or stupid.

Answer

This is the basis of an area of natural language processing called sentiment analysis. Although your question is general, it's certainly not stupid - this sort of research is done by Amazon on the text of product reviews, for example.

If you are serious about this, then a simple version could be achieved by -

  1. Acquire a corpus of positive/negative sentiment. If this is a professional project you may want to take some time and manually annotate a corpus yourself, but if you are in a hurry or just want to experiment at first then I'd suggest looking at the sentiment polarity corpus from Bo Pang and Lillian Lee's research. The issue with using that corpus is that it is not tailored to your domain (specifically, the corpus uses movie reviews), but it should still be applicable.
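As a sketch of this step, a loader for that corpus might look like the following. It assumes the `pos`/`neg` directory layout the Pang and Lee download uses; the demo builds a tiny stand-in directory tree rather than the real 2000 reviews:

```python
import os
import tempfile

def load_polarity_corpus(root):
    """Read every review under root/pos and root/neg, returning the
    review texts alongside their polarity labels."""
    texts, labels = [], []
    for label in ("pos", "neg"):
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

# Stand-in directory tree (the real corpus has 1000 reviews per class).
demo_root = tempfile.mkdtemp()
for label, review in [("pos", "a wonderful film"), ("neg", "a dreadful film")]:
    os.makedirs(os.path.join(demo_root, label))
    with open(os.path.join(demo_root, label, "cv000.txt"), "w", encoding="utf-8") as f:
        f.write(review)

texts, labels = load_polarity_corpus(demo_root)
```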

  2. Split your dataset into sentences labelled either Positive or Negative. For the sentiment polarity corpus you could split each review into its component sentences and then apply the overall sentiment polarity tag (positive or negative) to all of those sentences. Split this corpus into two parts - 90% for training, 10% for testing. If you're using Weka then it can handle the splitting of the corpus for you.
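In code, this step might look like the following plain-Python sketch (Weka or a similar toolkit can do the held-out split for you, and the sentence splitting here is deliberately naive - see the note on sentence boundary detection further down):

```python
import random
import re

def to_labeled_sentences(reviews):
    """Propagate each review's polarity label down to its sentences.
    `reviews` is a list of (text, label) pairs."""
    pairs = []
    for text, label in reviews:
        # Naive sentence boundary detection: split on ., ! or ?
        for sent in re.split(r"[.!?]+", text):
            sent = sent.strip()
            if sent:
                pairs.append((sent, label))
    return pairs

def train_test_split(pairs, test_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out `test_fraction` for testing."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return pairs[:cut], pairs[cut:]

reviews = [("Great plot. Loved it!", "pos"),
           ("Dull script. I fell asleep.", "neg")]
sentences = to_labeled_sentences(reviews)
train, test = train_test_split(sentences, test_fraction=0.25)
```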

  3. Apply a machine learning algorithm (such as SVM, Naive Bayes, or Maximum Entropy) to the training corpus at the word level. This is called a bag-of-words model, which simply represents a sentence as the words it is composed of. This is the same model many spam filters run on. For a nice introduction to machine learning algorithms there is an application called Weka that implements a range of these algorithms and gives you a GUI to play with them. You can then measure the performance of the learned model from the errors it makes when classifying your test corpus.
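To make the bag-of-words idea concrete, here is a from-scratch multinomial Naive Bayes sketch with add-one smoothing. In practice you would reach for Weka or a library rather than hand-rolling this; it is shown only to illustrate the model:

```python
import math
from collections import Counter, defaultdict

def tokenize(sentence):
    return sentence.lower().split()

class NaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words features,
    with add-one (Laplace) smoothing."""

    def fit(self, pairs):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.class_counts = Counter()            # label -> document count
        for sentence, label in pairs:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        best, best_lp = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            # Log prior plus smoothed log likelihood of each word.
            lp = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokenize(sentence):
                count = self.word_counts[label][w] + 1
                lp += math.log(count / (total_words + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

train = [("i loved this movie", "pos"),
         ("wonderful acting and a great story", "pos"),
         ("i hated this movie", "neg"),
         ("dull plot and terrible acting", "neg")]
model = NaiveBayes().fit(train)
```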

  4. Apply this trained model to your user posts. For each user post, separate the post into sentences and then classify them using your model.
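This last step is mostly plumbing: split each post into sentences and run each sentence through whatever classifier came out of training. The keyword lookup below is a crude stand-in for a trained model, included purely so the example runs end to end:

```python
import re

def classify_post(post, classify_sentence):
    """Split a user post into sentences and classify each one.
    `classify_sentence` stands in for whatever trained model you use."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", post) if s.strip()]
    return [(s, classify_sentence(s)) for s in sentences]

# Placeholder classifier: a crude keyword lookup, not a real model.
POSITIVE = {"love", "loved", "great", "wonderful"}
NEGATIVE = {"hate", "hated", "dull", "terrible"}

def toy_classifier(sentence):
    words = set(sentence.lower().split())
    return "pos" if len(words & POSITIVE) >= len(words & NEGATIVE) else "neg"

results = classify_post("I loved the intro. The ending was terrible!", toy_classifier)
```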

So yes, if you are serious about this then it is achievable - even without past experience in computational linguistics. It would be a fair amount of work, but good results can be achieved even with word-based models.

If you need more help feel free to contact me - I'm always happy to help others interested in NLP =]

Small notes -

  1. Merely splitting a segment of text into sentences is its own field of NLP - called sentence boundary detection. There are a number of tools, OSS or free, available to do this, but for your task a simple split on whitespace and punctuation should be fine.
  2. SVMlight is another machine learner to consider, and in fact their inductive SVM performs a similar task to what we're looking at - trying to classify which Reuters articles are about "corporate acquisitions" using 1000 positive and 1000 negative examples.
  3. Turning the sentences into features to classify over may take some work. In this model each word is a feature - this requires tokenizing the sentence, which means separating words and punctuation from each other. Another tip is to lowercase all the word tokens so that "I HATE you" and "I hate YOU" both end up being considered the same. With more data you could also test whether capitalization helps in classifying whether someone is angry, but I believe words alone should be sufficient at least for an initial effort.
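A tokenizer along the lines of note 3 might look like this regex sketch (real NLP toolkits ship more careful tokenizers, but this covers the lowercasing and word/punctuation separation described above):

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and separate word tokens from punctuation,
    so 'I HATE you' and 'I hate YOU' yield identical word tokens."""
    return re.findall(r"[a-z']+|[.,!?;]", sentence.lower())

print(tokenize("I HATE you!"))  # ['i', 'hate', 'you', '!']
```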

Edit

I just discovered that LingPipe in fact has a tutorial on sentiment analysis using the Bo Pang and Lillian Lee sentiment polarity corpus I was talking about. If you use Java that may be an excellent tool, and even if not it goes through all of the steps I discussed above.
