How to train the Stanford NLP Sentiment Analysis tool

Question

Hello everyone! I'm using the Stanford CoreNLP package and my goal is to perform sentiment analysis on a live stream of tweets.

Using the sentiment analysis tool as is returns a very poor analysis of a text's "attitude": many positives are labeled neutral, and many negatives are rated positive. I've gone ahead and acquired well over a million tweets in a text file, but I haven't a clue how to actually train the tool and create my own model.

Link to the Stanford sentiment analysis page

"Models can be retrained using the following command using the PTB format dataset:"

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Sample from dev.txt (the leading 4 is the sentiment label for the whole sentence; labels run from 0 to 4, so 4 is the most positive)

(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) (3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))

Sample from test.txt

(3 (3 (2 If) (3 (2 you) (3 (2 sometimes) (2 (2 like) (3 (2 to) (3 (3 (2 go) (2 (2 to) (2 (2 the) (2 movies)))) (3 (2 to) (3 (2 have) (4 fun))))))))) (2 (2 ,) (2 (2 Wasabi) (3 (3 (2 is) (2 (2 a) (2 (3 good) (2 (2 place) (2 (2 to) (2 start)))))) (2 .)))))

Sample from train.txt

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

I also have two questions.

What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

How would I train my own model with a raw, unparsed text file full of tweets?

I'm very new to NLP so if I am missing any required information or anything at all please critique! Thank you!

Recommended answer


What is the significance and difference between each file? Train.txt/Dev.txt/Test.txt ?

This is standard machine learning terminology. The train set is used to (surprise surprise) train a model. The development set is used to tune any parameters the model might have. What you would normally do is pick a parameter value, train a model on the training set, and then check how well the trained model does on the development set. You then pick another parameter value and repeat. This procedure helps you find reasonable parameter values for your model.

Once this is done, you proceed to test how well the model does on the test set. This data is unseen: your model has never encountered any of it before. It is important that the test set is kept separate from the training and development sets, otherwise you are effectively evaluating a model on data it has seen before. That would be wrong, because it would not tell you how well the model really does.
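To make the three roles concrete, here is a minimal sketch in Python of cutting one labelled dataset into disjoint train, dev and test portions. It is not part of Stanford CoreNLP; the function name split_dataset, the 80/10/10 proportions and the fixed seed are assumptions chosen only for illustration.

```python
import random

def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into disjoint train/dev/test lists."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Hypothetical stand-ins for one annotated tree per line.
examples = [f"tree {i}" for i in range(100)]
train, dev, test = split_dataset(examples)
print(len(train), len(dev), len(test))
```

Each portion would then be written to its own file (train.txt, dev.txt, test.txt), and the test portion would be left untouched until parameter tuning on the dev portion is finished.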


How would I train my own model with a raw, unparsed text file full of tweets?

You can't and you shouldn't train using an unparsed set of documents. The entire point of the recursive deep model (and the reason it performs so well) is that it can learn from the sentiment annotations at every level of the parse tree. The sentence you have given above can be formatted like this:


(4 
    (4 
        (2 A) 
        (4 
            (3 (3 warm) (2 ,)) (3 funny)
        )
    ) 
    (3 
        (2 ,) 
        (3 
            (4 (4 engaging) (2 film)) (2 .)
        )
    )
)
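The nested format can also be read mechanically. The following is a rough sketch in plain Python, not code from Stanford CoreNLP: the helper names parse_tree, words and phrases are made up for illustration, and it assumes every node has the form (label word) or (label child child ...), with no literal parentheses inside words.

```python
def parse_tree(text):
    """Parse '(label word)' / '(label child ...)' into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = int(tokens[pos])  # sentiment label of this phrase
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested phrase
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()

def words(tree):
    """Yield the words under a node, left to right."""
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from words(child)
        else:
            yield child

def phrases(tree):
    """Yield (label, phrase text) for every node, root first."""
    yield tree[0], " ".join(words(tree))
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from phrases(child)

sample = ("(4 (4 (2 A) (4 (3 (3 warm) (2 ,)) (3 funny))) "
          "(3 (2 ,) (3 (4 (4 engaging) (2 film)) (2 .))))")
for label, phrase in phrases(parse_tree(sample)):
    print(label, phrase)
```

Running it on the sample sentence lists one label per phrase, which shows how much more annotation a single training sentence carries than a single sentence-level score.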

Usually, a sentiment analyser is trained with document-level annotations. You only have one score, and this score applies to the document as a whole, ignoring the fact that the phrases in the document may express different sentiment. The Stanford team put a lot of effort into annotating every phrase in the document for sentiment. For example, the word "film" on its own is neutral in sentiment: (2 film). However, the phrase "engaging film" is very positive: (4 (4 engaging) (2 film)).

If you have labelled tweets, you can use any other document-level sentiment classifier. The sentiment-analysis tag on Stack Overflow already has some very good answers, so I'm not going to repeat them here.
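As one possible illustration of a document-level classifier, here is a very small bag-of-words Naive Bayes sketch in plain Python. It is not the Stanford model and not a recommendation from the answer above: the function names train_naive_bayes and predict, the whitespace tokenisation, the add-one smoothing and the four toy tweets are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: iterable of (text, label) pairs, one label per document."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in text.lower().split():
            if token in model["vocab"]:  # skip words never seen in training
                # Add-one smoothing avoids log(0) for words unseen in this class.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data; real training needs many labelled tweets.
toy_tweets = [
    ("i love this phone", "positive"),
    ("what a great day", "positive"),
    ("i hate waiting in line", "negative"),
    ("this is a terrible movie", "negative"),
]
model = train_naive_bayes(toy_tweets)
print(predict(model, "great phone"))
print(predict(model, "terrible line"))
```

Note that each training example carries exactly one label for the whole tweet, which is the document-level setting described above, in contrast to the per-phrase labels in the tree format.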

PS Did you label the tweets you have? All 1 million of them? If you did, I'd like to pay you a lot of money for that file :)
