Questions about creating Stanford CoreNLP training models

Question

I've been working with Stanford's CoreNLP to perform sentiment analysis on some data I have, and I'm working on creating a training model. I know we can create a training model with the following command:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

I know what goes in the train.txt file. You score sentences and put them in train.txt, something like this: (0 (2 Today) (0 (0 (2 is) (0 (2 a) (0 (0 bad) (2 day)))) (..)))

But I don't understand what goes in the dev.txt file. I read through this question several times to try to understand what goes in dev.txt, but it's still unclear to me. Also, scoring these sentences manually has become a pain. Is there a tool available that makes it easier? I'm worried that I've been using the wrong number of parentheses or making some other stupid mistake like that.

Also, any suggestions on how long my train.txt file should be? I'm thinking of scoring 1000 sentences. Is that number too small or too large?

Thanks for all your help :)

Recommended answer

dev.txt should have the same format as train.txt, just with a different set of sentences. Note that the same sentence should not appear in both dev.txt and train.txt. The development set is used to evaluate the quality of the model you train on the training data.
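
As a rough illustration of the no-overlap rule, here is a minimal Python sketch that splits a list of labeled tree lines into disjoint train and dev sets. The function name `split_train_dev`, the 10% dev fraction, and the fixed seed are assumptions made for this example, not part of CoreNLP.

```python
import random


def split_train_dev(trees, dev_fraction=0.1, seed=0):
    """Split labeled tree lines into disjoint train and dev lists.

    dev_fraction and seed are illustrative defaults, not CoreNLP settings.
    """
    # Drop blank lines and exact duplicates so that no sentence can
    # end up in both output files.
    unique = list(dict.fromkeys(t.strip() for t in trees if t.strip()))
    random.Random(seed).shuffle(unique)
    # Keep at least one dev example when there is more than one tree.
    n_dev = max(1, int(len(unique) * dev_fraction)) if len(unique) > 1 else 0
    return unique[n_dev:], unique[:n_dev]
```

To use it, read all scored lines from one file, call `split_train_dev(lines)`, and write the two returned lists to train.txt and dev.txt, one tree per line.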

We don't distribute a tool for tagging sentiment data. This class could be helpful in building data: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html
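
For the worry about mismatched parentheses in hand-written trees, a small check can catch the most common slips before training. The sketch below is not part of CoreNLP. It assumes one tree per line, sentiment labels 0 through 4, and that literal parentheses in the text are escaped (for example as -LRB- and -RRB-). The function name `check_tree` is hypothetical.

```python
import re


def check_tree(line, labels=("0", "1", "2", "3", "4")):
    """Return a list of problems found in one bracketed tree line."""
    problems = []
    depth = 0
    for i, ch in enumerate(line):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at column {i + 1}")
                depth = 0  # reset so later errors are still reported
    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    # Every '(' should be followed directly by one of the allowed labels.
    for m in re.finditer(r"\(\s*([^\s()]*)", line):
        if m.group(1) not in labels:
            problems.append(f"bad label {m.group(1)!r} at column {m.start() + 1}")
    return problems
```

Running `check_tree` over every line of train.txt and dev.txt and printing any non-empty result, together with the line number, gives a quick list of trees to fix by hand.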

Here are the sizes of the train, dev, and test sets used for the sentiment model: train=8544, dev=1101, test=2210
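
Those sizes work out to roughly 72% train, 9% dev, and 19% test of 11855 sentences. As a hedged illustration, the sketch below scales the same proportions to a smaller labeled set. The helper `scale_split` is hypothetical, and it only shows the ratio, not whether a given total is enough data.

```python
def scale_split(total, reference=(8544, 1101, 2210)):
    """Scale the reference train/dev/test sizes down to `total` sentences."""
    ref_total = sum(reference)
    sizes = [round(total * n / ref_total) for n in reference]
    # Put any rounding drift into the train set so the sizes sum to total.
    sizes[0] += total - sum(sizes)
    return tuple(sizes)


print(scale_split(1000))  # (721, 93, 186)
```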
