Good dataset for sentiment analysis?

Views: 702 | Posted: 2017/4/2 12:32:17 | Tags: dataset, sentiment-analysis, web-mining


Question

I am working on sentiment analysis with the dataset at http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html and have split it 50:50: 50% of the samples are used for training and 50% for testing. I extract features from the training samples and classify with a Weka classifier, but my prediction accuracy is only about 70-75%.

Can anybody suggest other datasets that would help me improve this result? I have used unigrams, bigrams, and POS tags as my features.

Recommended answer

There are many sources of sentiment analysis datasets:

Huge n-grams dataset from Google: storage.googleapis.com/books/ngrams/books/datasetsv2.html
http://www.sananalytics.com/lab/twitter-sentiment/
http://inclass.kaggle.com/c/si650winter11/data
http://nlp.stanford.edu/sentiment/treebank.html

Or you can look into this general-purpose ML dataset repository: https://archive.ics.uci.edu/ml

That said, another dataset will not necessarily give you better accuracy on your current one, because the corpus may be very different from yours. Apart from reducing the testing percentage relative to training, you could test other classifiers, or fine-tune all hyperparameters using a semi-automated wrapper such as CVParameterSelection or GridSearch, or even Auto-WEKA if it fits.
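As a rough illustration of the general idea behind such wrappers (score every candidate setting by cross-validation on the training data and keep the best), here is a minimal sketch in plain Python. It is not the Weka API: the toy one-feature k-nearest-neighbour classifier, the synthetic data, and all function names (`k_fold_indices`, `knn_predict`, `cv_accuracy`, `grid_search`) are assumptions made up for this example.

```python
import itertools
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Toy classifier: majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(data, params, folds):
    """Cross-validated accuracy: each fold is held out once and predicted
    from the remaining folds."""
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        correct += sum(
            knn_predict(train, data[i][0], **params) == data[i][1]
            for i in held_out
        )
    return correct / len(data)


def grid_search(data, grid, n_folds=5):
    """Try every combination in `grid`; return (best_params, best_score)."""
    folds = k_fold_indices(len(data), n_folds)
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(data, params, folds)
        if best is None or score > best[1]:
            best = (params, score)
    return best


# Synthetic (feature, label) pairs standing in for real sentiment features.
rng = random.Random(42)
data = [(rng.gauss(1.0 if label else -1.0, 1.0), label) for label in [0, 1] * 100]

best_params, best_score = grid_search(data, {"k": [1, 5, 15]})
print(best_params, round(best_score, 3))
```

The same folds are reused for every candidate so that the scores are comparable. The search should only ever see training data; the final test set stays untouched until the chosen setting is evaluated once.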

It is quite rare to use 50/50; 80/20 is a much more common ratio. A better practice is to use 60% for training, 20% for cross validation, and 20% for testing.
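A minimal sketch of the 60/20/20 split in plain Python follows. The function name `split_60_20_20` and the placeholder review data are assumptions for illustration only; in Weka the equivalent would be done with its own filters or evaluation options.

```python
import random


def split_60_20_20(samples, seed=0):
    """Shuffle a copy of `samples` and split it into
    60% training, 20% validation, and 20% test portions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical data: 1000 (review_text, label) pairs.
reviews = [(f"review {i}", i % 2) for i in range(1000)]
train, val, test = split_60_20_20(reviews)
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the source file is ordered by label or by product category; a fixed seed keeps the split reproducible between runs.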
