Good dataset for sentiment analysis?
Question
I am working on sentiment analysis, using the dataset given at this link: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
I have split the dataset 50:50, with 50% used as test samples and 50% as training samples. I extract features from the training samples and classify with a Weka classifier, but my prediction accuracy is only about 70-75%.
Can anybody suggest other datasets that could help me improve the results? I have used unigrams, bigrams and POS tags as my features.
Recommended answer
There are many sources for sentiment analysis datasets:
- The huge ngrams dataset from Google: storage.googleapis.com/books/ngrams/books/datasetsv2.html
- http://www.sananalytics.com/lab/twitter-sentiment/
- http://inclass.kaggle.com/c/si650winter11/data
- http://nlp.stanford.edu/sentiment/treebank.html
- Or you can look into this general machine learning dataset repository: https://archive.ics.uci.edu/ml
That said, a different dataset will not necessarily give you better accuracy on your current dataset, because the new corpus might be very different from yours. Apart from reducing the test share relative to the training share, you could try other classifiers, or fine-tune the hyperparameters with a semi-automated wrapper such as CVParameterSelection or GridSearch, or even Auto-WEKA if it fits.
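To make the tuning idea concrete, here is a minimal sketch of grid search with k-fold cross-validation in plain Python. It is only a stand-in for what wrappers like CVParameterSelection or GridSearch do inside Weka, not their actual API. The toy word-count classifier, the `min_count` hyperparameter, the sample documents and every function name here are assumptions made up for illustration.

```python
import itertools
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def fit(docs, labels, min_count=1):
    """Toy classifier (hypothetical): per-word score = #pos - #neg occurrences.

    Words seen fewer than min_count times are dropped from the model.
    """
    weights, totals = Counter(), Counter()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            weights[word] += 1 if label == "pos" else -1
            totals[word] += 1
    return {w: s for w, s in weights.items() if totals[w] >= min_count}


def predict(model, doc):
    """Sum the word scores; a non-negative total is labelled positive."""
    score = sum(model.get(w, 0) for w in doc.split())
    return "pos" if score >= 0 else "neg"


def grid_search_cv(samples, labels, param_grid, fit, predict, k=5):
    """Return (best_params, best_mean_accuracy) over every grid combination."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for train, val in k_fold_indices(len(samples), k):
            model = fit([samples[i] for i in train],
                        [labels[i] for i in train], **params)
            hits = sum(predict(model, samples[i]) == labels[i] for i in val)
            scores.append(hits / len(val))
        score = mean(scores)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Made-up sample data, only to exercise the functions above.
docs = [
    "good great film", "great acting good plot", "good fun movie",
    "great fun", "good solid great cast",
    "bad awful plot", "awful boring film", "bad boring movie",
    "awful acting bad", "boring bad cast",
]
labels = ["pos"] * 5 + ["neg"] * 5

best_params, best_score = grid_search_cv(
    docs, labels, {"min_count": [1, 2, 3]}, fit, predict, k=5
)
print(best_params, round(best_score, 2))
```

The point is that every hyperparameter combination is scored on held-out folds of the training data only, so the test set is not touched until the final evaluation.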
A 50/50 split is quite rare; 80/20 is a much more common ratio. A better practice is to use 60% for training, 20% for validation (for example, hyperparameter tuning) and 20% for testing.
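As a rough illustration of the 60/20/20 idea, here is a small standard-library Python sketch that shuffles the samples and cuts them into three parts. The function name `split_60_20_20` and the fixed seed are my own choices, and the split is not stratified by class, which you may want for an unbalanced sentiment corpus.

```python
import random


def split_60_20_20(samples, seed=42):
    """Shuffle a copy of samples and return (train, validation, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Placeholder "reviews", only to show the resulting sizes.
reviews = [f"review {i}" for i in range(100)]
train, val, test = split_60_20_20(reviews)
print(len(train), len(val), len(test))
```

Tune on the validation part and keep the test part for a single final accuracy estimate.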