Positives/negatives proportion in train set


Question

I'm trying to get the Rocchio algorithm for relevance feedback to work. I have a query and some documents marked as positive or negative; for example, 60 positives and 337 negatives. I want to train my model (in this case, adjust the query) on one part of this dataset and test it on the other part. With a dataset this imbalanced, though, I'm not sure how many negatives and how many positives to put in the training set.
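For reference, a minimal sketch of the textbook Rocchio update that "adjusting the query" refers to, assuming documents and the query are equal-length lists of term weights. The function name `rocchio`, the toy vectors, and the weights `alpha=1.0`, `beta=0.75`, `gamma=0.15` are illustrative assumptions (commonly quoted defaults), not values from the question. Note that in this formulation each class is averaged into a centroid first, so the raw counts of positives and negatives do not directly scale the update.

```python
def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the adjusted query vector (textbook Rocchio formulation).

    query and every document are equal-length lists of term weights.
    alpha, beta, gamma are assumed defaults, not values from the question.
    """
    def centroid(docs):
        # Mean vector of a document set; a zero vector if the set is empty.
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]

    pos_c = centroid(positives)
    neg_c = centroid(negatives)
    adjusted = [alpha * q + beta * p - gamma * n
                for q, p, n in zip(query, pos_c, neg_c)]
    # Negative term weights are commonly clipped to zero.
    return [max(0.0, w) for w in adjusted]


# Toy 3-term vocabulary, purely illustrative.
query = [1.0, 0.0, 0.0]
positives = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
negatives = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, positives, negatives)
print(new_query)
```

Because the centroids are means, adding more negatives mainly changes how well the negative centroid is estimated; the relative influence of each class is set by `beta` and `gamma`.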

Another problem is that, depending on the proportion of positives to negatives in the test dataset, I get misleading Precision, Recall and F1-score results. With 49 positives and 17 negatives in the test dataset I get Precision=0.742, Recall=1.000 and F1=0.852, with TP=49, FP=17, TN=0, FN=0.
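The figures above can be reproduced from the confusion counts with a short sketch like the following. The helper name `precision_recall_f1` is hypothetical, and the balanced-accuracy line is an addition for illustration only: it is one common way to expose a classifier that labels everything positive, not something the question itself computes.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Counts reported in the question.
tp, fp, tn, fn = 49, 17, 0, 0
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Specificity (true-negative rate) and balanced accuracy also look at the
# negative class, which precision/recall/F1 on the positive class do not.
specificity = tn / (tn + fp) if tn + fp else 0.0
balanced_accuracy = (recall + specificity) / 2

print(round(precision, 3), round(recall, 3), round(f1, 3))
print(specificity, balanced_accuracy)
```

With TN=0 the specificity is 0, so the balanced accuracy is 0.5 even though F1 looks high; the high F1 mostly reflects that 49 of the 66 test documents are positive.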

The distribution of positive/negative proportions for other queries doesn't give me any hint about which proportion to choose for my model.

So what I'm asking for is some advice on working with imbalanced datasets to get correct results.

Thanks in advance, and sorry for such a noob(-ish?) question :-)

Recommended answer

First of all, I think your algorithm will have a hard time generalizing from such a small number of examples (this also depends on the number of features, of course).

Secondly, I don't think it is a very good idea to work with an imbalanced dataset. It seems that your algorithm hasn't learned anything, since its output is always "positive" (your test results have TN=0 and FN=0). This means that on a balanced dataset you would get 50% accuracy. Not too good... If you cannot find a larger dataset, I would suggest that you split yours as follows:

  • Training set (45 positives / 45 negatives)
  • Test set (15 positives / 15 negatives)
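One way to produce such a split is to randomly downsample each class to 60 documents and then cut 45/15, as in the sketch below. The function name `balanced_split`, the placeholder document IDs, and the fixed seed are assumptions for illustration; the answer does not say how the examples should be chosen.

```python
import random


def balanced_split(positives, negatives, n_train=45, n_test=15, seed=0):
    """Sample n_train + n_test items per class and split them, without overlap."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    per_class = n_train + n_test
    if len(positives) < per_class or len(negatives) < per_class:
        raise ValueError("not enough examples in one of the classes")
    pos = rng.sample(positives, per_class)
    neg = rng.sample(negatives, per_class)
    # Label 1 = positive (relevant), 0 = negative (non-relevant).
    train = [(d, 1) for d in pos[:n_train]] + [(d, 0) for d in neg[:n_train]]
    test = [(d, 1) for d in pos[n_train:]] + [(d, 0) for d in neg[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder document IDs matching the counts in the question.
positives = [f"pos_{i}" for i in range(60)]
negatives = [f"neg_{i}" for i in range(337)]
train, test = balanced_split(positives, negatives)
print(len(train), len(test))
```

This uses all 60 positives but only 60 of the 337 negatives, so the remaining negatives go unused; repeating the split with different seeds would show how much the results depend on which negatives were drawn.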

Anyway, I am still a student, so this is just what I think; it would be good if a more experienced user could confirm or refute it.

Hope this helps!
