SKLearn cross-validation: How to pass info on fold examples to my scorer function?


Question

I'm trying to build a custom scorer function for cross-validating my (binary classification) model in scikit-learn (Python).

Some examples of my original test data:

Source   Feature1   Feature2   Feature3
 123        0.1        0.2        0.3
 123        0.4        0.5        0.6
 456        0.7        0.8        0.9

Suppose that any fold may contain multiple test examples from the same source...

Then, for each set of examples sharing the same source, I'd like my custom scorer to decide that the "winner" is the example for which the model outputs the higher probability. In other words, there can be only one correct prediction per source, but if my model claims multiple test examples are "correct" (label = 1), I want only the one with the highest probability to be matched against the truth in my scorer.

My problem is that the scorer function requires the signature:

score_func(y_true, y_pred, **kwargs)

where y_true and y_pred contain the probability/label only.

However, what I really need is:

score_func(y_true_with_source, y_pred_with_source, **kwargs)

so I can group the y_pred_with_source examples by their source and choose the winner to match against that of the y_true_with_source truth. Then I can carry on to calculate my precision, for example.

Is there a way I can pass in this information in some way? Maybe the examples' indices?

Solution

It sounds like you have a learning-to-rank problem here. You are trying to find the highest-ranked instance out of each group of instances. Learning-to-rank isn't directly supported in scikit-learn right now - scikit-learn pretty much assumes i.i.d. instances - so you'll have to do some extra work.

I think my first suggestion is to drop down a level in the API and use the cross-validation iterators. That would just generate indices for training and validation folds. You would subset your data with those indices and call fit and predict on the subsets, with Source removed, and then score it using the Source column.
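The iterator approach might look like the following minimal sketch with toy data; it assumes the features live in a pandas DataFrame alongside the Source column, and the LogisticRegression choice is just a placeholder:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data: five repeats of three sources, random features, alternating labels.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'Source':   [123, 123, 456, 456, 789, 789] * 5,
    'Feature1': rng.rand(30),
    'Feature2': rng.rand(30),
    'Feature3': rng.rand(30),
})
y = np.tile([0, 1], 15)

features = ['Feature1', 'Feature2', 'Feature3']
scores = []
for train_idx, test_idx in KFold(n_splits=3).split(df):
    # Fit and predict on the feature columns only; Source is held aside.
    clf = LogisticRegression().fit(df.iloc[train_idx][features], y[train_idx])
    proba = clf.predict_proba(df.iloc[test_idx][features])[:, 1]
    fold = df.iloc[test_idx].assign(proba=proba, truth=y[test_idx])
    # Per source, the "winner" is the highest-probability example;
    # count it as correct when its true label is 1.
    winners = fold.loc[fold.groupby('Source')['proba'].idxmax()]
    scores.append(float((winners['truth'] == 1).mean()))

print(scores)
```

Each fold score here is the fraction of sources whose winner truly had label 1, which you can adapt to whatever precision-style metric you need.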

You can probably hack it into the cross_val_score approach, but it's trickier. In scikit-learn there is a distinction between the score function, which is what you showed above, and the scoring object (which can be a function) taken by cross_val_score. The scoring object is a callable with the signature scorer(estimator, X, y). It looks to me like you can define a scoring object that works for your metric. You just have to remove the Source column before sending data to the estimator, and then use that column when computing your metric. If you go this route, I think you will have to wrap the classifier, too, so that its fit method skips the Source column.
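One way to sketch that scorer-object route, assuming the Source ids are carried as the first column of X so the scorer can see them (the wrapper and function names here are illustrative, not a scikit-learn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

class SourceAwareClassifier(BaseEstimator, ClassifierMixin):
    """Wraps a classifier so fit/predict_proba skip the Source column (column 0)."""
    def fit(self, X, y):
        self.clf_ = LogisticRegression().fit(X[:, 1:], y)
        return self

    def predict_proba(self, X):
        return self.clf_.predict_proba(X[:, 1:])

def source_aware_scorer(estimator, X, y):
    # The scorer *does* see column 0: group by source, pick the
    # highest-probability example per source, check its label is 1.
    source = X[:, 0]
    proba = estimator.predict_proba(X)[:, 1]
    hits = []
    for s in np.unique(source):
        mask = source == s
        hits.append(y[mask][np.argmax(proba[mask])] == 1)
    return float(np.mean(hits))

# Toy data: column 0 is the source id, columns 1-3 are features.
rng = np.random.RandomState(0)
X = np.column_stack([rng.randint(0, 5, 60), rng.rand(60, 3)])
y = np.tile([0, 1], 30)
scores = cross_val_score(SourceAwareClassifier(), X, y, cv=3,
                         scoring=source_aware_scorer)
print(scores)
```

Smuggling the source id through X like this is a hack, but it keeps everything inside cross_val_score's signature constraints.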

Hope that helps... Good luck!
