What's wrong here? Accidentally referencing an existing instance instead of making a new one

Problem description

I'm an R user looking to get more comfortable with Python. I wrote a kind of mini-API that makes it easy to compare different statistical models fitted to the same data, in such a way that I can pre-set all the model hyperparameters and then iterate over the different models in order to fit them.

This is the essence of what I want to do:

  1. Build a wrapper class Classifier around a Scikit-learn Pipeline, in turn built on one of Scikit-learn's built-in estimators, e.g. RandomForestClassifier
  2. Create a dictionary of these un-fitted Classifiers, and a different dictionary of parameters to loop over
  3. Iterate over both dictionaries, have each un-fitted Classifier generate a new instance of the underlying Pipeline, fit it using its Pipeline.fit method, and save the new, fitted Pipeline in a different dictionary

However, it seems that, instead of generating a new instance of the Pipeline, in each iteration, the same instance of the Pipeline (or maybe the underlying estimator) is being refitted. This is a problem because the Pipeline.fit method modifies the Pipeline (and underlying estimator) in place, so the fitted results from the previous iterations are all overwritten by the fitted results from the final iteration.

The problem is that I can't figure out where this "parent instance" is being created and how it's being referenced.

The basic setup with a reproducible example of the problem is in this Gist (it's a little too long to just copy and paste here). I added a print statement at the end to illustrate the issue.

Sorry if this is a little vague, but I'm not having an easy time describing it. Hopefully the issue is clear from the example.

Solution

The problem is that results['0']['rf'] and results['1']['rf'] are in fact the same object. Therefore, when you fit the pipeline in your loop:

results = dict()
for k in features.keys():
    results[k] = dict()
    for m in classifiers.keys():
        print(len(features[k]))
        results[k][m] = classifiers[m].fit(features[k], 'species', iris)

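Each call to classifiers[m].fit operates on the single Classifier instance stored in the dictionary, so you are re-fitting an already fitted pipeline and overwriting your previous work.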
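The aliasing can be reproduced without scikit-learn. The sketch below uses a hypothetical stand-in class, ToyModel, whose fit method modifies the object in place and returns self, which is the convention scikit-learn estimators follow. The real Classifier in the Gist may differ in detail.

```python
class ToyModel:
    """Hypothetical stand-in for an estimator whose fit() mutates and returns self."""

    def __init__(self):
        self.n_rows_ = None  # populated by fit()

    def fit(self, rows):
        self.n_rows_ = len(rows)  # in-place modification
        return self


model = ToyModel()  # created once, outside the loop
features = {'0': [1, 2, 3], '1': [1, 2, 3, 4, 5]}

results = {}
for k in features:
    results[k] = model.fit(features[k])  # stores a reference, not a copy

print(results['0'] is results['1'])  # True: both keys point at the same object
print(results['0'].n_rows_)          # 5: the first fit was overwritten
```

Storing the return value of fit does not snapshot anything; it only adds another name for the same object.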

To remedy this, you need to create a new instance of Classifier every time you fit it. One possible way to do this is to change your classifiers dictionary from one containing Classifier instances to one containing the arguments required to create a Classifier:

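classifiers = {
    # Each entry holds (estimator class, keyword arguments); nothing is instantiated yet.
    'rf': (RandomForestClassifier, {'n_estimators': 100, 'oob_score': True, 'bootstrap': True}),
    'ab': (AdaBoostClassifier, {'n_estimators': 50}),
}

Now, in your loop, unpack each tuple and create a separate Classifier instance for each combination. The parameter dictionary is expanded into keyword arguments with ** (this assumes the Classifier constructor accepts the estimator class followed by that estimator's keyword arguments; adjust the call if your constructor differs):

results = dict()
for k in features:
    results[k] = dict()
    for m in classifiers:
        print(len(features[k]))
        estimator, params = classifiers[m]
        # A new Classifier per (k, m) pair, so earlier fits are never overwritten.
        classifier = Classifier(estimator, **params)
        results[k][m] = classifier.fit(features[k], 'species', iris)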
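As a self-contained check of this idea, here is the same pattern with a hypothetical ToyModel standing in for Classifier (its constructor signature and the 'small'/'large' keys are assumptions made for illustration). The dictionary stores only a class and its keyword arguments, so every iteration builds a fresh object.

```python
class ToyModel:
    """Hypothetical stand-in; records its parameters and the size of the data it saw."""

    def __init__(self, **params):
        self.params = params
        self.n_rows_ = None

    def fit(self, rows):
        self.n_rows_ = len(rows)
        return self


# Recipes, not instances: (class, keyword arguments)
recipes = {'small': (ToyModel, {'depth': 2}), 'large': (ToyModel, {'depth': 8})}
features = {'0': [1, 2, 3], '1': [1, 2, 3, 4, 5]}

results = {}
for k in features:
    results[k] = {}
    for m in recipes:
        cls, params = recipes[m]
        results[k][m] = cls(**params).fit(features[k])  # new instance every time

print(results['0']['small'] is results['1']['small'])  # False
print(results['0']['small'].n_rows_, results['1']['small'].n_rows_)  # 3 5
```

Each fitted object now keeps its own state, at the cost of constructing one object per combination.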

Note that to iterate over the keys of a dictionary, one can simply write for key in dct:, as opposed to for key in dct.keys().
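A quick illustration of that equivalence, using a made-up dictionary:

```python
dct = {'rf': 100, 'ab': 50}

keys_implicit = [key for key in dct]         # iterating a dict yields its keys
keys_explicit = [key for key in dct.keys()]  # same keys, more typing

print(keys_implicit == keys_explicit)  # True

# When the values are needed as well, items() avoids a second lookup.
pairs = [(key, value) for key, value in dct.items()]
```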
