GridSearch on Model and Classifiers


Problem description

I just came across this example on Model Grid Selection here:

https://chrisalbon.com/machine_learning/model_selection/model_selection_using_grid_search/

Question:

The example reads

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Create a pipeline
pipe = Pipeline([('classifier', RandomForestClassifier())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [LogisticRegression()],
                 'classifier__penalty': ['l1', 'l2'],
                 'classifier__C': np.logspace(0, 4, 10)},
                {'classifier': [RandomForestClassifier()],
                 'classifier__n_estimators': [10, 100, 1000],
                 'classifier__max_features': [1, 2, 3]}]

As I understand the code, search_space contains the candidate classifiers and their parameters. However, I don't understand the purpose of the Pipeline, or why it contains RandomForestClassifier().

Background: In my desired workflow, I need to train a doc2vec model (gensim) in combination with 3 different classifiers. GridSearch should be applied to the parameters of both the doc2vec model and the classifiers. I would like to store the results in a table and save the best model, that is, the one with the highest accuracy.

Solution

A Pipeline chains a sequence of data transformation steps and ends with a final estimator, typically a classifier or regressor. For example, it can first convert text to numeric features with TfidfVectorizer and then train a classifier on those features.

pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                 ('classifier', RandomForestClassifier())])
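
To make the chaining concrete, here is a minimal sketch of that two-step pipeline being fitted and used. The toy texts and labels are invented for illustration, and `n_estimators=10` and `random_state=0` are arbitrary choices to keep the run small and repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented for illustration only.
texts = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                 ('classifier', RandomForestClassifier(n_estimators=10,
                                                       random_state=0))])

# fit() calls fit_transform on the vectorizer, then fits the classifier
# on the resulting TF-IDF matrix.
pipe.fit(texts, labels)

# predict() applies the already fitted vectorizer before the classifier.
predictions = pipe.predict(["great movie", "terrible movie"])
print(predictions)
```

The raw strings go in at one end and class labels come out at the other, so the vectorizer never has to be called by hand.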

If there is only a single step (one estimator and no transformers), a Pipeline is not needed.
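
As a rough illustration of the no-Pipeline case, the sketch below tunes one classifier directly with GridSearchCV. The iris dataset, the grid values and `cv=3` are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Without a Pipeline, parameter names are the estimator's own names,
# with no 'step__' prefix.
param_grid = {'n_estimators': [10, 50], 'max_features': [1, 2]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```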

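In your code, the Pipeline serves as a placeholder. It gives the estimator the step name 'classifier', so hyperparameters can be addressed with the 'classifier__' prefix (for example 'classifier__C'), and the 'classifier' step itself can be replaced by any estimator listed in the search space. The RandomForestClassifier() passed to the Pipeline is only an initial value that the grid search overwrites.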
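
Putting this together, the following sketch runs the search over both classifiers and collects the results in a table, as asked in the question. It rests on several assumptions that are not in the original example: a synthetic binary dataset from `make_classification`, smaller grids so it runs quickly, `cv=3`, and `solver='liblinear'` on LogisticRegression, because the default solver in recent scikit-learn versions does not support the 'l1' penalty.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The estimator given here is only an initial value; the search replaces it.
pipe = Pipeline([('classifier', RandomForestClassifier())])

search_space = [
    # solver='liblinear' is an addition: it supports both 'l1' and 'l2'.
    {'classifier': [LogisticRegression(solver='liblinear')],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(0, 4, 3)},
    {'classifier': [RandomForestClassifier(random_state=0)],
     'classifier__n_estimators': [10, 50],
     'classifier__max_features': [1, 2, 3]},
]

search = GridSearchCV(pipe, search_space, cv=3, scoring='accuracy')
search.fit(X, y)

# One row per candidate (classifier plus hyperparameters), best rank first.
results = (pd.DataFrame(search.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results.head())

# best_estimator_ is the whole pipeline, refit on all data with the winner.
best_model = search.best_estimator_
print(type(best_model.named_steps['classifier']).__name__)
```

To keep the winning model, something like `joblib.dump(best_model, "best_model.joblib")` should work (the file name is hypothetical). A doc2vec step could be placed in front of the classifier in the same way, but only if it is wrapped in a scikit-learn compatible transformer, which this sketch does not cover.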
