sklearn管道的并行化 [英] Pararelization of sklearn Pipeline

查看:325
本文介绍了sklearn管道的并行化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组管道,并且希望拥有多线程体系结构.我的典型管道如下所示:

I have a set of Pipelines and want to have multi-threaded architecture. My typical Pipeline is shown below:

huber_pipe = Pipeline([
        ("DATA_CLEANER", DataCleaner()),
        ("DATA_ENCODING", Encoder(encoder_name='code')),
        ("SCALE", Normalizer()),
        ("FEATURE_SELECTION", huber_feature_selector),
        ("MODELLING", huber_model)
    ])

是否可以在不同的线程或内核中运行管道的步骤?

Is it possible to run the steps of the pipeline in different threads or cores?

推荐答案

通常,不会.

如果您查看sklearn阶段的界面,则方法为:

If you look at the interface for sklearn stages, the methods are of the form:

fit(X, y, other_stuff)

predict(X)

也就是说,它们适用于整个数据集,并且不能对数据流(或分块流)进行增量学习.

That is, they work on the entire dataset, and can't do incremental learning on streams (or chunked streams) of data.

此外,从根本上讲,某些算法不适合于此.考虑一下您的舞台

Moreover, fundamentally, some of the algorithms are not amenable to this. Consider for example your stage

("SCALE", Normalizer()),

大概这是使用均值和/或方差归一化的.没有看到整个数据集,它怎么知道这些?因此,它必须在操作之前等待整个输入,因此不能与之后的各个阶段并行运行.大多数(如果不是全部)阶段都是这样.

Presumably, this normalizes using mean and/or variance. Without seeing the entire dataset, how can it know these things? It must therefore wait for the entire input before operating, and hence can't be run in parallel with the stages after it. Most (if not nearly all) stages are like that.

但是,在某些情况下,您仍然可以将多核与sklearn一起使用.

However, in some cases, you still can use multicores with sklearn.

  1. 某些阶段具有 n_jobs参数.像这样的阶段相对于其他阶段按顺序使用,但是可以并行化其中的工作.

  1. Some stages have an n_jobs parameter. Stages like this use sequentially relative to other stages, but can parallelize the work within.

在某些情况下,您可以滚动自己自己(近似)其他阶段的并行版本.例如,在给定任何回归阶段的情况下,您可以将其包装在一个阶段中,该阶段将数据随机分成 n 个部分,并行学习各部分,然后输出所有回归器的平均值的回归器. YMMV.

In some cases you can roll your own (approximate) parallel versions of other stages. E.g., given any regressor stage, you can wrap it in a stage that randomly chunks your data into n parts, learns the parts in parallel, and outputs a regressor that is the average of all the regressors. YMMV.

这篇关于sklearn管道的并行化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆