Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?

Question

I'm using scikit-learn and numpy, and I want to set a global seed so that my work is reproducible.

Should I use numpy.random.seed or random.seed?

From the link in the comments, I understand that the two are different, and that the numpy version is not thread-safe. I specifically want to know which one to use when creating IPython notebooks for data analysis. Some scikit-learn algorithms generate random numbers, and I want to be sure that the notebook shows the same results on every run.

Recommended answer

Should I use np.random.seed or random.seed?

That depends on whether your code uses numpy's random number generator or the one in the standard-library random module.

The random number generators in numpy.random and random have completely separate internal states, so numpy.random.seed() does not affect the sequences produced by random.random(), and likewise random.seed() does not affect numpy.random.randn() and related functions. If your code uses both random and numpy.random, you need to seed each of them separately.
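The independence of the two generators can be checked with a minimal sketch like the one below. The variable names are illustrative only. It seeds both modules, draws from each, then reseeds only numpy and draws again.

```python
import random
import numpy as np

# Seed both generators and take a few draws from each.
random.seed(0)
np.random.seed(0)
py_first = [random.random() for _ in range(3)]
np_first = np.random.rand(3)

# Reseed ONLY numpy, then draw again from both.
np.random.seed(0)
py_second = [random.random() for _ in range(3)]
np_second = np.random.rand(3)

# numpy's sequence restarts from the seed; the stdlib sequence simply continues.
print(np.array_equal(np_first, np_second))  # True
print(py_first == py_second)                # False
```

Reseeding numpy has no effect on the standard-library generator, which is why each must be seeded on its own.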

Your question seems to be specifically about scikit-learn's random number generation. As far as I can tell, scikit-learn uses numpy.random throughout, so you should use np.random.seed() rather than random.seed().
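As a rough illustration, the sketch below uses sklearn.model_selection.train_test_split with random_state left at its default of None. It assumes that in this case scikit-learn draws from numpy's global generator. The array X is made-up data for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20)  # toy data, purely for illustration

# Seed numpy's global generator, then split without passing random_state.
np.random.seed(0)
train_a, test_a = train_test_split(X)

# Reseeding with the same value reproduces the same split.
np.random.seed(0)
train_b, test_b = train_test_split(X)

print(np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b))
```

In a notebook, this means calling np.random.seed() once near the top gives the same results on a fresh top-to-bottom run, provided the cells are executed in the same order.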

One important caveat is that np.random is not thread-safe, and its global state also causes trouble with multiple processes: if you set a global seed, then launch several subprocesses and generate random numbers within them using np.random, each subprocess will inherit the RNG state from its parent, so you will get identical random variates in every subprocess. The usual way around this problem is to pass a different seed (or a separate numpy.random.RandomState instance) to each subprocess, so that each one has its own local RNG state.
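The inheritance problem can be imitated without launching real processes. The sketch below is a simulation, not an actual fork: it copies the parent's state into two RandomState objects to stand in for two children, then contrasts that with giving each child its own (arbitrarily chosen) seed.

```python
import numpy as np

np.random.seed(0)
parent_state = np.random.get_state()

# Simulated "children" that both start from the parent's RNG state,
# as forked subprocesses would.
child_a = np.random.RandomState()
child_a.set_state(parent_state)
child_b = np.random.RandomState()
child_b.set_state(parent_state)

draws_a = child_a.rand(3)
draws_b = child_b.rand(3)
print(np.array_equal(draws_a, draws_b))   # True: identical variates

# Fix: each child gets its own seed, hence its own local RNG state.
fixed_a = np.random.RandomState(1).rand(3)
fixed_b = np.random.RandomState(2).rand(3)
print(np.array_equal(fixed_a, fixed_b))   # False
```

Whether real subprocesses inherit the parent's state depends on how they are started, so it is safer to pass seeds explicitly than to rely on platform behaviour.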

Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions accept either a seed or an np.random.RandomState instance (e.g. the random_state= parameter of sklearn.decomposition.MiniBatchSparsePCA). I tend to use a single global seed for a script, then generate new random seeds from it for any parallel functions.
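One way to sketch this pattern is shown below. GLOBAL_SEED, n_jobs and the job function are hypothetical names chosen for the example, and the jobs run sequentially here. In real code each derived seed would be handed to a parallel worker or passed as random_state= to an estimator.

```python
import numpy as np

GLOBAL_SEED = 1234   # hypothetical script-wide seed
n_jobs = 4           # hypothetical number of parallel tasks

# Seed once, then derive one integer seed per job from the global generator.
np.random.seed(GLOBAL_SEED)
job_seeds = np.random.randint(0, 2**31 - 1, size=n_jobs)

def job(random_state):
    # Hypothetical task: builds a local generator from its own seed,
    # so it never touches the global np.random state.
    rng = np.random.RandomState(random_state)
    return rng.normal(size=2)

# Sequential stand-in for a parallel map over the jobs.
results = [job(int(seed)) for seed in job_seeds]
```

Because the per-job seeds are themselves drawn from the seeded global generator, rerunning the script with the same GLOBAL_SEED reproduces every job's random stream while still keeping the streams different from one another.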
