Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?
Question
I'm using scikit-learn and numpy and I want to set the global seed so that my work is reproducible.
Should I use `numpy.random.seed` or `random.seed`?
From the link in the comments, I understand that they are different, and that the numpy version is not thread-safe. I want to know specifically which one to use to create IPython notebooks for data analysis. Some of the algorithms from scikit-learn involve generating random numbers, and I want to be sure that the notebook shows the same results on every run.
Answer
Should I use `np.random.seed` or `random.seed`?

That depends on whether your code uses numpy's random number generator or the one in Python's built-in `random` module.
The random number generators in `numpy.random` and `random` have totally separate internal states, so `numpy.random.seed()` will not affect the random sequences produced by `random.random()`, and likewise `random.seed()` will not affect `numpy.random.randn()` etc. If you are using both `random` and `numpy.random` in your code, then you will need to set the seeds for both separately.
Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses `numpy.random` throughout, so you should use `np.random.seed()` rather than `random.seed()`.
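In a notebook that means seeding once at the top, before any estimator is created or fit. A minimal sketch using plain numpy calls (an estimator left with its default `random_state=None` draws from this same global state):

```python
import numpy as np

# Seed the global numpy RNG once at the top of the notebook.
np.random.seed(42)
first_run = np.random.permutation(5)

# Restarting from the same seed reproduces the same "run" exactly.
np.random.seed(42)
second_run = np.random.permutation(5)

assert (first_run == second_run).all()
```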
One important caveat is that `np.random` is not thread-safe: if you set a global seed, then launch several subprocesses and generate random numbers within them using `np.random`, each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess. The usual way around this problem is to pass a different seed (or `numpy.random.RandomState` instance) to each subprocess, so that each one has a separate local RNG state.
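That fix can be sketched as follows; the worker is called in-process here for brevity, but in a real program its body would execute inside each subprocess.

```python
import numpy as np

def worker(seed):
    # Build a private RandomState from the seed handed to this worker,
    # rather than relying on the global state inherited from the parent.
    rng = np.random.RandomState(seed)
    return rng.rand(3)

# Distinct seeds give each worker its own stream ...
a = worker(1)
b = worker(2)
assert not np.allclose(a, b)
# ... and the same seed reproduces the same stream.
assert np.allclose(worker(1), a)
```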
Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions accept either a seed or an `np.random.RandomState` instance (e.g. the `random_state=` parameter to `sklearn.decomposition.MiniBatchSparsePCA`). I tend to use a single global seed for a script, then generate new random seeds based on the global seed for any parallel functions.