Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?


Question

I'm using scikit-learn and numpy, and I want to set a global seed so that my work is reproducible.

Should I use numpy.random.seed or random.seed?

From the link in the comments, I understand that they are different, and that the numpy version is not thread-safe. I want to know specifically which one to use when creating IPython notebooks for data analysis. Some of the algorithms in scikit-learn involve generating random numbers, and I want to be sure that the notebook shows the same results on every run.

Recommended answer

Should I use np.random.seed or random.seed?

That depends on whether your code uses numpy's random number generator or the one in Python's random module.

The random number generators in numpy.random and random have completely separate internal states, so numpy.random.seed() will not affect the sequences produced by random.random(), and likewise random.seed() will not affect numpy.random.randn() and related functions. If you use both random and numpy.random in your code, you need to set the seed for each of them separately.
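The separation between the two generators can be checked directly. The following is a minimal sketch; the helper name `seed_everything` is hypothetical and not part of either library.

```python
import random
import numpy as np

# Seed only NumPy's global generator, then draw from both modules.
np.random.seed(0)
np_first = np.random.rand(3)
py_first = [random.random() for _ in range(3)]

# Re-seed NumPy only: its sequence repeats, while Python's `random`
# simply continues from wherever its own state was.
np.random.seed(0)
np_second = np.random.rand(3)
py_second = [random.random() for _ in range(3)]

numpy_repeats = np.array_equal(np_first, np_second)
python_repeats = py_first == py_second


def seed_everything(seed):
    """Hypothetical helper: seed both global generators at once."""
    random.seed(seed)
    np.random.seed(seed)
```

Here `numpy_repeats` is true and `python_repeats` is false, because `np.random.seed(0)` never touched the state of `random`.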

Your question seems to be specifically about scikit-learn's random number generators. As far as I can tell, scikit-learn uses numpy.random throughout, so you should use np.random.seed() rather than random.seed().
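As an illustration, here is a small sketch using `train_test_split`, chosen only as an example of a scikit-learn function that draws random numbers. The toy array `X` and the seed value 42 are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy data, for illustration only

# With random_state left at its default (None), the split draws from
# NumPy's global generator, so seeding np.random makes it repeatable.
np.random.seed(42)
train_a, test_a = train_test_split(X)
np.random.seed(42)
train_b, test_b = train_test_split(X)
same_with_global_seed = np.array_equal(test_a, test_b)

# Passing random_state explicitly gives the same repeatability
# without relying on global state.
train_c, test_c = train_test_split(X, random_state=42)
train_d, test_d = train_test_split(X, random_state=42)
same_with_random_state = np.array_equal(test_c, test_d)
```

In a notebook, a single `np.random.seed(...)` call near the top is usually enough, provided the cells are always run in the same order.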

One important caveat is that np.random is not thread-safe. A related problem arises with subprocesses: if you set a global seed, then launch several subprocesses and generate random numbers within them using np.random, each subprocess will inherit the RNG state from its parent, meaning that you will get identical random variates in each subprocess. The usual way around this problem is to pass a different seed (or a numpy.random.RandomState instance) to each subprocess, so that each one has its own local RNG state.
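A minimal sketch of that workaround follows. The function name `worker` and the seed values are hypothetical; the workers are called in a plain loop here for simplicity, but the same function could be handed to a process pool.

```python
import numpy as np


def worker(seed, n=3):
    """Draw n uniform variates from a generator local to this call."""
    # A private RandomState does not read or modify np.random's global state.
    rng = np.random.RandomState(seed)
    return rng.rand(n)


seeds = [101, 102, 103]  # arbitrary, one distinct seed per worker
results = [worker(s) for s in seeds]
```

Because each call builds its own generator from its own seed, the workers produce different streams from one another, and each stream is reproducible from run to run.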

Since some parts of scikit-learn can run in parallel using joblib, you will see that some classes and functions let you pass either a seed or an np.random.RandomState instance (e.g. the random_state= parameter of sklearn.decomposition.MiniBatchSparsePCA). I tend to use a single global seed for a script, then generate new random seeds from the global seed for any parallel functions.
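One way to derive per-task seeds from a single global seed might look like the sketch below. The name `derive_seeds` and the constant `GLOBAL_SEED` are hypothetical, and this is only one possible scheme, not necessarily the one the answer's author uses.

```python
import numpy as np

GLOBAL_SEED = 12345  # arbitrary value, set once per script


def derive_seeds(global_seed, n_tasks):
    """Return n_tasks integer seeds determined entirely by global_seed."""
    master = np.random.RandomState(global_seed)
    # Draw integers in [0, 2**31 - 1), which are valid integer seeds.
    return [int(s) for s in master.randint(0, 2**31 - 1, size=n_tasks)]


task_seeds = derive_seeds(GLOBAL_SEED, 4)

# Each seed can then be passed as random_state= to an estimator, or used
# to build a local generator for one parallel task.
task_rngs = [np.random.RandomState(s) for s in task_seeds]
```

Changing `GLOBAL_SEED` changes every derived seed, while rerunning the script with the same value reproduces all of them.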
